[Roadmap] vLLM Development Roadmap: H2 2023 #244
Comments
Is quantized model support under development?
This would be very helpful @zhuohan123. Thank you very much for the state-of-the-art inference performance!
Can we get function calling, to match the OpenAI API, on the roadmap? Not entirely sure what the implementation for that looks like, but it's a very useful feature.
I have a prototype implementation of OpenAI-like function calling. It works well on advanced models (like Llama 2). Please let me know if this is something the team would consider taking in as part of vLLM.
@mondaychen Yes, how about you submit a PR?
@zhisbug OK! I'll polish my prototype and submit a PR.
Need to support Baichuan2
Here is an implementation of function calling with Hugging Face models, which could be helpful: https://local-llm-function-calling.readthedocs.io/en/latest/quickstart.html
Need to support Qwen-14B
Need to support Phi-1 and Phi-1.5
Possible to support CPU too? -yuan
Hey @zhouyuan @WoosukKwon, I'd like to get this new variant of concurrent LoRA serving added to the roadmap.
Are there any plans to support function calling like OpenAI? I know this task is complex, as parsing the LLM output will be custom for each fine-tuned model depending on the training data. However, perhaps it would be possible to add a module/function that you can inject into vLLM. For example, Functionary has copied some of vLLM and extended/customised it to support functions. In the future, when hopefully more open-source models with function-calling capabilities are released, it would be great if one did not have to clone a repository for each model, but instead the particular parsing was supported by vLLM. What are the thoughts on this matter? I wouldn't mind contributing to such a feature...
Also interested in this question.
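A minimal sketch of the kind of pluggable output parser suggested above, assuming the model has been fine-tuned to emit JSON tool calls. Everything here (`parse_json_tool_call`, `to_openai_response`) is hypothetical and not part of vLLM's API:

```python
import json
import re
from typing import Callable, Optional


# Hypothetical per-model parser: turn raw completion text into an
# OpenAI-style function_call dict. Nothing here is part of vLLM itself.
def parse_json_tool_call(completion: str) -> Optional[dict]:
    """Extract the first JSON object shaped like {"name": ..., "arguments": {...}}."""
    match = re.search(r"\{.*\}", completion, re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if "name" in payload and "arguments" in payload:
        return {"name": payload["name"], "arguments": json.dumps(payload["arguments"])}
    return None


def to_openai_response(completion: str,
                       parser: Callable[[str], Optional[dict]]) -> dict:
    """Wrap a raw completion in an OpenAI-style chat message, using the injected parser."""
    call = parser(completion)
    if call is not None:
        return {"role": "assistant", "content": None, "function_call": call}
    return {"role": "assistant", "content": completion}


# Example output from a model fine-tuned to emit JSON tool calls.
raw = 'Sure! {"name": "get_weather", "arguments": {"city": "Berlin"}}'
print(to_openai_response(raw, parse_json_tool_call))
```

The point of the injectable parser is that only this small function differs per fine-tune; the serving layer stays generic.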
We have deprecated this roadmap. Please find our latest roadmap in #2681.
We summarize the issues we received and our planned features in this issue. This issue will be kept updated.
Latest issue tracked: #677
Software Quality
- Installation: issues with the `Installation` label
- Documentation
- `forward` function. How integrate with hf with minial modification? #242

New Models
- Decoder-only models (see the loading sketch after this list)
- Encoder-decoder models
- Other techniques
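For the decoder-only model requests above (Baichuan2, Qwen-14B, Phi), loading a supported architecture uses the offline `LLM` API. A minimal sketch, assuming the chosen model is supported by the installed vLLM version; the checkpoint name is illustrative, and `trust_remote_code=True` is only needed for models that ship custom code:

```python
from vllm import LLM, SamplingParams

# Whether an architecture (Baichuan2, Qwen, Phi, ...) is supported depends on
# the installed vLLM version; the checkpoint name below is purely illustrative.
llm = LLM(model="Qwen/Qwen-14B", trust_remote_code=True)

outputs = llm.generate(["Give a one-sentence introduction to vLLM."],
                       SamplingParams(temperature=0.7, max_tokens=64))
print(outputs[0].outputs[0].text)
```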
Frontend Features
- vLLM demo frontends:
  - `prompt` as a `list` instead of `str` #186, Possibility of Passing Prompts as List[str] to AsyncEngine.generate() #279 (see the batch-generation sketch after this section)
  - `ChatCompletion` Endpoint Support: ChatCompletion Endpoint in OpenAI demo server #311 (see the client sketch after this section)
  - `logit_bias`: [Feature] Add support for `logit_bias` #379
  - I want use the function prefix_allowed_tokens_fn of huggingface model.generate(), where of vllm's source code shall I modify? #415
- Integration with other frontends:
  - `prompt` as a `list` instead of `str` #186
  - Langchain/LLAMA_INDEX #553
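For the `prompt`-as-a-`list` items (#186/#279): the offline `LLM.generate` API accepts a list of prompts and batches them internally (the async-engine variant requested in #279 is not shown here). A minimal sketch; the model name is illustrative:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")          # small model, illustrative
params = SamplingParams(temperature=0.8, max_tokens=32)

# Passing a list of prompts runs them as one batch; one RequestOutput per prompt.
prompts = [
    "The capital of France is",
    "vLLM is a library for",
    "PagedAttention works by",
]
for output in llm.generate(prompts, params):
    print(output.prompt, "->", output.outputs[0].text.strip())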
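For the `ChatCompletion` endpoint (#311) and `logit_bias` (#379) items, a client call against the OpenAI-compatible server might look like the sketch below. The server command, model name, and whether `logit_bias` is honored are assumptions that depend on the vLLM version:

```python
# Assumes an OpenAI-compatible server is already running, started with something like:
#   python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-chat-hf
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is PagedAttention?"},
        ],
        "max_tokens": 128,
        # OpenAI-style logit_bias maps token ids to biases; whether the server
        # honors it depends on the version (this is the request in #379).
        "logit_bias": {"50256": -100},
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```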
Engine Optimization and New Features
- Kernels
- 8-bit quantization support #214, Not able to used qlora models with vllm #252, 8bit support #295, support for quantized models? #316, Loading quantized models #392 (see the quantized-model sketch after this list)
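On the quantization items: later vLLM releases added support for pre-quantized AWQ checkpoints via the `quantization` argument; a minimal sketch under that assumption (the checkpoint name is illustrative, and bitsandbytes-style 8-bit loading as requested in #295 is a separate feature):

```python
from vllm import LLM, SamplingParams

# Assumes a vLLM version with AWQ support and a checkpoint that was
# quantized offline with AWQ; the model name is illustrative.
llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")

out = llm.generate(["Explain weight-only quantization in one sentence."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```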
Bugs
- Issues with the `Bug` label