Integrate vLLM #143
Are you planning to integrate the vLLM package for fast LLM inference and serving?
https://vllm.readthedocs.io/en/latest/

Comments
Yes, we definitely want to add a corresponding LMTP backend. However, we will wait until vLLM adds logit_bias support, which is crucial to make LMQL's constraining work. See the vLLM GitHub repository for progress on that.
I know it's on their roadmap, but am I wrong in thinking that this (https://github.com/vllm-project/vllm/blob/acbed3ef40f015fcf64460e629813922fab90380/vllm/model_executor/layers/sampler.py#L94C33-L94C33) is logit bias? It looks like it's already implemented there.
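For context, "logit bias" in the OpenAI-API sense is an additive per-token offset applied to the raw next-token logits before sampling; a large negative bias effectively masks a token out, which is what LMQL's constraining relies on. The sketch below only illustrates that concept and is not taken from the linked vLLM code:

```python
# Illustrative only: a logit bias is an additive per-token-id offset applied to
# the logits before sampling. A large negative value effectively forbids a token.
import torch

def apply_logit_bias(logits: torch.Tensor, bias: dict[int, float]) -> torch.Tensor:
    """Add a per-token-id offset to a 1D logits tensor of shape (vocab_size,)."""
    biased = logits.clone()
    for token_id, value in bias.items():
        biased[token_id] += value
    return biased

logits = torch.randn(32_000)              # fake vocabulary of 32k tokens
mask_bias = {42: -100.0, 7: -100.0}       # effectively forbid token ids 42 and 7
constrained = apply_logit_bias(logits, mask_bias)
```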
Any news regarding this? It would really help with non-OpenAI models.
I became aware of LMFE (lm-format-enforcer) today, from the LlamaIndex newsletter. It seems similar to LMQL in some ways, although it may take a different approach in ways I can't determine from a cursory look. I did notice that they have a way of integrating with vLLM (although it doesn't look like the cleanest approach). Not sure if it is applicable here: https://github.com/noamgat/lm-format-enforcer/blob/main/samples/colab_vllm_integration.ipynb
vLLM added support for logits processors in 0.2.2; see vllm-project/vllm#1469.
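For reference, a minimal sketch of how a per-step logit bias could be expressed with the logits_processors hook added in vLLM 0.2.2. The model name and bias values below are just placeholders:

```python
# A logits processor in vLLM receives the token ids generated so far and the
# logits for the next step, and must return (possibly modified) logits.
from vllm import LLM, SamplingParams

BIAS = {42: -100.0}  # placeholder: strongly discourage token id 42

def bias_processor(token_ids, logits):
    for tid, value in BIAS.items():
        logits[tid] += value
    return logits

llm = LLM(model="mistralai/Mistral-7B-v0.1")  # any HF model name works here
params = SamplingParams(max_tokens=32, logits_processors=[bias_processor])
outputs = llm.generate(["The answer is"], params)
print(outputs[0].outputs[0].text)
```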
Hi! vLLM could probably be integrated in different ways, e.g.:
1. As a client of an existing vLLM server via its native API (no extra server-side setup for LMQL, and third-party infrastructure could be used).
2. In the same process, using vLLM's offline inference engine from within an LMQL backend.
3. Through vLLM's OpenAI-compatible API.
As a user, I'm leaning towards 1, as it is clean and simple. What are your preferences? @lbeurerkellner, what are your thoughts? (Also pinging @reuank.)
I can see the appeal of 1 to the user (no extra server-side setup and the possibility of using third-party infrastructure), and we can definitely support it. However, 2 is the best option from our perspective, as we have some deep server-side improvements coming up that specifically optimize constrained decoding and LMQL programs, something vLLM is of course not focusing on. Running vLLM in the same process will allow a deeper and better-performing implementation. I only strongly oppose 3, as I think the OpenAI API is the most limiting for us, and not something I would want to invest further in with respect to the protocol etc.
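To make option 2 a bit more concrete, here is a rough, hypothetical sketch of an in-process integration that feeds per-step token masks into vLLM generation via a logits processor. The names LMQLVLLMBackend and next_token_mask are invented for illustration and are not part of LMQL's or vLLM's actual APIs:

```python
# Hypothetical sketch of option 2: run vLLM in the same process and translate
# a per-step "allowed tokens" mask into a vLLM logits processor.
from typing import Iterable
import torch
from vllm import LLM, SamplingParams

class LMQLVLLMBackend:  # made-up name, for illustration only
    def __init__(self, model_name: str):
        self.llm = LLM(model=model_name)

    def generate_constrained(self, prompt: str, next_token_mask, max_tokens: int = 64) -> str:
        # next_token_mask(generated_ids) -> iterable of allowed token ids for this step
        def processor(token_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
            allowed: Iterable[int] = next_token_mask(token_ids)
            mask = torch.full_like(logits, float("-inf"))
            mask[torch.tensor(list(allowed), device=logits.device)] = 0.0
            return logits + mask

        params = SamplingParams(max_tokens=max_tokens, logits_processors=[processor])
        return self.llm.generate([prompt], params)[0].outputs[0].text
```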
Thanks, that makes sense to me. Is it correct that implementing option 2 essentially involves the following steps?
That makes sense to me and would provide a clean path forward. Do we get all the benefits of vLLM when it's used in offline inference? PagedAttention, for example. I would otherwise just use the llama.cpp backend, but in a more GPU-rich environment with many concurrent users, vLLM outperforms it.
Good question. Going by the vLLM docstring, if I'm not mistaken, the answer is yes, and we'd get all the benefits with offline inference as well. (Experts, please correct me if I'm wrong.)
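For illustration, offline inference with vLLM's LLM class looks like the sketch below; it builds on the same underlying engine as the API server, so PagedAttention and batched scheduling apply here too. The model name is only a placeholder:

```python
# Minimal offline-inference usage of vLLM's LLM class (no HTTP server involved).
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-v0.1")
params = SamplingParams(temperature=0.8, max_tokens=64)

# A batch of prompts is scheduled by the engine's batching machinery.
outputs = llm.generate(["Hello, my name is", "The capital of France is"], params)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```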
As a user, I'd strongly prefer option 3, as that would allow me to seamlessly switch between OpenAI and vLLM in my application. It would also let me run a server for many uses, independent of LMQL, instead of having it all in one monolith.
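As a client-side illustration of option 3, assuming a vLLM OpenAI-compatible server is running locally (host, port, and model name below are just assumptions for the example):

```python
# vLLM can be started with its OpenAI-compatible server, e.g.
#   python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-v0.1
# after which any OpenAI-style client can talk to it by overriding the base URL.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="mistralai/Mistral-7B-v0.1",
    prompt="San Francisco is a",
    max_tokens=32,
)
print(completion.choices[0].text)
```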
Hey @lbeurerkellner,
I am not aware of anyone actively working on this, so feel free to go ahead :) |