
Integrate vLLM #143

Open
ChezzPlaya opened this issue Aug 1, 2023 · 13 comments
Labels
enhancement New feature or request

Comments

@ChezzPlaya

Are you planning to integrate the vLLM package for fast LLM inference and serving?

https://vllm.readthedocs.io/en/latest/

lbeurerkellner added the enhancement (New feature or request) label on Aug 1, 2023
@lbeurerkellner (Collaborator) commented Aug 1, 2023

Yes, we definitely want to add a corresponding LMTP backend. However, we will wait until vLLM adds logit_bias support, which is crucial to make LMQL's constraining work. See the vLLM GitHub for progress on that.
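
For intuition, constraining boils down to masking out tokens that would violate a constraint before the sampler picks the next token, which is exactly what logit_bias enables. A minimal illustrative sketch (not LMQL's actual implementation):

```python
import torch

def apply_logit_bias(logits: torch.Tensor, allowed_token_ids: list[int]) -> torch.Tensor:
    # Push every token outside the allowed set to -inf so the sampler
    # can only choose continuations that satisfy the constraint.
    bias = torch.full_like(logits, float("-inf"))
    bias[allowed_token_ids] = 0.0
    return logits + bias

# Toy vocabulary of 10 tokens; only tokens 2 and 7 are allowed next.
logits = torch.randn(10)
masked = apply_logit_bias(logits, allowed_token_ids=[2, 7])
next_token = torch.argmax(masked).item()  # guaranteed to be 2 or 7
```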

@benbot commented Oct 6, 2023

I know it's on their roadmap, but am I wrong in thinking that this (https://github.com/vllm-project/vllm/blob/acbed3ef40f015fcf64460e629813922fab90380/vllm/model_executor/layers/sampler.py#L94C33-L94C33) is logit bias?

It looks like it's already implemented in there.

@maximegmd

Any news regarding this? It would really help with non-OpenAI models.

@jhallas commented Nov 15, 2023

I became aware of LMFE (lm-format-enforcer) today, from the llama-index newsletter. It seems similar to LMQL in some ways, although the approach may differ in ways I can't determine from a cursory look. I did notice that they have a way of integrating with vLLM (although it doesn't look like the cleanest approach). Not sure if it's applicable here: https://github.com/noamgat/lm-format-enforcer/blob/main/samples/colab_vllm_integration.ipynb

@giorgiopiatti

vLLM added support for logits processors in 0.2.2, see vllm-project/vllm#1469.
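
That should be enough to hook constraint masks into vLLM. A rough, untested sketch of how it could be used (the model name and allowed-token set are placeholders; the processor signature assumed here is "(generated token ids, logits) -> logits"):

```python
import torch
from vllm import LLM, SamplingParams

ALLOWED_TOKEN_IDS = [2, 7, 42]  # placeholder constraint mask

def constraint_processor(generated_token_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
    # Called once per decoding step with the tokens generated so far and the
    # raw logits; everything outside the allowed set is masked out.
    mask = torch.full_like(logits, float("-inf"))
    mask[ALLOWED_TOKEN_IDS] = 0.0
    return logits + mask

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder model
params = SamplingParams(max_tokens=32, logits_processors=[constraint_processor])
outputs = llm.generate(["Pick a number:"], params)
print(outputs[0].outputs[0].text)
```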

@ggbetz commented Dec 14, 2023

Hi! vLLM could probably be integrated in different ways, e.g.:

  1. A vLLM client that connects to a vLLM server set up independently of LMQL (like lmtp_replicate_client)
  2. A vLLM backend that is part of LMTP model serving (like llama_cpp_model)
  3. Use the LMTP OpenAI interface to access vLLM's OpenAI-compatible proxy server (see #250, "Set up a proxy for OpenAI"; but logit_bias is not ready yet?)

As a user, I'm leaning towards 1, as it is clean and simple.

What are your preferences?

@lbeurerkellner , what are your thoughts?

(Also pinging: @reuank)

@lbeurerkellner (Collaborator)

I can see the appeal of 1 to the user (no extra server-side setup and the possibility of using third-party infrastructure), and we can definitely support it. However, 2 is the best option from our perspective, as we have some deep server-side improvements coming up that specifically optimize constrained decoding and LMQL programs, something vLLM is of course not focusing on. Running vLLM in the same process will allow a deeper and better-performing implementation.

I only strongly oppose 3, as I think the OpenAI API is the most limiting for us and not something I would want to invest further in with respect to the protocol, etc.

@ggbetz commented Dec 15, 2023

Thanks, that makes sense to me. Is it correct that implementing option 2 essentially involves the following steps?

  • Create a vllm_model module in models/lmtp/backend
  • Create a class VllmModel(LMTPModel) (in analogy to LlamaCppModel)
  • Implement its methods using vLLM offline inference (rough sketch below)
  • Register the model class
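
For concreteness, something along these lines, although the method names and signatures here are only guesses by analogy with the llama.cpp backend; the real LMTPModel interface (base class, arguments, registration) would need to be read from the existing llama_cpp_model:

```python
# Hypothetical skeleton of a models/lmtp/backends/vllm_model.py.
# The LMTP-facing interface is assumed, not taken from the LMQL codebase.
from vllm import LLM, SamplingParams

class VllmModel:  # would subclass LMTPModel in the real backend
    def __init__(self, model_identifier: str, **vllm_kwargs):
        # vLLM's offline-inference entry point (wraps LLMEngine).
        self.llm = LLM(model=model_identifier, **vllm_kwargs)

    def generate(self, prompts, sampling_args: dict, logits_processors=None):
        # Map LMTP decoding arguments onto vLLM's SamplingParams; the token
        # masks produced by LMQL's constraint engine would be passed through
        # as logits processors (supported since vLLM 0.2.2).
        params = SamplingParams(logits_processors=logits_processors or [], **sampling_args)
        return self.llm.generate(prompts, params)
```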

@wdhitchc

That makes sense to me and would provide a clean path forward. Do we get all the benefits of vLLM when it's used for offline inference? PagedAttention, for example. I would just use the llama.cpp backend, but in a more GPU-rich environment with many concurrent users, vLLM outperforms it.

@ggbetz commented Dec 16, 2023

Do we get all the benefits of vLLM when it's used for offline inference?

Good question. The vLLM docstring explains:

[LLMEngine] is the main class for the vLLM engine. It receives requests from clients and generates texts from the LLM. It includes a tokenizer, a language model (possibly distributed across multiple GPUs), and GPU memory space allocated for intermediate states (aka KV cache). This class utilizes iteration-level scheduling and efficient memory management to maximize the serving throughput.

The LLM class wraps this class for offline batched inference and the AsyncLLMEngine class wraps this class for online serving.

Accordingly, if I'm not mistaken, the answer is yes and we'd get all the benefits with offline inference. (Experts, please correct me if I'm wrong.)
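
For reference, a minimal offline batched-inference call through the LLM wrapper looks roughly like this (the model name is a placeholder); PagedAttention and the iteration-level scheduler are used underneath, just as in the online server:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder model
params = SamplingParams(temperature=0.8, max_tokens=64)

prompts = ["The capital of France is", "PagedAttention is"]
for output in llm.generate(prompts, params):
    print(output.prompt, "->", output.outputs[0].text)
```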

@jbohnslav

As a user, I'd strongly prefer option 3, as that would allow me to seamlessly switch between OpenAI and vLLM in my application. It would also let me run a server for many uses, independent of LMQL, instead of having it all in one monolith.

@reuank (Contributor) commented Jan 11, 2024

Hey @lbeurerkellner,
are you aware of anyone currently working on this?
Otherwise, I will have a look at the approach @ggbetz described (adding a new vLLM backend, similar to llama_cpp_model).

@lbeurerkellner (Collaborator)

I am not aware of anyone actively working on this, so feel free to go ahead :)
