
Integrate vLLM #143

Open
ChezzPlaya opened this issue Aug 1, 2023 · 13 comments
Labels
enhancement New feature or request

Comments

@ChezzPlaya

Are you planning to integrate the vLLM package for fast LLM inference and serving?

https://vllm.readthedocs.io/en/latest/

lbeurerkellner added the enhancement (New feature or request) label on Aug 1, 2023
@lbeurerkellner (Collaborator) commented Aug 1, 2023

Yes, we definitely want to add a corresponding LMTP backend. However, we will wait until vLLM adds logit_bias support, which is crucial to make LMQL's constraining work. See the vLLM GitHub for progress on that.
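
For intuition, constraining boils down to masking out tokens that would violate a constraint before the sampler picks the next token, which is exactly what logit_bias enables. A minimal illustrative sketch (not LMQL's actual implementation):

```python
import torch

def apply_logit_bias(logits: torch.Tensor, allowed_token_ids: list[int]) -> torch.Tensor:
    # Push every token outside the allowed set to -inf so the sampler
    # can only choose continuations that satisfy the constraint.
    bias = torch.full_like(logits, float("-inf"))
    bias[allowed_token_ids] = 0.0
    return logits + bias

# Toy vocabulary of 10 tokens; only tokens 2 and 7 are allowed next.
logits = torch.randn(10)
masked = apply_logit_bias(logits, allowed_token_ids=[2, 7])
next_token = torch.argmax(masked).item()  # guaranteed to be 2 or 7
```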

@benbot commented Oct 6, 2023

I know it's on their roadmap, but am I wrong in thinking that this (https://github.com/vllm-project/vllm/blob/acbed3ef40f015fcf64460e629813922fab90380/vllm/model_executor/layers/sampler.py#L94C33-L94C33) is logit bias?

It looks like it's already implemented in there.

@maximegmd

Any news regarding this? It would really help with non-OpenAI models.

@jhallas commented Nov 15, 2023

I became aware of LMFE (lm-format-enforcer) today, from the llama-index newsletter. It seems similar to LMQL in some ways, although the approach may differ in ways I can't determine from a cursory look. I did notice that they have a way of integrating with vLLM (although it doesn't look like the cleanest approach). Not sure if it's applicable here: https://github.com/noamgat/lm-format-enforcer/blob/main/samples/colab_vllm_integration.ipynb

@giorgiopiatti

vLLM added support for logits processors in 0.2.2, see vllm-project/vllm#1469.
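
That should be enough to hook constraint masks into vLLM. A rough, untested sketch of how it could be used (the model name and allowed-token set are placeholders; the processor signature assumed here is "(generated token ids, logits) -> logits"):

```python
import torch
from vllm import LLM, SamplingParams

ALLOWED_TOKEN_IDS = [2, 7, 42]  # placeholder constraint mask

def constraint_processor(generated_token_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
    # Called once per decoding step with the tokens generated so far and the
    # raw logits; everything outside the allowed set is masked out.
    mask = torch.full_like(logits, float("-inf"))
    mask[ALLOWED_TOKEN_IDS] = 0.0
    return logits + mask

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder model
params = SamplingParams(max_tokens=32, logits_processors=[constraint_processor])
outputs = llm.generate(["Pick a number:"], params)
print(outputs[0].outputs[0].text)
```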

@ggbetz commented Dec 14, 2023

Hi! vLLM could probably be integrated in different ways, e.g.:

  1. A vLLM client that connects to a vLLM server set up independently of LMQL (like lmtp_replicate_client)
  2. A vLLM backend that is part of LMTP model serving (like llama_cpp_model)
  3. Use the LMTP OpenAI interface to access vLLM's OpenAI-compatible proxy server (see #250, "Set up a proxy for OpenAI"; but logit_bias is not ready yet?)

As a user, I'm leaning towards 1, as it is clean and simple.

What are your preferences?

@lbeurerkellner , what are your thoughts?

(Also pinging: @reuank)

@lbeurerkellner (Collaborator)

I can see the appeal of 1 to the user (no extra server-side setup and the possibility of using third-party infrastructure), and we can definitely support it. However, 2 is the best option from our perspective, as we have some deep server-side improvements coming up that specifically optimize constrained decoding and LMQL programs, something vLLM is of course not focusing on. Running vLLM in the same process will allow a deeper and better-performing implementation.

I only strongly oppose 3, as I think the OpenAI API is the most limiting for us and not something I would want to invest further in with respect to the protocol, etc.

@ggbetz commented Dec 15, 2023

Thanks, that makes sense to me. Is it correct that implementing option 2 essentially involves the following steps?

  • Create a vllm_model module in models/lmtp/backend
  • Create a class VllmModel(LMTPModel) (in analogy to LlamaCppModel)
  • Implement its methods using vLLM offline inference (rough sketch below)
  • Register the model class
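
For concreteness, something along these lines, although the method names and signatures here are only guesses by analogy with the llama.cpp backend; the real LMTPModel interface (base class, arguments, registration) would need to be read from the existing llama_cpp_model:

```python
# Hypothetical skeleton of a models/lmtp/backends/vllm_model.py.
# The LMTP-facing interface is assumed, not taken from the LMQL codebase.
from vllm import LLM, SamplingParams

class VllmModel:  # would subclass LMTPModel in the real backend
    def __init__(self, model_identifier: str, **vllm_kwargs):
        # vLLM's offline-inference entry point (wraps LLMEngine).
        self.llm = LLM(model=model_identifier, **vllm_kwargs)

    def generate(self, prompts, sampling_args: dict, logits_processors=None):
        # Map LMTP decoding arguments onto vLLM's SamplingParams; the token
        # masks produced by LMQL's constraint engine would be passed through
        # as logits processors (supported since vLLM 0.2.2).
        params = SamplingParams(logits_processors=logits_processors or [], **sampling_args)
        return self.llm.generate(prompts, params)
```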

@wdhitchc

That makes sense to me and would provide a clean path forward. Do we get all the benefits of vLLM when it's used for offline inference? PagedAttention, for example. I would just use the llama.cpp backend, but in a more GPU-rich environment with many concurrent users, vLLM outperforms it.

@ggbetz commented Dec 16, 2023

Do we get all the benefits of vLLM when it's used for offline inference?

Good question. The vLLM docstring explains:

[LLMEngine] is the main class for the vLLM engine. It receives requests from clients and generates texts from the LLM. It includes a tokenizer, a language model (possibly distributed across multiple GPUs), and GPU memory space allocated for intermediate states (aka KV cache). This class utilizes iteration-level scheduling and efficient memory management to maximize the serving throughput.

The LLM class wraps this class for offline batched inference and the AsyncLLMEngine class wraps this class for online serving.

Accordingly, if I'm not mistaken, the answer is yes and we'd get all the benefits with offline inference. (Experts, please correct me if I'm wrong.)
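
For reference, a minimal offline batched-inference call through the LLM wrapper looks roughly like this (the model name is a placeholder); PagedAttention and the iteration-level scheduler are used underneath, just as in the online server:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder model
params = SamplingParams(temperature=0.8, max_tokens=64)

prompts = ["The capital of France is", "PagedAttention is"]
for output in llm.generate(prompts, params):
    print(output.prompt, "->", output.outputs[0].text)
```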

@jbohnslav

As a user, I'd strongly prefer option 3, as that would allow me to seamlessly switch between OpenAI and vLLM in my application. It would also let me run a server for many uses, independent of LMQL, instead of having it all in one monolith.

@reuank (Contributor) commented Jan 11, 2024

Hey @lbeurerkellner,
are you aware of anyone currently working on this?
Otherwise, I will have a look at the approach @ggbetz described (adding a new vLLM backend, similar to llama_cpp_model).

@lbeurerkellner (Collaborator)

I am not aware of anyone actively working on this, so feel free to go ahead :)
