Use `outlines.processors` and `SequenceGeneratorAdapter` for `outlines.models.vllm` #1053
if hasattr(self.model, "get_tokenizer"):
    tokenizer = self.model.get_tokenizer()
Heads up: for the `AsyncLLMEngine` (as shown in the outlines vLLM example), this will return a coroutine: https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py#L506. I'm trying to figure out the best path forward because I'd love to use this with my vLLM-based service, but it seems like this work is part of something bigger, so I don't want to dive in and start propagating `async` through this code without checking in with you first. Happy to contribute, but could use a little guidance on the strategy 😁
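For illustration, here is a minimal sketch of how the `hasattr` check above could tolerate both engine types. The helper name and the event-loop handling are assumptions for this example, not part of the PR:

```python
import asyncio
import inspect


def resolve_tokenizer(model):
    """Hypothetical helper: works whether get_tokenizer() is synchronous
    (vllm.LLM) or returns a coroutine (vllm.AsyncLLMEngine)."""
    result = model.get_tokenizer()
    if inspect.iscoroutine(result):
        # AsyncLLMEngine.get_tokenizer() must be awaited; this only works
        # when called outside an already-running event loop.
        result = asyncio.run(result)
    return result
```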
It's uncertain whether we will move towards async for `outlines.generate`, but it has been proposed in #655. Currently `outlines.generate` with `outlines.models.vllm` uses a `vllm.LLM`.

Bear in mind that `outlines.serve` already has a vLLM server integration, and vice versa: vLLM has an `outlines.processors` integration in progress.

Does `outlines.serve` or vLLM's outlines integration satisfy your needs, or were you thinking of something different?
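For context, a minimal sketch of the synchronous path described above, running `outlines.generate` on top of an offline `vllm.LLM`; the model name and regex are placeholders, and exact constructor options may vary by version:

```python
import outlines

# outlines.models.vllm builds a synchronous vllm.LLM under the hood.
model = outlines.models.vllm("facebook/opt-125m")

# Structured generation goes through a logits processor and, with this PR,
# the SequenceGeneratorAdapter.
generator = outlines.generate.regex(model, r"[0-9]{4}")
print(generator("The year the transistor was invented: "))
```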
Ahhh, thank you for the sanity check! After re-reviewing the `outlines.serve` code, I realized I didn't go deep enough and needed to pass my `engine.engine` (engines all the way down 🐢) to get all the way to the `vllm.LLM`. Thanks again for the pointers!
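A rough illustration of the "engines all the way down" chain; attribute names are taken from the discussion and vLLM at the time, so verify against your vLLM version:

```python
from vllm import AsyncEngineArgs, AsyncLLMEngine

# The async, server-facing engine...
async_engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="facebook/opt-125m")
)

# ...wraps an inner engine, which is what the current (pre-release) logits
# processor wiring expects to receive.
inner_engine = async_engine.engine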
No problem! Bear in mind that after our next major release (because of this PR), the tokenizer, not the engine, will be passed to the processor. `serve.py` has a PR to reflect this behavior: https://github.com/outlines-dev/outlines/pull/1061/files#diff-535a1da5f8addb89d07782185c32b54f85189b25786d1c9b7cbd002b55939e16R74
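A hedged sketch of what that post-release pattern might look like; the `RegexLogitsProcessor` and `TransformerTokenizer` names and import paths are assumptions based on the linked PR and may differ in the released API:

```python
from transformers import AutoTokenizer

from outlines.models.transformers import TransformerTokenizer  # assumed import path
from outlines.processors import RegexLogitsProcessor           # assumed import path

# The processor is constructed from the tokenizer alone; no engine required.
hf_tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
logits_processor = RegexLogitsProcessor(
    r"[0-9]{4}", TransformerTokenizer(hf_tokenizer)
)
```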
Noted! Will keep an eye out for that. Thanks again for everything; super excited for the awesome capabilities you all have enabled with `outlines`!
Fixes:
- `models.vllm` being the only model (other than `models.exllamav2`) using `SequenceGenerator`
Changes:
- `outlines.generate` handlers: default to `SequenceGeneratorAdapter` for all models except `ExLlamaV2Model`
- `OutlinesLogitsProcessors`: allow vLLM `input_ids`, which are of type `tuple` (see the sketch after this list)
- `FSMLogitsProcessor` bug: unable to handle batch sequences where prompt ids are part of `input_ids`. This wasn't caught previously because `model_llamacpp` cannot perform batch generation.
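An illustrative sketch of the tuple-handling change, not the actual `OutlinesLogitsProcessors` code; the class here is hypothetical:

```python
import torch


class ExampleLogitsProcessor:
    """Hypothetical processor illustrating normalization of vLLM's tuple input_ids."""

    def __call__(self, input_ids, logits: torch.Tensor) -> torch.Tensor:
        # vLLM hands the generated token ids over as a tuple, while other
        # backends pass lists or tensors; normalize to a tensor before use.
        ids = torch.as_tensor(input_ids)
        # ... use `ids` to decide which tokens are allowed, then mask `logits` ...
        return logits
```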
Tests:
- `tests/generate/test_generate.py`: add `model_vllm`
- Run `model_vllm` and `model_transformers_vision` only if CUDA is available (see the sketch after this list)
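A minimal sketch of a CUDA-gated test of the kind described above; the marker and test names are hypothetical, not the ones used in the test suite:

```python
import pytest
import torch

requires_cuda = pytest.mark.skipif(
    not torch.cuda.is_available(), reason="vLLM and vision models need a GPU"
)


@requires_cuda
def test_model_vllm_generate():
    # ... exercise outlines.generate with the vLLM model here ...
    ...
```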
Benchmarks:
- Regex logits processing performance has changed in an acceptable manner:
  - `torch`: `268±7μs` -> `225±1μs`
  - `numpy`: `160±0.9μs` -> `185±3μs`