[Model] Add user-configurable task for models that support both generation and embedding #9424
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
Mainly looking for review from @robertgshaw2-neuralmagic since you're also working on embedding models. Also @simon-mo or @njhill for signing off on this CLI/API change.
vllm/config.py (outdated)

@@ -33,14 +33,21 @@
_EMBEDDING_MODEL_MAX_NUM_BATCHED_TOKENS = 32768
_MULTIMODAL_MODEL_MAX_NUM_BATCHED_TOKENS = 5120

Task = Literal["generate", "embed"]
What are your thoughts on using enums instead of strings?
I would use strings unless there is a need to attach additional data for each task. Otherwise, we will have to spend additional effort converting between string and enum at the API level.
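To illustrate the trade-off being discussed: with a `Literal` type, CLI/API strings pass straight through, while an enum forces a conversion at the boundary. The validator functions below are hypothetical, for illustration only:

```python
from enum import Enum
from typing import Literal, get_args

# String-literal approach (as in this PR): CLI/API values pass
# straight through with no conversion step.
Task = Literal["generate", "embed"]

def validate_task(task: str) -> str:
    # Hypothetical validator; rejects anything outside the Literal.
    if task not in get_args(Task):
        raise ValueError(f"Unsupported task: {task!r}")
    return task

# Enum alternative raised above: gives a symbolic type, at the cost
# of converting to/from strings at the API boundary.
class TaskEnum(Enum):
    GENERATE = "generate"
    EMBED = "embed"
```

Unless extra data needs to hang off each task, the `Literal` keeps the API surface simpler.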
I think auto should default to generate, even if embed is also possible. We can emit a warning that this is ambiguous, but I'm wary of this growing to cover most models. It seems like it would break current flows for users of models like microsoft/Phi-3-vision-128k-instruct.
Ultimately my stance is that generation models should be first-class citizens, and the expansion to other tasks should be opt-in.
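The resolution policy proposed here could be sketched roughly as follows (function and variable names are hypothetical, not the PR's actual implementation):

```python
import warnings

def resolve_task(requested: str, supported: list[str]) -> str:
    """Illustrative sketch: prefer 'generate' when a model supports
    multiple tasks, warning about the ambiguity, so that generation
    stays the default and other tasks are opt-in."""
    if requested != "auto":
        if requested not in supported:
            raise ValueError(
                f"Model does not support task {requested!r}; "
                f"supported tasks: {supported}")
        return requested
    if len(supported) > 1 and "generate" in supported:
        warnings.warn(
            "Model supports multiple tasks; defaulting to 'generate'. "
            "Pass the task explicitly to silence this warning.")
        return "generate"
    return supported[0]
```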
Good point, I have updated this so that `auto` defaults to `generate`.
+1 on default to generate
vllm/entrypoints/llm.py (outdated)

*args: Never,
task: TaskOption = "auto",
hmmm this is a bigger change and a bit risky, can we revert this?
I have made the deprecation into a decorator (`vllm.utils.deprecate_args`) where the original args are passed directly through to this function. This should eliminate any chance of a breaking change.
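A decorator in that spirit might look like the sketch below. This is a hedged illustration, not the actual `vllm.utils` helper (`init_llm` is a stand-in for `LLM.__init__`): it warns when arguments beyond the first are passed positionally but still forwards everything unchanged, so existing call sites keep working.

```python
import functools
import inspect
import warnings

def deprecate_positional_args(func):
    """Warn when arguments other than the first are passed positionally,
    while forwarding all arguments unchanged (illustrative sketch)."""
    param_names = list(inspect.signature(func).parameters)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if len(args) > 1:
            deprecated = param_names[1:len(args)]
            warnings.warn(
                f"Passing {deprecated} positionally is deprecated; "
                "use keyword arguments instead.",
                DeprecationWarning,
                stacklevel=2)
        return func(*args, **kwargs)

    return wrapper

@deprecate_positional_args
def init_llm(model, task="auto", tokenizer=None):
    # Stand-in for LLM.__init__; returns its inputs for demonstration.
    return (model, task, tokenizer)
```

Old positional call sites such as `init_llm("facebook/opt-125m", "embed")` still work, they just raise a `DeprecationWarning`.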
@mgoin does this look good to you now?
Thanks for the update, this looks pretty good to me now.
One question: could we default `task = "auto"` in ModelConfig and SchedulerConfig as well? It's okay if you feel strongly about this; it just seems a bit unnecessary to require explicit specification.
@DarkLight1337 somehow this PR caused some speculative decoding tests to fail. For example:
Before this PR was merged (7dbe738 on the main branch):
After this PR was merged (051eaf6 on the main branch):
Follow-up to #9303

This PR adds a `--task` option which determines which model runner (for generation or embedding) to create when initializing vLLM. The default (`auto`) selects the first task that the model supports. For embedding models that share a model architecture with the base model, users can explicitly specify which task to initialize vLLM for, e.g.:

(Also addresses #6282 (comment).)

Since the new `task` option is semantically related to the `model` argument, I've placed it right after `model`, before `tokenizer`. To avoid incompatibilities resulting from this, I have added backwards compatibility and deprecated the usage of positional arguments apart from `model` in `LLM.__init__`.

Note: this introduces a breaking change for VLM2Vec users, as they currently do not have to pass `--task embedding` thanks to the hardcoded embedding-model detection logic. Nevertheless, requiring users to set the mode explicitly is more maintainable in the long run as the number of embedding models increases.