[Core][VLM] Add precise multi-modal placeholder tracking #8346
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
Some initial comments.
Force-pushed from e457298 to aa756bf.
Added some new comments.
Let's also test whether chunked prefill works with multimodal data for online serving, since only prompt_token_ids are passed in that case.
As a preliminary test, I ran the VLM CI against this PR. Please take a look at the CI failures and fix them.
PTAL at the failing workers test as well.
Please merge from main.
It looks like the format of …
```python
intersection = range(max(positions.start, placeholder.start),
                     min(positions.stop, placeholder.stop))

if not intersection:
```
Is there a real use case where intersection is an empty set?
Yes, a couple of scenarios:
- If prefix caching is enabled (following integration with [Core][VLM] Add support for placeholder token content hashes #8348), we can skip multi-modal handling altogether for any multi-modal items whose corresponding blocks are cached.
- In chunked prefill, if a multi-modal item is in a section of the prompt that isn't currently being prefilled, it can be ignored for that inference.
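As a rough illustration of the two scenarios above (the helper name, variable names, and numbers below are hypothetical, not the PR's actual code): only the part of a placeholder range that overlaps the positions computed in the current step needs multi-modal handling, and the intersection is empty when the item falls entirely outside that step.

```python
def placeholder_overlap(positions: range, placeholder: range) -> range:
    """Portion of `placeholder` covered by this step's positions; empty when
    the item lies outside the current chunk (or its blocks were served from
    the prefix cache)."""
    return range(max(positions.start, placeholder.start),
                 min(positions.stop, placeholder.stop))

# Chunked prefill: this step computes prompt positions [0, 512).
positions = range(0, 512)
# An image whose placeholder tokens occupy prompt positions [600, 1176).
placeholder = range(600, 1176)

if not placeholder_overlap(positions, placeholder):
    print("image embeddings are not needed for this chunk")
```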
```diff
@@ -466,6 +484,7 @@ def build(self, seq_lens: List[int], query_lens: List[int],
             num_prefill_tokens=self.num_prefill_tokens,
             num_decode_tokens=num_decode_tokens,
             seq_lens=seq_lens,
+            multi_modal_placeholder_maps=placeholder_maps,
```
Which part of the code uses this added parameter? Ditto for all the other attention changes.
Any model that uses merge_multimodal_embeddings_from_map -- this change only adds it for Ultravox, but that's really only to limit scope. In time I'd expect any multi-modal model that merges multi-modal and text embeddings to use merge_multimodal_embeddings_from_map instead of merge_multimodal_embeddings (barring any tradeoffs I might be neglecting).
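For illustration, a minimal sketch of the map-based merge idea -- the function name, signature, and tensor shapes below are assumptions for the example, not the actual merge_multimodal_embeddings_from_map API: the placeholder map explicitly pairs positions in the current (possibly chunked) input with rows of the multi-modal embeddings.

```python
import torch

def merge_from_map(inputs_embeds: torch.Tensor,
                   mm_embeds: torch.Tensor,
                   dest_indices: torch.Tensor,
                   src_indices: torch.Tensor) -> torch.Tensor:
    """Scatter selected multi-modal embedding rows into the text embeddings.

    dest_indices: positions in this step's input that are placeholder tokens.
    src_indices:  the corresponding rows of the multi-modal embeddings.
    """
    inputs_embeds[dest_indices] = mm_embeds[src_indices]
    return inputs_embeds

# Example: a 10-token chunk where positions 3..6 are audio placeholders that
# map to rows 2..5 of the already-computed audio embeddings.
inputs_embeds = torch.zeros(10, 8)
audio_embeds = torch.randn(6, 8)
merged = merge_from_map(inputs_embeds, audio_embeds,
                        dest_indices=torch.tensor([3, 4, 5, 6]),
                        src_indices=torch.tensor([2, 3, 4, 5]))
```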
```diff
-        inputs_embeds = merge_multimodal_embeddings(
-            input_ids, inputs_embeds, audio_embeddings,
-            _AUDIO_PLACEHOLDER_TOKEN)
+        merge_multimodal_embeddings_from_map(
```
It looks like merge_multimodal_embeddings_from_map is only used for the Ultravox model; where does this kind of merging happen for the other models?
See the other comment -- in time I'd expect usage of merge_multimodal_embeddings to be replaced with merge_multimodal_embeddings_from_map.
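For contrast, here is a sketch of the token-ID-matching approach that the map-based merge would replace (simplified, not the actual vLLM helper): embedding rows are assigned to placeholder positions purely by order, which is why it breaks down once a step sees only part of the prompt.

```python
import torch

def merge_by_token_id(input_ids: torch.Tensor,
                      inputs_embeds: torch.Tensor,
                      mm_embeds: torch.Tensor,
                      placeholder_token_id: int) -> torch.Tensor:
    # Every position holding the placeholder token receives the next
    # multi-modal embedding row, in order of appearance.
    mask = input_ids == placeholder_token_id
    inputs_embeds[mask] = mm_embeds[: int(mask.sum())]
    return inputs_embeds
```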
@petersalas did you have a chance to do a performance comparison of non-chunked vs. chunked prompt execution with your changes?
Not rigorously -- I did some ad-hoc runs. Edit: here's the ad-hoc test I ran, comparing chunked prefill on vs. chunked prefill off.
Thanks for the PR. Left a few nits but overall LGTM.
Force-pushed from 84d12e7 to 366d65a.
@petersalas Thanks for this great work and apologies for the very delayed review! I have left comments/suggestions so please take a look!
I previously discussed with @WoosukKwon how we want to support chunked prefill for multimodal models, and we landed on a similar idea to this PR, so this is definitely on the right track.
A few things to note from this PR:
- With this PR and chunked prefill enabled, embedding generation through the encoder and projector for a single multimodal data item can be executed multiple times unnecessarily, especially if the total budget of a batch is small. Therefore, we're essentially trading off full decoder prefill + one-time encoder prefill for chunked decoder prefill + potentially repeated encoder prefill. This should be okay for now since the encoder call is not expensive compared to the decoder (I wonder if you have profiled this tradeoff yourself?), but eventually we should definitely design a cache system for these multimodal embeddings so that we only need to generate them once and extract the needed portions at every step. (A rough sketch of this tradeoff is included after this list.)
- The PR only enables models that use repeat_and_pad_placeholder_tokens. I think later we should decouple the ranges generation from this function to provide some flexibility, or provide another interface where the developer can provide their own ranges. (I've been thinking about the best way to do this, especially with our re-arch work.)
- We also need to update both user and developer documentation for how to enable their models with chunked prefill, but that doesn't need to happen in this PR.
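A rough, self-contained sketch of the encoder-call tradeoff from the first bullet above (chunk sizes and placeholder positions are made up for illustration): without an embedding cache, the encoder runs once per prefill step that overlaps the item's placeholder range.

```python
def encoder_calls(prompt_len: int, chunk_size: int, placeholder: range) -> int:
    """Count how many prefill steps overlap the placeholder range, i.e. how
    many times the encoder would run without an embedding cache."""
    calls = 0
    for start in range(0, prompt_len, chunk_size):
        chunk = range(start, min(start + chunk_size, prompt_len))
        overlap = range(max(chunk.start, placeholder.start),
                        min(chunk.stop, placeholder.stop))
        if overlap:
            calls += 1
    return calls

# An image placeholder spanning positions [200, 1200) of a 2048-token prompt:
print(encoder_calls(2048, 512, range(200, 1200)))   # chunked prefill -> 3 calls
print(encoder_calls(2048, 2048, range(200, 1200)))  # full prefill   -> 1 call
```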
Thanks again for this contribution!
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Peter Salas <peter@fixie.ai>
Force-pushed from 366d65a to acbc1bf.
This pull request has merge conflicts that must be resolved before it can be merged.
Currently, multi-modal prompt placeholders are related to the multi-modal embeddings exclusively by index, i.e. the first placeholder token in the prompt must correspond to the first MM embedding vector, and so on. This PR adds a mechanism for tracking multi-modal placeholder ranges precisely, which allows multi-modal models to be used with chunked prefill and is a prerequisite for allowing multi-modal models to be used with prefix caching enabled (see #8348).
For a model to use precise tracking, it uses the placeholder maps carried in AttentionMetadata to merge the embeddings instead of matching on token ID.
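To make the mechanism concrete, here is a minimal sketch of what per-item placeholder tracking could look like; the class and field names are illustrative and may not match the PR exactly. Each multi-modal item records exactly where its placeholder tokens sit in the prompt instead of being matched purely by order.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class PlaceholderRange:
    offset: int   # index of the item's first placeholder token in the prompt
    length: int   # number of placeholder tokens the item occupies

# e.g. one image occupying prompt positions 5..580 and one audio clip at 700..763
multi_modal_placeholders: Dict[str, List[PlaceholderRange]] = {
    "image": [PlaceholderRange(offset=5, length=576)],
    "audio": [PlaceholderRange(offset=700, length=64)],
}
```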