
[Core][VLM] Add precise multi-modal placeholder tracking #8346


Merged · 3 commits · Nov 1, 2024

Conversation

@petersalas (Contributor) commented on Sep 10, 2024

Currently, multi-modal prompt placeholders are related to the multi-modal embeddings exclusively by index, i.e. the first placeholder token in the prompt must correspond to the first MM embedding vector, etc. This adds a mechanism for tracking multi-modal placeholder ranges precisely which allows multi-modal models to be used with chunked prefill and is a prerequisite for allowing multi-modal models to be used with prefix caching enabled (see #8348).

For a model to use precise tracking, it:

  • Sets the placeholder ranges on the LLM inputs (these should align 1:1 with the multi-modal items for the matching key); a minimal sketch of this step follows the list.
  • Uses the computed placeholder/embedding index tensors from the AttentionMetadata to merge the embeddings instead of matching on token ID.
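
For concreteness, a minimal self-contained sketch of the first step. The class and field names here are illustrative assumptions for explanation and may not match the exact vLLM definitions:

# Illustrative sketch only; PlaceholderRange and its fields are assumptions.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class PlaceholderRange:
    offset: int  # index of the first placeholder token in the prompt
    length: int  # number of placeholder tokens for this multi-modal item

# Example prompt layout: 5 text tokens, then 32 audio placeholder tokens.
# The ranges are attached to the LLM inputs, aligned 1:1 with the
# multi-modal items under the same key ("audio" here).
multi_modal_placeholders: Dict[str, List[PlaceholderRange]] = {
    "audio": [PlaceholderRange(offset=5, length=32)],
}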

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of it by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@DarkLight1337 (Member) left a comment:

Some initial comments.

@ywang96 self-assigned this on Sep 11, 2024
@DarkLight1337 (Member) left a comment:

Added some new comments.

Let's also test whether chunked prefill works with multimodal data for online serving since only prompt_token_ids are passed in that case.

@DarkLight1337 (Member) commented on Sep 24, 2024

As a preliminary test, I ran the VLM CI against this PR. Please take a look at the CI failures and fix them.

@DarkLight1337 (Member):
PTAL at the failing workers test as well.

@DarkLight1337 (Member):
Please merge from main again to resolve the test failures.

@DarkLight1337 (Member) commented on Sep 29, 2024

It looks like you changed the format of multi_modal_data passed to the input mapper to always be in multi-input form (i.e. a list of single input instances), which breaks some of the models that don't allow multi-input.
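
For illustration only (stand-in values, not code from this PR), the difference between the two forms is:

image = object()                       # stand-in for e.g. a PIL image
single_input_form = {"image": image}   # what some input mappers still expect
multi_input_form = {"image": [image]}  # list of single input instances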

intersection = range(max(positions.start, placeholder.start),
                     min(positions.stop, placeholder.stop))

if not intersection:

Collaborator:

Is there a real use case where intersection is an empty set?

@petersalas (Contributor, Author):

Yes, a couple of scenarios:

  • If prefix caching is enabled (following integration with [Core][VLM] Add support for prefix caching for multi-modal models #8348), we can skip multi-modal handling altogether for any multi-modal items whose corresponding blocks are cached.
  • In chunked prefill, if a multi-modal item is in a section of the prompt that isn't currently being prefilled, it can be ignored for that inference step; a small standalone example follows.
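
A small standalone example of the chunked-prefill case (values are made up): the placeholder lies outside the chunk being prefilled, so the intersection is empty and the item can be skipped for this step.

positions = range(0, 16)     # chunked prefill: this step covers prompt tokens 0..15
placeholder = range(20, 52)  # a 32-token multi-modal placeholder starting at 20

intersection = range(max(positions.start, placeholder.start),
                     min(positions.stop, placeholder.stop))

if not intersection:
    # Nothing to merge for this item in this step; its embeddings are either
    # handled by a later chunk or already covered by cached prefix blocks.
    print("skip multi-modal handling for this item")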

@@ -466,6 +484,7 @@ def build(self, seq_lens: List[int], query_lens: List[int],
     num_prefill_tokens=self.num_prefill_tokens,
     num_decode_tokens=num_decode_tokens,
     seq_lens=seq_lens,
+    multi_modal_placeholder_maps=placeholder_maps,

Collaborator:

What part of the code uses this added parameter? Ditto for all the other attention changes.

@petersalas (Contributor, Author):

Any model that uses merge_multimodal_embeddings_from_map -- this change only adds it for Ultravox but that's really only to limit scope. In time I'd expect any multi-modal model that merges multi-modal and text embeddings to use merge_multimodal_embeddings_from_map instead of merge_multimodal_embeddings (barring any tradeoffs I might be neglecting).
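
For illustration, a rough self-contained sketch of what an index-map based merge can look like. The names and signature here are assumptions and do not claim to match merge_multimodal_embeddings_from_map exactly:

import torch

def merge_from_map(inputs_embeds: torch.Tensor,
                   mm_embeds: torch.Tensor,
                   src_indices: torch.Tensor,
                   dest_indices: torch.Tensor) -> torch.Tensor:
    # src_indices selects the rows of the flattened multi-modal embeddings
    # needed for this step; dest_indices are the placeholder positions among
    # the scheduled tokens. No scan over input_ids for a placeholder token ID.
    inputs_embeds[dest_indices] = mm_embeds[src_indices].to(inputs_embeds.dtype)
    return inputs_embeds

# Example: with chunked prefill, only the part of a 32-row audio embedding
# that overlaps the current chunk (here its first 4 rows) is written.
inputs_embeds = torch.zeros(16, 8)
audio_embeds = torch.randn(32, 8)
merge_from_map(inputs_embeds, audio_embeds,
               src_indices=torch.arange(0, 4),
               dest_indices=torch.arange(12, 16))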

-    inputs_embeds = merge_multimodal_embeddings(
-        input_ids, inputs_embeds, audio_embeddings,
-        _AUDIO_PLACEHOLDER_TOKEN)
+    merge_multimodal_embeddings_from_map(

Collaborator:

It looks like "merge_multimodal_embeddings_from_map" is used only for the Ultravox model; where does this kind of merging happen for the other models?

@petersalas (Contributor, Author):

See the other comment -- in time I'd expect usage of merge_multimodal_embeddings to be replaced with merge_multimodal_embeddings_from_map.


How can I use it in other multimodal models such as LLaVA? Does multi_modal_placeholder_index_maps have other keys besides "audio", e.g. one used for images?

@alexm-redhat (Collaborator) left a comment:

@petersalas did you have a chance to do a performance comparison of non-chunked vs. chunked prompt execution with your changes?

@petersalas (Contributor, Author) commented on Oct 10, 2024

@petersalas did you have a chance to do a performance comparison of non-chunked vs. chunked prompt execution with your changes?

Not rigorously -- I did some ad-hoc runs of offline_inference_audio_language.py at various batch sizes as a smoke test (and came to the conclusion that throughput was slightly worse with chunked prefill than without). Is there a particular setup I should benchmark?

edit: here's the ad-hoc test I ran:

chunked prefill on

python examples/offline_inference_audio_language.py --num-prompts 100
Processed prompts: 100%|█| 100/100 [00:06<00:00, 14.87it/s, est. speed input: 2156.88 toks/s, output: 758.77 toks/s]

chunked prefill off

python examples/offline_inference_audio_language.py --num-prompts 100
Processed prompts: 100%|█| 100/100 [00:06<00:00, 15.38it/s, est. speed input: 2230.79 toks/s, output: 813.09 toks/s]

@afeldman-nm (Contributor) left a comment:

Thanks for the PR. Left a few nits but overall LGTM.

@petersalas force-pushed the psalas/placeholder-positions branch from 84d12e7 to 366d65a on October 24, 2024 at 15:52
@ywang96 (Member) left a comment:

@petersalas Thanks for this great work and apologies for the very delayed review! I have left comments/suggestions so please take a look!

I previously discussed with @WoosukKwon how we want to support chunked prefill for multimodal models, and we landed on a similar idea to this PR, so this is definitely on the right track.

A few things to note from this PR:

  • With this PR and chunked prefill enabled, embedding generation through the encoder and projector for a single multimodal data item can be executed multiple times unnecessarily, especially if the total token budget of a batch is small. Therefore, we're essentially trading full decoder prefill + one-time encoder prefill for chunked decoder prefill + potentially repeated encoder prefill.

    This should be okay for now since the encoder call is not expensive compared to the decoder (I wonder if you have profiled this tradeoff yourself?), but eventually we should definitely design a cache system for these multimodal embeddings so that we only need to generate them once and extract the needed portions at every step; a rough sketch of that idea follows this list.

  • The PR only enables models that use repeat_and_pad_placeholder_tokens. I think later we should decouple the ranges generation from this function to provide some flexibility, or provide another interface where the developer can supply their own ranges (I've been thinking about the best way to do this, especially with our re-arch work).

  • We also need to update both user and developer documentation for how to enable their models with chunked prefill, but that doesn't need to happen in this PR.
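
As a rough illustration of the cache idea in the first bullet (purely hypothetical, not part of this PR): compute the encoder/projector output once per multi-modal item and slice out only the rows a given chunk needs.

from typing import Callable, Dict
import torch

class MMEmbeddingCache:
    """Hypothetical per-item cache for multi-modal embeddings."""

    def __init__(self) -> None:
        self._cache: Dict[str, torch.Tensor] = {}

    def get_slice(self, item_id: str, start: int, stop: int,
                  encode_fn: Callable[[], torch.Tensor]) -> torch.Tensor:
        # Run the (relatively expensive) encoder only the first time this
        # item is needed; later chunks just index into the cached tensor.
        if item_id not in self._cache:
            self._cache[item_id] = encode_fn()
        return self._cache[item_id][start:stop]

# Usage: the first chunk triggers the encoder; the second chunk reuses it.
cache = MMEmbeddingCache()
encode = lambda: torch.randn(32, 8)  # stand-in for encoder + projector
first_half = cache.get_slice("audio-0", 0, 16, encode)
second_half = cache.get_slice("audio-0", 16, 32, encode)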

Thanks again for this contribution!

mergify bot commented on Oct 30, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @petersalas please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

richardsliu pushed a commit to richardsliu/vllm that referenced this pull request on Nov 4, 2024
petersalas added a commit to fixie-ai/vllm that referenced this pull request on Nov 5, 2024
DarkLight1337 pushed a commit that referenced this pull request on Nov 6, 2024
hissu-hyvarinen pushed a commit to ROCm/vllm that referenced this pull request on Nov 6, 2024
JC1DA pushed two commits to JC1DA/vllm that referenced this pull request on Nov 11, 2024
sumitd2 pushed two commits to sumitd2/vllm that referenced this pull request on Nov 14, 2024
sleepwalker2017 pushed two commits to sleepwalker2017/vllm that referenced this pull request on Dec 13, 2024
LeiWang1999 pushed two commits to LeiWang1999/vllm-bitblas that referenced this pull request on Mar 26, 2025
Labels: ready (ONLY add when PR is ready to merge/full CI is needed)
8 participants