
[Core][VLM] Add precise multi-modal placeholder tracking #8346

Open
wants to merge 2 commits into main

Conversation

@petersalas (Contributor) commented Sep 10, 2024

Currently, multi-modal prompt placeholders are related to the multi-modal embeddings exclusively by index, i.e. the first placeholder token in the prompt must correspond to the first MM embedding vector, and so on. This adds a mechanism for tracking multi-modal placeholder ranges precisely, which allows multi-modal models to be used with chunked prefill and is a prerequisite for allowing multi-modal models to be used with prefix caching enabled (see #8348).
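For illustration, a minimal sketch of what such a placeholder range might carry (the field names here are assumptions for illustration, not necessarily the exact names used in the PR):

from dataclasses import dataclass

@dataclass
class PlaceholderRange:
    # Hypothetical: the span of prompt tokens reserved for one multi-modal item.
    offset: int   # index of the first placeholder token in the prompt
    length: int   # number of placeholder tokens for this item

# e.g. an audio clip whose embeddings occupy prompt positions 5..20 (16 tokens)
audio_placeholder = PlaceholderRange(offset=5, length=16)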

For a model to use precise tracking, it:

  • Sets the placeholder ranges on the LLM inputs (these should align 1-1 with the multi-modal items for the matching key; see the sketch after this list)
  • Uses the computed placeholder/embedding index tensors from the AttentionMetadata to merge the embeddings instead of matching on token ID.
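As a rough sketch of the first bullet (reusing the PlaceholderRange sketch above; key names such as multi_modal_placeholders are assumptions, not necessarily the PR's exact field names), the inputs produced by a model's input processor might look like:

# Hypothetical example prompt: 5 text tokens, 16 audio placeholder tokens, 2 text tokens.
prompt_token_ids = [1, 2, 3, 4, 5] + [32000] * 16 + [6, 7]   # 32000: made-up placeholder id
audio_clip = object()   # stands in for the decoded audio waveform

llm_inputs = {
    "prompt_token_ids": prompt_token_ids,
    "multi_modal_data": {"audio": [audio_clip]},
    # One range per multi-modal item, under the matching modality key:
    "multi_modal_placeholders": {
        "audio": [PlaceholderRange(offset=5, length=16)],
    },
}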


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@DarkLight1337 (Member) left a comment

Some initial comments.

Review threads (resolved): tests/multimodal/test_utils.py, vllm/model_executor/models/clip.py, vllm/model_executor/models/ultravox.py, vllm/multimodal/image.py
@DarkLight1337 (Member) left a comment

Added some new comments.

Let's also test whether chunked prefill works with multimodal data for online serving, since only prompt_token_ids are passed in that case.

Review threads (resolved): vllm/inputs/data.py, vllm/inputs/registry.py, vllm/sequence.py
@DarkLight1337 (Member) commented Sep 24, 2024

As a preliminary test, I ran the VLM CI against this PR. Please take a look at the CI failures and fix them.

@DarkLight1337 (Member) commented

PTAL at the failing workers test as well.

@DarkLight1337 (Member) commented

Please merge from main again to resolve the test failures.

@DarkLight1337 (Member) commented Sep 29, 2024

It looks like this PR changes the format of multi_modal_data passed to the input mapper so that it is always in multi-input form (i.e. a list of single-input instances), breaking some of the models that don't allow multi-input.
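For concreteness, a hypothetical before/after of what an input mapper receives (purely illustrative):

image = object()   # stands in for e.g. a PIL.Image

# Before: a model that only supports a single input receives the item directly.
multi_modal_data = {"image": image}

# After this change: the item always arrives wrapped in a list, even for models
# that don't allow multi-input, which is what breaks them.
multi_modal_data = {"image": [image]}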

Review threads: vllm/multimodal/base.py (2 threads, resolved)
# Overlap between the token positions computed in this step and the
# placeholder range of one multi-modal item.
intersection = range(max(positions.start, placeholder.start),
                     min(positions.stop, placeholder.stop))

if not intersection:
Collaborator:

Is there a real use case where intersection is an empty set?

Contributor (author):

Yes, a couple of scenarios:

  • If prefix caching is enabled (following integration with [Core][VLM] Add support for placeholder token content hashes #8348), we can skip multi-modal handling altogether for any multi-modal items whose corresponding blocks are cached.
  • In chunked prefill, if a multi-modal item is in a section of the prompt that isn't currently being prefilled, it can be ignored for that inference (sketched below).
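A small, self-contained example of the empty-intersection case during chunked prefill (the positions and ranges are made up for illustration):

# Hypothetical: this prefill chunk covers prompt positions [0, 512), while an
# audio item's placeholders occupy [600, 700) -- so the item can be skipped here.
positions = range(0, 512)       # token positions computed in this chunk
placeholder = range(600, 700)   # placeholder range of one multi-modal item

intersection = range(max(positions.start, placeholder.start),
                     min(positions.stop, placeholder.stop))

assert not intersection   # nothing to merge for this item in this chunk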

@@ -466,6 +484,7 @@ def build(self, seq_lens: List[int], query_lens: List[int],
num_prefill_tokens=self.num_prefill_tokens,
num_decode_tokens=num_decode_tokens,
seq_lens=seq_lens,
multi_modal_placeholder_maps=placeholder_maps,
Collaborator:

Which part of the code uses this added parameter? Ditto for all the other attention changes.

Contributor (author):

Any model that uses merge_multimodal_embeddings_from_map -- this change only adds it for Ultravox but that's really only to limit scope. In time I'd expect any multi-modal model that merges multi-modal and text embeddings to use merge_multimodal_embeddings_from_map instead of merge_multimodal_embeddings (barring any tradeoffs I might be neglecting).
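To make the distinction concrete, here is a rough sketch (names and signatures are illustrative, not the exact ones in the PR): token-ID matching scans input_ids for the placeholder token and assumes the order lines up with the embeddings, while merging from a map uses index tensors derived from the tracked placeholder ranges.

import torch

def merge_by_token_id(input_ids: torch.Tensor, inputs_embeds: torch.Tensor,
                      mm_embeds: torch.Tensor, placeholder_token_id: int):
    # Old approach: position is implied by the order of placeholder tokens.
    mask = input_ids == placeholder_token_id
    inputs_embeds[mask] = mm_embeds.reshape(-1, mm_embeds.shape[-1])
    return inputs_embeds

def merge_from_index_map(inputs_embeds: torch.Tensor, mm_embeds: torch.Tensor,
                         dest_indices: torch.Tensor, src_indices: torch.Tensor):
    # Map-based approach: the attention metadata says which destination rows in
    # the current batch take which rows of the multi-modal embeddings, so
    # chunked or skipped items are handled without relying on token order.
    flat = mm_embeds.reshape(-1, mm_embeds.shape[-1])
    inputs_embeds[dest_indices] = flat[src_indices]
    return inputs_embeds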

inputs_embeds = merge_multimodal_embeddings(
    input_ids, inputs_embeds, audio_embeddings,
    _AUDIO_PLACEHOLDER_TOKEN)
merge_multimodal_embeddings_from_map(
Collaborator:

It looks like "merge_multimodal_embeddings_from_map" is used only for ultravox model, where this kind of merging happens for the other models?

Contributor (author):

See the other comment -- in time I'd expect usage of merge_multimodal_embeddings to be replaced with merge_multimodal_embeddings_from_map.

@alexm-neuralmagic (Collaborator) left a comment

@petersalas did you have a chance to do a performance comparison of non-chunked vs. chunked prompt execution with your changes?

@petersalas (Contributor, author) commented Oct 10, 2024

> @petersalas did you have a chance to do a performance comparison of non-chunked vs. chunked prompt execution with your changes?

Not rigorously -- I did some ad-hoc runs of offline_inference_audio_language.py at various batch sizes as a smoke test (and came to the conclusion that throughput was slightly worse with chunked prefill than without). Is there a particular setup I should benchmark?

edit: here's the ad-hoc test I ran:

chunked prefill on

python examples/offline_inference_audio_language.py --num-prompts 100
Processed prompts: 100%|█| 100/100 [00:06<00:00, 14.87it/s, est. speed input: 2156.88 toks/s, output: 758.77 toks/s]

chunked prefill off

python examples/offline_inference_audio_language.py --num-prompts 100
Processed prompts: 100%|█| 100/100 [00:06<00:00, 15.38it/s, est. speed input: 2230.79 toks/s, output: 813.09 toks/s]
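
The command line is identical in both runs, so the toggle presumably lives inside the script. For reference, a minimal hedged sketch of enabling chunked prefill in an offline run, assuming vLLM's standard enable_chunked_prefill engine argument and a placeholder model:

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m",        # placeholder model for illustration
          enable_chunked_prefill=True)
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(max_tokens=64))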

@afeldman-nm (Contributor) left a comment

Thanks for the PR. Left a few nits but overall LGTM.

Review threads (outdated, resolved): vllm/multimodal/base.py (2 threads), vllm/attention/backends/abstract.py
@ywang96 (Member) left a comment

@petersalas Thanks for this great work and apologies for the very delayed review! I have left comments/suggestions so please take a look!

I previously discussed with @WoosukKwon how we want to support chunked prefill for multimodal models, and we landed on a similar idea to this PR, so this is definitely on the right track.

A few things to note from this PR:

  • With this PR and chunked prefill enabled, embedding generation through the encoder and projector for a single multimodal data item can be executed multiple times unnecessarily, especially if the total token budget of a batch is small. Therefore, we're essentially trading off full decoder prefill + one-time encoder prefill for chunked decoder prefill + potentially repeated encoder prefill.

    This should be okay for now since the encoder call is not expensive compared to the decoder (I wonder if you have profiled this tradeoff yourself?), but eventually we should design a cache system for these multimodal embeddings so that we only need to generate them once and extract the needed portions at every step (a rough sketch follows below).

  • The PR only enables models that use repeat_and_pad_placeholder_tokens. I think later we should decouple the range generation from this function to provide some flexibility, or provide another interface where developers can supply their own ranges (I've been thinking about the best way to do this, especially with our re-arch work).

  • We also need to update both the user and developer documentation for how to enable models with chunked prefill, but that doesn't need to happen in this PR.

Thanks again for this contribution!
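
A minimal sketch of the embedding-cache idea from the first bullet above (purely hypothetical; the names and structure are not from the PR):

import torch

class MMEmbeddingCache:
    # Hypothetical: run the encoder + projector once per multi-modal item and
    # slice out only the embedding rows each prefill chunk needs.
    def __init__(self):
        self._cache: dict[str, torch.Tensor] = {}   # item hash -> (num_tokens, hidden)

    def get_or_encode(self, item_hash: str, encode_fn) -> torch.Tensor:
        if item_hash not in self._cache:
            self._cache[item_hash] = encode_fn()     # the expensive encoder call
        return self._cache[item_hash]

    def slice_for_chunk(self, item_hash: str, start: int, stop: int) -> torch.Tensor:
        # Extract the embedding rows that fall inside the current prefill chunk.
        return self._cache[item_hash][start:stop]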

Review threads (resolved): vllm/attention/backends/abstract.py (×2), vllm/model_executor/models/blip2.py, vllm/multimodal/base.py (×4), vllm/worker/model_runner.py, vllm/attention/backends/utils.py
mergify bot commented Oct 30, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @petersalas please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 30, 2024
Signed-off-by: Peter Salas <peter@fixie.ai>
Signed-off-by: Peter Salas <peter@fixie.ai>
mergify bot commented Nov 1, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @petersalas please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 1, 2024
7 participants