
Add support for encoder embedding models using MultiModal args #20026

Draft: wants to merge 3 commits into main

Conversation

maxdebayser
Contributor

This PR is a follow-up to #16188. It adds support for encoder models that don't have a causal attention mask, as well as for models such as cross-encoder/ms-marco-MiniLM-L-6-v2 that require token_type_ids to be passed from the tokenizer to the model. In contrast with PR #19988, here the support for token_type_ids is added using multimodal arguments. The advantage is that all modifications are encapsulated in bert.py and roberta.py; the downside is that the multimodal framework wasn't really meant for this, so the implementation is a bit hacky.
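For context, here is a minimal sketch (using Hugging Face transformers directly, not vLLM) of why token_type_ids matter for a cross-encoder: the tokenizer marks the query as segment 0 and the passage as segment 1, and the relevance score depends on that segmentation reaching the model's forward pass. The query/passage strings are made up for illustration.

    # Illustrative only: token_type_ids produced by the tokenizer for a
    # (query, passage) pair and consumed by the cross-encoder.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model_id = "cross-encoder/ms-marco-MiniLM-L-6-v2"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)

    inputs = tokenizer("what is vllm?",
                       "vLLM is a high-throughput LLM inference and serving engine.",
                       return_tensors="pt")
    print(inputs["token_type_ids"])  # 0s for the query tokens, 1s for the passage

    score = model(**inputs).logits  # relevance score for the pair
    print(score)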

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify bot added the frontend, v1, and tpu (Related to Google TPUs) labels on Jun 24, 2025

@gemini-code-assist bot (Contributor) left a comment


Summary of Changes

Hello @maxdebayser, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances vLLM's capability to serve encoder-only and cross-encoder models by integrating their specific input requirements, such as token_type_ids, through the MultiModal framework. It also includes crucial adaptations to the attention and KV cache mechanisms to properly handle non-causal attention and optimize performance for these model types, alongside expanding test coverage.

Highlights

  • Encoder Model Support: I've added comprehensive support for encoder-only embedding models, including those that do not use a causal attention mask and require token_type_ids for proper tokenization. This expands the range of models that can be efficiently served by vLLM.
  • MultiModal Framework Integration: The handling of token_type_ids for models like cross-encoders has been integrated into the existing MultiModal arguments framework. This encapsulates the necessary modifications primarily within bert.py and roberta.py, streamlining the data flow for these specific inputs.
  • Attention and KV Cache Adaptations: The attention backend (FlashAttentionImpl) now supports the ENCODER_ONLY attention type and applies the causal mask dynamically based on the attention type (see the sketch after this list). Furthermore, the KV cache mechanism has been updated to support prefix caching for ENCODER_ONLY models, while prefix caching and chunked prefill are disabled for general non-decoder attention layers due to current limitations.
  • Testing Enhancements: Test coverage for encoder and cross-encoder models has been significantly improved. Many skip_v1 marks were removed from embedding tests, and an autouse fixture was introduced to ensure relevant tests run with both v0 and v1 engines by default.
  • Known Limitations: It's important to note that CUDA graphs are currently not supported for pooling operations, leading to a fallback to eager mode. Additionally, for models with non-decoder attention layers, prefix caching and chunked prefill are explicitly disabled.
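As a point of reference, the sketch below (not the PR's actual FlashAttentionImpl code, and with an illustrative stand-in for the attention-type enum) shows the masking behavior described above: encoder-only attention runs without a causal mask, while decoder attention keeps it.

    # Illustrative sketch of toggling causal masking by attention type.
    from enum import Enum

    import torch
    import torch.nn.functional as F

    class AttentionType(Enum):  # stand-in for illustration, not vLLM's enum
        DECODER = "decoder"
        ENCODER_ONLY = "encoder_only"

    def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                  attn_type: AttentionType) -> torch.Tensor:
        # Bidirectional attention for encoder-only models, causal otherwise.
        causal = attn_type is not AttentionType.ENCODER_ONLY
        return F.scaled_dot_product_attention(q, k, v, is_causal=causal)

    # Toy usage: batch=1, heads=2, seq_len=4, head_dim=8.
    q = k = v = torch.randn(1, 2, 4, 8)
    out = attention(q, k, v, AttentionType.ENCODER_ONLY)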
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist bot (Contributor) left a comment


Code Review

This PR adds support for encoder embedding models using MultiModal args. The changes include modifications to tests, configurations, and model implementations for BERT and RoBERTa. The code introduces new classes and functionalities for handling token type IDs and multi-modal processing. The changes also include updates to attention backends and KV cache management. The code is generally well-structured, but some areas could benefit from additional comments to clarify the logic and reasoning behind certain decisions.

Comment on lines +720 to +722
logger.warning("CUDA graph is not supported for pooling yet, "
               "fallback to the eager mode.")
self.enforce_eager = True

Severity: medium

This warning message and the subsequent enforce_eager setting seem specific to the pooling functionality. Consider encapsulating this logic within the if self.runner_type == "pooling" block to avoid unintended side effects for other model types.

            if isinstance(self.override_pooler_config, dict):
                self.override_pooler_config = PoolerConfig(
                    **self.override_pooler_config)

            # WIP: currently cuda graphs are not working for encoder models.
            logger.warning("CUDA graph is not supported for pooling yet, "
                           "fallback to the eager mode.")
            self.enforce_eager = True
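A standalone sketch of the scoping the comment suggests, using a hypothetical config class (not vLLM's actual ModelConfig): the eager-mode fallback is applied only when runner_type is "pooling", so other runner types keep CUDA graphs.

    # Hypothetical illustration of scoping the eager-mode fallback to pooling.
    import logging

    logger = logging.getLogger(__name__)

    class DummyModelConfig:  # stand-in for illustration only
        def __init__(self, runner_type: str) -> None:
            self.runner_type = runner_type
            self.enforce_eager = False
            if self.runner_type == "pooling":
                logger.warning("CUDA graph is not supported for pooling yet, "
                               "fallback to the eager mode.")
                self.enforce_eager = True

    print(DummyModelConfig("pooling").enforce_eager)   # True
    print(DummyModelConfig("generate").enforce_eager)  # False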

Comment on lines +80 to +84
if input_ids is not None and position_ids is not None \
        and inputs_embeds is None and token_type_ids is None:
    token_type_ids = torch.zeros(input_ids.size(),
                                 dtype=torch.long,
                                 device=input_ids.device)

Severity: medium

This conditional logic seems a bit unusual. It checks if input_ids, position_ids are not None and inputs_embeds, token_type_ids are None, and if so, creates token_type_ids. It might be helpful to add a comment explaining the purpose of this check and the scenario it addresses.

Suggested change (adds explanatory comments around the existing check):

# forward was called directly without going
# through the multi-modal flow
if input_ids is not None and position_ids is not None \
        and inputs_embeds is None and token_type_ids is None:
    # Create token_type_ids if not provided
    token_type_ids = torch.zeros(input_ids.size(),
                                 dtype=torch.long,
                                 device=input_ids.device)

Comment on lines +60 to +64
zero_pos = torch.where(position_ids == 0)[0]
end_pos = torch.cat((zero_pos[1:],
                     torch.tensor([position_ids.shape[0]],
                                  device=zero_pos.device)))
seq_lens = end_pos - zero_pos

Severity: medium

This logic replaces position ids, but it's not immediately clear why. Adding a brief explanation of the issue and the approach would improve readability.

        # Replace position ids because in RoBERTa models
        # they have to start at padding_idx + 1 and ignore
        # existing padding tokens
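To make the quoted computation concrete, here is a standalone example (with a made-up flattened batch) of how sequence lengths are recovered from position ids that reset to zero at each sequence start; in the RoBERTa model the positions are then rebuilt to start at padding_idx + 1.

    # Standalone example of deriving seq_lens from flattened position_ids.
    import torch

    # Three flattened sequences of lengths 3, 2 and 4 (made-up example).
    position_ids = torch.tensor([0, 1, 2, 0, 1, 0, 1, 2, 3])

    zero_pos = torch.where(position_ids == 0)[0]
    end_pos = torch.cat((zero_pos[1:],
                         torch.tensor([position_ids.shape[0]],
                                      device=zero_pos.device)))
    seq_lens = end_pos - zero_pos
    print(seq_lens)  # tensor([3, 2, 4])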

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>