
Add support for encoder embedding models using MultiModal args #20026

Draft: wants to merge 3 commits into main

Conversation

maxdebayser
Contributor

This PR is a follow-up to #16188. It adds support for encoder models that don't have a causal attention mask, as well as for models such as cross-encoder/ms-marco-MiniLM-L-6-v2 that require token_type_ids to be passed from the tokenizer to the model. In contrast with PR #19988, here the support for token_type_ids is added using multimodal arguments. The advantage is that all modifications are encapsulated in bert.py and roberta.py; the downside is that the multimodal framework wasn't really meant for this, so the implementation is a bit hacky.
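For context, here is a minimal sketch (using Hugging Face transformers directly, not vLLM) of why token_type_ids matter for a cross-encoder: the tokenizer marks the query as segment 0 and the passage as segment 1, and the relevance score depends on that segmentation reaching the model's forward pass. The query/passage strings are made up for illustration.

    # Illustrative only: token_type_ids produced by the tokenizer for a
    # (query, passage) pair and consumed by the cross-encoder.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model_id = "cross-encoder/ms-marco-MiniLM-L-6-v2"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)

    inputs = tokenizer("what is vllm?",
                       "vLLM is a high-throughput LLM inference and serving engine.",
                       return_tensors="pt")
    print(inputs["token_type_ids"])  # 0s for the query tokens, 1s for the passage

    score = model(**inputs).logits  # relevance score for the pair
    print(score)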

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify bot added the frontend, v1, and tpu (Related to Google TPUs) labels on Jun 24, 2025

@gemini-code-assist bot (Contributor) left a comment


Summary of Changes

Hello @maxdebayser, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances vLLM's capability to serve encoder-only and cross-encoder models by integrating their specific input requirements, such as token_type_ids, through the MultiModal framework. It also includes crucial adaptations to the attention and KV cache mechanisms to properly handle non-causal attention and optimize performance for these model types, alongside expanding test coverage.

Highlights

  • Encoder Model Support: I've added comprehensive support for encoder-only embedding models, including those that do not use a causal attention mask and require token_type_ids for proper tokenization. This expands the range of models that can be efficiently served by vLLM.
  • MultiModal Framework Integration: The handling of token_type_ids for models like cross-encoders has been integrated into the existing MultiModal arguments framework. This encapsulates the necessary modifications primarily within bert.py and roberta.py, streamlining the data flow for these specific inputs.
  • Attention and KV Cache Adaptations: The attention backend (FlashAttentionImpl) now supports the ENCODER_ONLY attention type and applies the causal mask dynamically based on the attention type (see the sketch after this list). Furthermore, the KV cache mechanism has been updated to support prefix caching for ENCODER_ONLY models, while prefix caching and chunked prefill are disabled for general non-decoder attention layers due to current limitations.
  • Testing Enhancements: Test coverage for encoder and cross-encoder models has been significantly improved. Many skip_v1 marks were removed from embedding tests, and an autouse fixture was introduced to ensure relevant tests run with both v0 and v1 engines by default.
  • Known Limitations: It's important to note that CUDA graphs are currently not supported for pooling operations, leading to a fallback to eager mode. Additionally, for models with non-decoder attention layers, prefix caching and chunked prefill are explicitly disabled.
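As a point of reference, the sketch below (not the PR's actual FlashAttentionImpl code, and with an illustrative stand-in for the attention-type enum) shows the masking behavior described above: encoder-only attention runs without a causal mask, while decoder attention keeps it.

    # Illustrative sketch of toggling causal masking by attention type.
    from enum import Enum

    import torch
    import torch.nn.functional as F

    class AttentionType(Enum):  # stand-in for illustration, not vLLM's enum
        DECODER = "decoder"
        ENCODER_ONLY = "encoder_only"

    def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                  attn_type: AttentionType) -> torch.Tensor:
        # Bidirectional attention for encoder-only models, causal otherwise.
        causal = attn_type is not AttentionType.ENCODER_ONLY
        return F.scaled_dot_product_attention(q, k, v, is_causal=causal)

    # Toy usage: batch=1, heads=2, seq_len=4, head_dim=8.
    q = k = v = torch.randn(1, 2, 4, 8)
    out = attention(q, k, v, AttentionType.ENCODER_ONLY)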
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist bot (Contributor) left a comment


Code Review

This PR adds support for encoder embedding models using MultiModal args. The changes include modifications to tests, configurations, and model implementations for BERT and RoBERTa. The code introduces new classes and functionalities for handling token type IDs and multi-modal processing. The changes also include updates to attention backends and KV cache management. The code is generally well-structured, but some areas could benefit from additional comments to clarify the logic and reasoning behind certain decisions.

Comment on lines +720 to +722
logger.warning("CUDA graph is not supported for pooling yet, "
               "fallback to the eager mode.")
self.enforce_eager = True

Severity: medium

This warning message and the subsequent enforce_eager setting seem specific to the pooling functionality. Consider encapsulating this logic within the if self.runner_type == "pooling" block to avoid unintended side effects for other model types.

            if isinstance(self.override_pooler_config, dict):
                self.override_pooler_config = PoolerConfig(
                    **self.override_pooler_config)

            # WIP: currently cuda graphs are not working for encoder models.
            logger.warning("CUDA graph is not supported for pooling yet, "
                           "fallback to the eager mode.")
            self.enforce_eager = True
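A standalone sketch of the scoping the comment suggests, using a hypothetical config class (not vLLM's actual ModelConfig): the eager-mode fallback is applied only when runner_type is "pooling", so other runner types keep CUDA graphs.

    # Hypothetical illustration of scoping the eager-mode fallback to pooling.
    import logging

    logger = logging.getLogger(__name__)

    class DummyModelConfig:  # stand-in for illustration only
        def __init__(self, runner_type: str) -> None:
            self.runner_type = runner_type
            self.enforce_eager = False
            if self.runner_type == "pooling":
                logger.warning("CUDA graph is not supported for pooling yet, "
                               "fallback to the eager mode.")
                self.enforce_eager = True

    print(DummyModelConfig("pooling").enforce_eager)   # True
    print(DummyModelConfig("generate").enforce_eager)  # False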

Comment on lines +80 to +84
if input_ids is not None and position_ids is not None \
        and inputs_embeds is None and token_type_ids is None:
    token_type_ids = torch.zeros(input_ids.size(),
                                 dtype=torch.long,
                                 device=input_ids.device)

Severity: medium

This conditional logic seems a bit unusual. It checks if input_ids, position_ids are not None and inputs_embeds, token_type_ids are None, and if so, creates token_type_ids. It might be helpful to add a comment explaining the purpose of this check and the scenario it addresses.

Suggested change (adds explanatory comments around the existing check):

# forward was called directly without going
# through the multi-modal flow
if input_ids is not None and position_ids is not None \
        and inputs_embeds is None and token_type_ids is None:
    # Create token_type_ids if not provided
    token_type_ids = torch.zeros(input_ids.size(),
                                 dtype=torch.long,
                                 device=input_ids.device)

Comment on lines +60 to +64
zero_pos = torch.where(position_ids == 0)[0]
end_pos = torch.cat((zero_pos[1:],
                     torch.tensor([position_ids.shape[0]],
                                  device=zero_pos.device)))
seq_lens = end_pos - zero_pos

Severity: medium

This logic replaces position ids, but it's not immediately clear why. Adding a brief explanation of the issue and the approach would improve readability.

        # Replace position ids because in RoBERTa models
        # they have to start at padding_idx + 1 and ignore
        # existing padding tokens
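To make the quoted computation concrete, here is a standalone example (with a made-up flattened batch) of how sequence lengths are recovered from position ids that reset to zero at each sequence start; in the RoBERTa model the positions are then rebuilt to start at padding_idx + 1.

    # Standalone example of deriving seq_lens from flattened position_ids.
    import torch

    # Three flattened sequences of lengths 3, 2 and 4 (made-up example).
    position_ids = torch.tensor([0, 1, 2, 0, 1, 0, 1, 2, 3])

    zero_pos = torch.where(position_ids == 0)[0]
    end_pos = torch.cat((zero_pos[1:],
                         torch.tensor([position_ids.shape[0]],
                                      device=zero_pos.device)))
    seq_lens = end_pos - zero_pos
    print(seq_lens)  # tensor([3, 2, 4])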

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>