Add support for encoder embedding models using MultiModal args #20026
Conversation
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Summary of Changes
Hello @maxdebayser, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances vLLM's capability to serve encoder-only and cross-encoder models by integrating their specific input requirements, such as `token_type_ids`, through the MultiModal framework. It also includes crucial adaptations to the attention and KV cache mechanisms to properly handle non-causal attention and optimize performance for these model types, alongside expanding test coverage.
Highlights
- Encoder Model Support: I've added comprehensive support for encoder-only embedding models, including those that do not use a causal attention mask and require `token_type_ids` for proper tokenization. This expands the range of models that can be efficiently served by vLLM.
- MultiModal Framework Integration: The handling of `token_type_ids` for models like cross-encoders has been integrated into the existing MultiModal arguments framework. This encapsulates the necessary modifications primarily within `bert.py` and `roberta.py`, streamlining the data flow for these specific inputs.
- Attention and KV Cache Adaptations: The attention backend (`FlashAttentionImpl`) now supports `ENCODER_ONLY` attention types and dynamically applies causal masks based on the attention type (a minimal illustration follows this list). Furthermore, the KV cache mechanism has been updated to support prefix caching for `ENCODER_ONLY` models, while disabling advanced features like prefix caching and chunked prefill for general non-decoder attention layers due to current limitations.
- Testing Enhancements: Test coverage for encoder and cross-encoder models has been significantly improved. Many `skip_v1` marks were removed from embedding tests, and an `autouse` fixture was introduced to ensure relevant tests run with both v0 and v1 engines by default.
- Known Limitations: It's important to note that CUDA graphs are currently not supported for pooling operations, leading to a fallback to eager mode. Additionally, for models with non-decoder attention layers, prefix caching and chunked prefill are explicitly disabled.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
Code Review
This PR adds support for encoder embedding models using MultiModal args. The changes include modifications to tests, configurations, and model implementations for BERT and RoBERTa. The code introduces new classes and functionalities for handling token type IDs and multi-modal processing. The changes also include updates to attention backends and KV cache management. The code is generally well-structured, but some areas could benefit from additional comments to clarify the logic and reasoning behind certain decisions.
```python
logger.warning("CUDA graph is not supported for pooling yet, "
               "fallback to the eager mode.")
self.enforce_eager = True
```
This warning message and the subsequent `enforce_eager` setting seem specific to the pooling functionality. Consider encapsulating this logic within the `if self.runner_type == "pooling"` block to avoid unintended side effects for other model types.
```python
if isinstance(self.override_pooler_config, dict):
    self.override_pooler_config = PoolerConfig(
        **self.override_pooler_config)

# WIP: currently cuda graphs are not working for encoder models.
logger.warning("CUDA graph is not supported for pooling yet, "
               "fallback to the eager mode.")
self.enforce_eager = True
```
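A sketch of what the proposed scoping might look like in context (not runnable on its own; it simply moves the quoted lines under the `runner_type` check, which is assumed to exist in the surrounding config code):

```python
if self.runner_type == "pooling":
    if isinstance(self.override_pooler_config, dict):
        self.override_pooler_config = PoolerConfig(
            **self.override_pooler_config)

    # WIP: currently CUDA graphs are not working for encoder models,
    # so the eager fallback only applies to pooling models here.
    logger.warning("CUDA graph is not supported for pooling yet, "
                   "fallback to the eager mode.")
    self.enforce_eager = True
```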
```python
if input_ids is not None and position_ids is not None \
        and inputs_embeds is None and token_type_ids is None:
    token_type_ids = torch.zeros(input_ids.size(),
                                 dtype=torch.long,
                                 device=input_ids.device)
```
This conditional logic seems a bit unusual. It checks whether `input_ids` and `position_ids` are not None while `inputs_embeds` and `token_type_ids` are None, and if so, creates `token_type_ids`. It might be helpful to add a comment explaining the purpose of this check and the scenario it addresses.
Suggested change:

```python
# forward was called directly without going
# through the multi-modal flow
if input_ids is not None and position_ids is not None \
        and inputs_embeds is None and token_type_ids is None:
    # Create token_type_ids if not provided
    token_type_ids = torch.zeros(input_ids.size(),
                                 dtype=torch.long,
                                 device=input_ids.device)
```
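For illustration, the fallback behaves as follows when `forward` is reached without the multi-modal path supplying token type ids (standalone sketch; the tensor values are made up):

```python
import torch

input_ids = torch.tensor([101, 7592, 2088, 102])  # hypothetical flattened token ids
token_type_ids = None                             # nothing supplied upstream

if token_type_ids is None:
    # Default every token to segment 0, matching the fallback above.
    token_type_ids = torch.zeros(input_ids.size(), dtype=torch.long,
                                 device=input_ids.device)

print(token_type_ids)  # tensor([0, 0, 0, 0])
```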
```python
zero_pos = torch.where(position_ids == 0)[0]
end_pos = torch.cat((zero_pos[1:],
                     torch.tensor([position_ids.shape[0]],
                                  device=zero_pos.device)))
seq_lens = end_pos - zero_pos
```
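For context, this snippet recovers per-sequence lengths from flattened `position_ids`, which restart at zero at the start of each packed sequence. A small worked example with made-up values:

```python
import torch

# Three packed sequences of lengths 4, 2 and 3, flattened into one batch.
position_ids = torch.tensor([0, 1, 2, 3, 0, 1, 0, 1, 2])

zero_pos = torch.where(position_ids == 0)[0]                 # tensor([0, 4, 6])
end_pos = torch.cat((zero_pos[1:],
                     torch.tensor([position_ids.shape[0]],
                                  device=zero_pos.device)))  # tensor([4, 6, 9])
seq_lens = end_pos - zero_pos

print(seq_lens)  # tensor([4, 2, 3])
```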
This PR is a follow-up to #16188. It adds support for encoder models which don't have a causal attention mask. It also adds support for models such as cross-encoder/ms-marco-MiniLM-L-6-v2 that require token_type_ids to be passed from the tokenizer to the model. In contrast with PR #19988, here the support for token_type_ids is added using multimodal arguments. The advantage is that all modifications are encapsulated in bert.py and roberta.py. The downside is that the multimodal framework wasn't really meant for this and the implementation is a bit hacky.
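For reviewers who want to exercise the cross-encoder path, a rough usage sketch is below. It assumes vLLM's existing `score` task API; whether any extra flags are needed for the V1 engine path touched by this PR is an assumption, not something stated here.

```python
# Rough usage sketch (assumed API surface: LLM(task="score") and llm.score()).
# token_type_ids are produced by the tokenizer and forwarded to the model
# internally, which is the plumbing this PR routes through multi-modal args.
from vllm import LLM

llm = LLM(model="cross-encoder/ms-marco-MiniLM-L-6-v2", task="score")
outputs = llm.score("What is the capital of France?",
                    ["Paris is the capital of France.",
                     "The Eiffel Tower is in Berlin."])
for out in outputs:
    print(out.outputs.score)
```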