[V1] [Hybrid] Support using float32 for state in Hybrid Models (Mamba2, Mamba1, Minimax) #22928
Conversation
Code Review
This pull request introduces support for using float32 for the Mamba model's state, which is particularly useful for hybrid models. The changes are well-structured, adding a new configuration option mamba_ssm_cache_dtype and plumbing it through the system from argument parsing to the cache and model execution layers. The implementation correctly handles the new data type for both convolutional and SSM states in Mamba layers. The tests have been updated to cover this new functionality. My main feedback is to remove some debugging print statements left in the code.
tomeras91 left a comment:
Looks great! Left a few nit comments
Force-pushed from 6c5269b to b61b5a6
Head branch was pushed to by a user without write access
@tdoublep I think this change broke plamo2 (vLLM V0) and possibly phi4flash too
Oh right, yeah it will be broken. I will fix it - thanks for reporting.
phi4flash too: Error
Purpose
This PR adds support for customizing the dtype of the state stored for models based on mamba/mamba2/linear attention. It introduces a new CLI argument `mamba_ssm_cache_dtype`, which defaults to `auto` but can also be set to `float32`. If set to `float32`, for mamba2 the temporal (SSM) state will be stored in fp32, while the conv state will take whatever value is set by `kv_cache_dtype` (e.g., if left as `auto` it will take the model dtype). For mamba1 the same behaviour is expected, but some kernel changes are needed to enable fp32 state (cc @asafgardin), so for now I raise an error if it is set to `float32`. For linear attention there is only one state tensor, so that case is simpler.

Note: for mamba2 we need to take special care when laying out the two state tensors in GPU memory since they have different dtypes, and the calculation of the page size also needs to change accordingly.
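For reference, a minimal usage sketch (assumptions: the option is exposed through `EngineArgs`, so it can be passed as an `LLM` keyword argument or as a `--mamba-ssm-cache-dtype` CLI flag; the model name below is a placeholder):

```python
# Hypothetical sketch of how the new option would be used; the flag spelling and
# its wiring through EngineArgs are assumptions, not taken from this PR's diff.
from vllm import LLM, SamplingParams

llm = LLM(
    model="my-org/my-hybrid-mamba2-model",  # placeholder: any Mamba2/hybrid checkpoint
    mamba_ssm_cache_dtype="float32",        # keep the temporal (SSM) state in fp32
    # kv_cache_dtype="auto",                # conv state still follows kv_cache_dtype
)
out = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.0, max_tokens=16))
print(out[0].outputs[0].text)

# The equivalent online-serving invocation would be something like:
#   vllm serve my-org/my-hybrid-mamba2-model --mamba-ssm-cache-dtype float32
```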
This PR builds on previous unmerged PRs:
PS This PR is not as big as it looks :) Most of the changes in the modeling code are the same.
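Coming back to the page-size note above, here is a rough, illustrative sketch (not the actual vLLM code; shapes are made up) of why the per-block page size must account for two different dtypes once the conv and SSM states diverge:

```python
# Illustrative only: with mamba2, each layer keeps a conv state and a temporal (SSM)
# state. If the SSM state is promoted to fp32 while the conv state stays in the
# kv-cache dtype, the page size is the sum of two per-block byte counts rather than
# a single num_elements * itemsize.
import math
import torch

def mamba2_state_page_bytes(conv_shape, ssm_shape, conv_dtype, ssm_dtype):
    conv_bytes = math.prod(conv_shape) * torch.empty((), dtype=conv_dtype).element_size()
    ssm_bytes = math.prod(ssm_shape) * torch.empty((), dtype=ssm_dtype).element_size()
    return conv_bytes + ssm_bytes

page = mamba2_state_page_bytes(
    conv_shape=(1024, 3),       # hypothetical (intermediate_size, kernel_size - 1)
    ssm_shape=(32, 64, 128),    # hypothetical (num_heads, head_dim, ssm_state_size)
    conv_dtype=torch.bfloat16,  # follows kv_cache_dtype / model dtype
    ssm_dtype=torch.float32,    # promoted by mamba_ssm_cache_dtype="float32"
)
print(page)  # 1,054,720 bytes per block in this made-up example
```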
Test Plan
I added a new test to `test_hybrid` that verifies that the behaviour for V0 and V1 is the same, and that the output matches HF Transformers. Transformers doesn't actually have support for this yet, so for now I'm hoping that the output is basically the same for a short prompt.

Test Result
The test passes. I've also tested a lot of other models locally but don't want to bloat the CI.
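The check in the new test is roughly of this shape (illustrative sketch only, not the actual CI test; the model name is a placeholder and the `mamba_ssm_cache_dtype` keyword wiring is an assumption):

```python
# Sketch: greedy-decode a short prompt with vLLM (fp32 SSM state) and compare
# against a plain HF Transformers reference, which has no fp32-state option yet.
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "my-org/my-hybrid-mamba2-model"  # placeholder checkpoint
PROMPT = "The capital of France is"

vllm_out = LLM(model=MODEL, mamba_ssm_cache_dtype="float32").generate(
    [PROMPT], SamplingParams(temperature=0.0, max_tokens=16))[0].outputs[0].text

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
ids = tok(PROMPT, return_tensors="pt").input_ids
gen = model.generate(ids, max_new_tokens=16, do_sample=False)
hf_out = tok.decode(gen[0, ids.shape[1]:], skip_special_tokens=True)

assert vllm_out.strip() == hf_out.strip()
```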
(Optional) Documentation Update
closes #22544
closes #22913