[Model][BugFix] Mamba/Jamba exceed mamba cache slots #11414
Conversation
Since v0.6.4, running requests are no longer capped by `max_num_seqs`, causing an error inside the Mamba cache manager under high load. Setting the max num seqs to twice the max batch size ensures that new requests will have spare space in the Mamba cache manager.

Signed-off-by: mzusman <mor.zusmann@gmail.com>
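To make the failure mode concrete, here is a minimal, hypothetical sketch of a fixed-slot cache manager that runs out of slots once more requests are running than it was sized for (the class and method names are illustrative, not the actual vLLM `MambaCacheManager`):

```python
# Hypothetical illustration only: NOT the vLLM MambaCacheManager, just a
# minimal fixed-slot manager showing why exceeding the slot count crashes.

class MambaCacheSlotError(RuntimeError):
    """Raised when a new request cannot get a free cache slot."""


class FixedSlotCacheManager:
    def __init__(self, num_slots: int) -> None:
        # One pre-allocated state slot per concurrently running sequence.
        self.free_slots = list(range(num_slots))
        self.request_to_slot: dict[str, int] = {}

    def allocate(self, request_id: str) -> int:
        if not self.free_slots:
            # With num_slots == max_num_seqs, any transient overshoot of the
            # running-request count lands here and takes the server down.
            raise MambaCacheSlotError("no free Mamba cache slots left")
        slot = self.free_slots.pop()
        self.request_to_slot[request_id] = slot
        return slot

    def release(self, request_id: str) -> None:
        self.free_slots.append(self.request_to_slot.pop(request_id))


mgr = FixedSlotCacheManager(num_slots=2)
mgr.allocate("req-0")
mgr.allocate("req-1")
try:
    mgr.allocate("req-2")  # third concurrent request exceeds the slot count
except MambaCacheSlotError as err:
    print("crash:", err)
```

Allocating extra slots (e.g. 1.5x or 2x the scheduler's `max_num_seqs`) leaves headroom so a transient overshoot does not immediately hit the error path.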
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
I think we should find a robust solution to this issue. Could you describe the situation that causes this a bit more (e.g. does it only happen for multistep or when running chunked prefill?).
Do you have a command to repro this problem? The commit where this started happening would help nail down exactly what's going on.
This pull request has merge conflicts that must be resolved before it can be merged.
Dumb question, does this mean vLLM no longer respects `max_num_seqs`?
I just noticed that the code to determine the max batch size (used for allocating the Mamba state) is a little sketchy when CUDA graphs are used: `vllm/vllm/model_executor/models/mamba.py`, lines 205 to 213 at `bffddd9`.
Could this be the problem? (I've simplified it in my Mamba2 support PR, #9292.)
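For reference, the logic being discussed amounts to padding the batch size up to a CUDA graph capture size; below is a self-contained sketch of that idea (the capture sizes and helper names are assumptions, not the exact vLLM code):

```python
# Illustrative only: assumed capture sizes and helper names, not the exact vLLM code.
_BATCH_SIZES_TO_CAPTURE = [1, 2, 4] + [8 * i for i in range(1, 33)]


def get_graph_batch_size(batch_size: int) -> int:
    """Round a batch size up to the nearest CUDA-graph capture size."""
    for captured in _BATCH_SIZES_TO_CAPTURE:
        if captured >= batch_size:
            return captured
    return _BATCH_SIZES_TO_CAPTURE[-1]


def mamba_state_max_batch_size(max_num_seqs: int | None) -> int:
    # Pad max_num_seqs up to a captured graph size when the scheduler config is
    # known; otherwise fall back to the largest captured size plus some slack.
    if max_num_seqs is not None:
        return get_graph_batch_size(max_num_seqs)
    return max(_BATCH_SIZES_TO_CAPTURE) + 2


print(mamba_state_max_batch_size(5))     # -> 8
print(mamba_state_max_batch_size(None))  # -> 258
```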
In recent vLLM versions (since v0.6.4), running requests are no longer capped by `scheduler_config.max_num_seqs`, which causes an issue on Jamba/Mamba models under high load. Mamba models keep a state inside the modeling file and define the maximum number of running sequences/slots as `max_num_seqs`; exceeding this number of slots causes an error that crashes vLLM.

To solve that, I've introduced an env var that multiplies the Mamba cache slots by a certain amount (x1.5 by default). The default is x1.5 as I've seen it's quite sufficient in my tests.
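A minimal sketch of the over-allocation described above, assuming a hypothetical env var name (the actual variable name in the PR may differ):

```python
import os

# Hypothetical env var name, used here for illustration only.
MAMBA_CACHE_MULTIPLIER_ENV = "VLLM_MAMBA_CACHE_SLOTS_MULTIPLIER"


def num_mamba_cache_slots(max_num_seqs: int) -> int:
    """Scale the number of Mamba cache slots beyond max_num_seqs so that
    transiently over-scheduled requests can still grab a free slot."""
    multiplier = float(os.getenv(MAMBA_CACHE_MULTIPLIER_ENV, "1.5"))
    return int(max_num_seqs * multiplier)


# e.g. max_num_seqs=256 with the default 1.5x multiplier -> 384 slots
print(num_mamba_cache_slots(256))
```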
CC @tlrmchlsmth