-
-
Couldn't load subscription status.
- Fork 10.9k
[BugFix] Fix MLA assert with CUTLASS MLA #25478
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request addresses an assertion failure in the Multi-Level Attention (MLA) implementation by replacing a strict assert with a max() function to ensure a minimum workspace size. This change makes the code more robust against certain configurations. However, this fix introduces a critical issue: the newly calculated chunked_prefill_workspace_size may no longer be divisible by dcp_world_size, which is a requirement enforced by a subsequent assertion and is crucial for the correctness of distributed computations. I have provided a comment detailing the issue and a suggested fix to prevent this new potential crash.
| self.chunked_prefill_workspace_size = max( | ||
| self.chunked_prefill_workspace_size, | ||
| scheduler_config.max_num_seqs * cache_config.block_size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change correctly replaces the assert with max() to ensure a minimum workspace size. However, this can introduce a new critical issue.
The chunked_prefill_workspace_size is required to be divisible by dcp_world_size when dcp_world_size > 1, as enforced by an assert on line 495. This is necessary for the correctness of the all-gather workspace calculation later on (line 1368).
Your change can result in chunked_prefill_workspace_size not being a multiple of dcp_world_size, which will cause the assertion on line 495 to fail.
For example, if dcp_world_size = 3 and scheduler_config.max_num_seqs * cache_config.block_size is not a multiple of 3, this will lead to a crash.
To fix this, you should round up chunked_prefill_workspace_size to the nearest multiple of dcp_world_size when dcp_world_size > 1.
A possible fix would be to add the following inside the if self.dcp_world_size > 1: block, before the assertion at line 495:
self.chunked_prefill_workspace_size = cdiv(
self.chunked_prefill_workspace_size, self.dcp_world_size
) * self.dcp_world_size| # Enforce that we enough for at least 1 page per request | ||
| self.chunked_prefill_workspace_size = max( | ||
| self.chunked_prefill_workspace_size, | ||
| scheduler_config.max_num_seqs * cache_config.block_size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In cutlass MLA case block size = 128, if max_num_seqs == 1024 it is back to 128 * 1024 again, so seems that we will meet an OOM issue again?
|
This allowed me to start a vllm using deepseek v3.1 again (DP=16, B200) |
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: gaojc <1055866782@qq.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Cutlass MLA has a block_size of 128 so following #25290 this would assert since default
max_num_seqsis 1024also make sure we capture the size of the up-projection in the profile run (this was in v0 backend but failed to make it to v0)