Conversation

@LucasWilkinson (Collaborator) commented Sep 23, 2025

Cutlass MLA has a block_size of 128, so following #25290 this would trip the assert since the default max_num_seqs is 1024.

Also make sure we capture the size of the up-projection in the profile run (this was in the v0 backend but failed to make it to v1).

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
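The guard described in the PR description can be sketched roughly as follows. This is an illustrative sketch, not the exact vLLM code; the function name `min_workspace_size` is hypothetical, while the variable names mirror the diff quoted below.

```python
# Illustrative sketch of the fix: instead of asserting that the profiled
# workspace is large enough, grow it so that every request can hold at
# least one KV-cache page (block_size tokens per request).

def min_workspace_size(profiled_size: int, max_num_seqs: int, block_size: int) -> int:
    # Ensure room for at least one page per request.
    return max(profiled_size, max_num_seqs * block_size)

# Cutlass MLA uses block_size = 128; with the default max_num_seqs = 1024
# the old assert would trip whenever profiled_size < 128 * 1024.
print(min_workspace_size(64 * 1024, 1024, 128))  # -> 131072
```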
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request addresses an assertion failure in the Multi-head Latent Attention (MLA) implementation by replacing a strict assert with a max() call to ensure a minimum workspace size. This change makes the code more robust against certain configurations. However, this fix introduces a critical issue: the newly calculated chunked_prefill_workspace_size may no longer be divisible by dcp_world_size, which is a requirement enforced by a subsequent assertion and is crucial for the correctness of distributed computations. I have provided a comment detailing the issue and a suggested fix to prevent this new potential crash.

Comment on lines 486 to 488
self.chunked_prefill_workspace_size = max(
    self.chunked_prefill_workspace_size,
    scheduler_config.max_num_seqs * cache_config.block_size)

critical

This change correctly replaces the assert with max() to ensure a minimum workspace size. However, this can introduce a new critical issue.

The chunked_prefill_workspace_size is required to be divisible by dcp_world_size when dcp_world_size > 1, as enforced by an assert on line 495. This is necessary for the correctness of the all-gather workspace calculation later on (line 1368).

Your change can result in chunked_prefill_workspace_size not being a multiple of dcp_world_size, which will cause the assertion on line 495 to fail.

For example, if dcp_world_size = 3 and scheduler_config.max_num_seqs * cache_config.block_size is not a multiple of 3, this will lead to a crash.

To fix this, you should round up chunked_prefill_workspace_size to the nearest multiple of dcp_world_size when dcp_world_size > 1.

A possible fix would be to add the following inside the if self.dcp_world_size > 1: block, before the assertion at line 495:

self.chunked_prefill_workspace_size = cdiv(
    self.chunked_prefill_workspace_size, self.dcp_world_size
) * self.dcp_world_size
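
The suggested round-up can be sketched as below. `cdiv` is ceiling division as commonly defined in vLLM's utils; the wrapper function name `round_up_workspace` is hypothetical and only illustrates the behavior of the proposed fix.

```python
# Sketch of the suggested fix: round the workspace size up to the next
# multiple of dcp_world_size so the later divisibility assert holds.

def cdiv(a: int, b: int) -> int:
    # Ceiling division without floating point.
    return -(a // -b)

def round_up_workspace(size: int, dcp_world_size: int) -> int:
    if dcp_world_size > 1:
        size = cdiv(size, dcp_world_size) * dcp_world_size
    return size

# e.g. dcp_world_size = 3 with a size that is not a multiple of 3:
print(round_up_workspace(131072, 3))  # -> 131073, divisible by 3
```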

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Comment on lines 485 to 488
# Enforce that we have enough for at least 1 page per request
self.chunked_prefill_workspace_size = max(
    self.chunked_prefill_workspace_size,
    scheduler_config.max_num_seqs * cache_config.block_size)
Member

In the cutlass MLA case block_size = 128; if max_num_seqs == 1024, the workspace is back to 128 * 1024 again, so it seems we will hit the OOM issue again?
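
The concern above is simple arithmetic; a rough sketch (only the block size and sequence count come from this thread, everything else is commentary):

```python
# Back-of-the-envelope check of the OOM concern: with cutlass MLA's page
# size and the default max_num_seqs, the max() floor alone pushes the
# workspace back to the same token count that previously caused OOM.

block_size = 128      # cutlass MLA page size
max_num_seqs = 1024   # vLLM default
workspace_tokens = max_num_seqs * block_size
print(workspace_tokens)  # -> 131072
```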

@smarterclayton (Contributor)

This allowed me to start vLLM with DeepSeek V3.1 again (DP=16, B200)

@simon-mo previously approved these changes Sep 23, 2025
@simon-mo simon-mo dismissed their stale review September 23, 2025 15:28

Deferring to Wentao

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 23, 2025
@robertgshaw2-redhat robertgshaw2-redhat merged commit 9df8da5 into main Sep 24, 2025
56 checks passed
@robertgshaw2-redhat robertgshaw2-redhat deleted the lwilkinson/fix-mla-assert branch September 24, 2025 01:09
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
gjc0824 pushed a commit to gjc0824/vllm that referenced this pull request Oct 10, 2025
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: gaojc <1055866782@qq.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>

Labels

ready (ONLY add when PR is ready to merge/full CI is needed), v1


7 participants