[Performance] Remove input pads in cutlass_mla and optimize v_proj output handling #25184
Conversation
Code Review
This pull request introduces performance optimizations for MLA by pre-padding inputs and optimizing the v_proj output handling. The changes are well-aligned with the performance improvement goals. I have identified two areas for improvement. First, in vllm/v1/attention/backends/mla/common.py, the _v_up_proj method uses an unsafe resize_ on a tensor view, which should be refactored for safety and clarity. Second, vllm/v1/attention/backends/mla/cutlass_mla.py contains temporary, commented-out code that should be removed before merging to improve maintainability. Addressing these points will enhance the code's quality and robustness.
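For context on the `resize_` concern, here is a minimal sketch of the hazard (not the actual `_v_up_proj` code; the tensor names `base` and `flat` are hypothetical):

```python
import torch

base = torch.empty(16, 128, 512)
flat = base.view(16, 128 * 512)  # a view sharing base's storage

# Unsafe pattern: resize_() rewrites the view's shape metadata and may
# grow the shared storage, silently invalidating `base` (and tripping
# autograd checks when gradients are involved).
# flat.resize_(16, 64, 1024)

# Safer: express shape changes as views, which never touch storage.
reshaped = base.view(16, 64, 1024)

# Or, if a genuinely different size is needed, allocate a fresh buffer.
fresh = torch.empty(16, 64, 1024, dtype=base.dtype, device=base.device)
```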
This pull request has merge conflicts that must be resolved before it can be merged.
Looks good to me, just one nit on types
This PR removes the need to pad the cutlass_mla inputs (q_nope and q_pe) to max_heads==128 by pre-padding the buffers used by the preceding operations. It also improves v_proj output handling by reusing the final output buffer earlier, passing it directly into torch.bmm. For DeepSeek-R1 on 8xB200 with batch_size==32, decode-iteration TPOT improves from 18.87 ms to 18.25 ms, about 3.3%.
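As an illustration of the output-buffer reuse described above, here is a minimal sketch of the `out=` pattern with made-up shapes; the names (`attn_out`, `w_uv`, etc.) are illustrative, not the PR's actual `_v_up_proj` code:

```python
import torch

num_heads, num_tokens = 128, 32
kv_lora_rank, v_head_dim = 512, 128

# Per-head attention output and per-head V up-projection weights.
attn_out = torch.empty(num_heads, num_tokens, kv_lora_rank)
w_uv = torch.empty(num_heads, kv_lora_rank, v_head_dim)

# Allocate the final (num_tokens, num_heads * v_head_dim) output once and
# hand a transposed view of it to torch.bmm via out=, so the batched
# matmul writes its result in place instead of producing a temporary
# that must later be transposed and copied into the output tensor.
output = torch.empty(num_tokens, num_heads * v_head_dim)
out_view = output.view(num_tokens, num_heads, v_head_dim).transpose(0, 1)
torch.bmm(attn_out, w_uv, out=out_view)  # fills `output` directly
```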
Verified correctness with: `lm_eval --model vllm --model_args pretrained=deepseek-ai/DeepSeek-R1-0528,tensor_parallel_size=8 --tasks gsm8k --num_fewshot 5 --batch_size auto`