remove attn output view kernel #26680
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Boyuan Feng <boyuan@meta.com>
Force-pushed from dcbcac6 to 8e7c62a
Code Review
This pull request optimizes attention output tensor allocation by using torch.empty instead of torch.zeros, avoiding an unnecessary kernel launch and improving performance. A new configuration init_attn_out is introduced to maintain backward compatibility for attention backends that require zero-initialized output tensors. The changes are logical and well-motivated. I have one suggestion to improve code clarity and maintainability in vllm/attention/layer.py by refactoring the output tensor shape calculation to avoid variable reuse, which can be error-prone.
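For illustration, a minimal sketch of the allocation change described above, assuming a flag-gated helper; the function name, parameters, and the `init_attn_out` wiring here are simplified stand-ins, not the actual vLLM API:

```python
import torch

def alloc_attn_output(num_tokens: int, hidden_size: int, dtype: torch.dtype,
                      device: torch.device, init_attn_out: bool = False) -> torch.Tensor:
    # torch.zeros pays for an extra fill kernel on top of the allocation;
    # torch.empty only reserves memory and leaves the contents undefined.
    if init_attn_out:
        # Opt-in path for backends that expect a zero-initialized output buffer.
        return torch.zeros(num_tokens, hidden_size, dtype=dtype, device=device)
    return torch.empty(num_tokens, hidden_size, dtype=dtype, device=device)

# Example: a 16-token, 4096-wide output buffer on CPU (shapes are arbitrary here).
out = alloc_attn_output(16, 4096, torch.float16, torch.device("cpu"))
```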
Signed-off-by: Boyuan Feng <boyuan@meta.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Boyuan Feng <fby.1994@gmail.com>
cc @WoosukKwon I think you added the zero-init attention?
I think it was my mistake that we zero-initialize the buffer for every forward pass. My intent was to do it only when the whole attention operation is skipped, like in the profiling run. I think we can move this (vllm/v1/attention/backends/flash_attn.py, lines 488 to 490 at 314285d).
I think it's still important to keep the buffers free of NaNs, because some kernels can fail on them. I previously ran into this with a custom MoE kernel whose top-k and routing kernels don't handle NaNs well.
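As a concrete, if contrived, illustration of that concern: torch.empty leaves the buffer's contents undefined, and any NaN in it poisons downstream reductions, which is the failure mode described for the routing kernel. The snippet below plants the NaN explicitly to make the effect deterministic; it is not vLLM code:

```python
import torch

# torch.empty gives uninitialized memory; its contents can be any bit pattern,
# including NaN. Plant one explicitly so the effect is reproducible.
router_logits = torch.zeros(2, 8)
router_logits[0, 3] = float("nan")

# A single NaN poisons the whole softmax row, so the routing weights for that
# token become garbage while other tokens look fine.
weights = torch.softmax(router_logits, dim=-1)
print(weights[0])  # all nan
print(weights[1])  # well-defined: uniform 0.125
```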
Signed-off-by: Boyuan Feng <boyuan@meta.com>
@WoosukKwon thanks for the info! I added the init_attn_out option so backends that need a zero-initialized output buffer keep the old behavior.
Can you just check the attention fusion pass? I think it might rely on the fact that we use zeros for the attention output.
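For context on why the fusion pass might care: under torch.compile the allocation shows up as a node in the captured graph, so a pass that pattern-matches on a zeros op would stop matching once the allocation becomes empty. A small, vLLM-agnostic FX trace that shows the recorded op changing (illustrative only, not the actual pass):

```python
import torch
from torch.fx import symbolic_trace

def alloc_with_zeros(x: torch.Tensor) -> torch.Tensor:
    out = torch.zeros_like(x)   # the node a zeros-based pattern would match
    return out + x

def alloc_with_empty(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)   # same shape, but a different node in the graph
    return out.copy_(x)

for fn in (alloc_with_zeros, alloc_with_empty):
    nodes = [(n.op, str(n.target)) for n in symbolic_trace(fn).graph.nodes]
    print(fn.__name__, nodes)
```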
Signed-off-by: Boyuan Feng <boyuan@meta.com> Signed-off-by: Boyuan Feng <fby.1994@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Jonah Bernard <jb2528@cornell.edu>
Signed-off-by: Boyuan Feng <boyuan@meta.com> Signed-off-by: Boyuan Feng <fby.1994@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: bbartels <benjamin@bartels.dev>
Signed-off-by: Boyuan Feng <boyuan@meta.com> Signed-off-by: Boyuan Feng <fby.1994@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Boyuan Feng <boyuan@meta.com> Signed-off-by: Boyuan Feng <fby.1994@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Signed-off-by: Boyuan Feng <boyuan@meta.com> Signed-off-by: Boyuan Feng <fby.1994@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
Before this PR, the attention output is allocated, zero-initialized (via torch.zeros), and viewed into shape before the output tensor is used by any other op. This shows up as a triton kernel of ~1 us latency, on par with a rope/layer-norm kernel (~1.6 us).

This PR switches the allocation to torch.empty, which only allocates the tensor and does not initialize it. The allocation itself is absorbed by cudagraph, so it is effectively free. As a result, this PR removes the attn_out_view kernel at the end of this qwen3-0.6b trace.

See #26682 (comment) for the perf win.
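A minimal before/after sketch of what the description above means in code; shapes, dtype, and device are arbitrary placeholders:

```python
import torch

num_tokens, num_heads, head_size = 256, 16, 128
dtype, device = torch.float16, torch.device("cpu")

# Before: zeros + view. The zero-fill (fused with the view under torch.compile)
# costs a ~1 us kernel per layer even though attention overwrites the buffer anyway.
out = torch.zeros(num_tokens, num_heads * head_size, dtype=dtype, device=device)
out = out.view(num_tokens, num_heads, head_size)

# After: empty + view. Only memory is reserved; with cudagraphs the allocation is
# captured once, so no per-step kernel is left for the output buffer.
out = torch.empty(num_tokens, num_heads * head_size, dtype=dtype, device=device)
out = out.view(num_tokens, num_heads, head_size)
```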