remove attn output view kernel #26680
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Boyuan Feng <boyuan@meta.com>
Force-pushed from dcbcac6 to 8e7c62a
Code Review
This pull request optimizes attention output tensor allocation by using torch.empty instead of torch.zeros, avoiding an unnecessary kernel launch and improving performance. A new configuration init_attn_out is introduced to maintain backward compatibility for attention backends that require zero-initialized output tensors. The changes are logical and well-motivated. I have one suggestion to improve code clarity and maintainability in vllm/attention/layer.py by refactoring the output tensor shape calculation to avoid variable reuse, which can be error-prone.
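For illustration, a minimal sketch of the allocation change described above, assuming a flag-gated helper; the function name, parameters, and the `init_attn_out` wiring here are simplified stand-ins, not the actual vLLM API:

```python
import torch

def alloc_attn_output(num_tokens: int, hidden_size: int, dtype: torch.dtype,
                      device: torch.device, init_attn_out: bool = False) -> torch.Tensor:
    # torch.zeros pays for an extra fill kernel on top of the allocation;
    # torch.empty only reserves memory and leaves the contents undefined.
    if init_attn_out:
        # Opt-in path for backends that expect a zero-initialized output buffer.
        return torch.zeros(num_tokens, hidden_size, dtype=dtype, device=device)
    return torch.empty(num_tokens, hidden_size, dtype=dtype, device=device)

# Example: a 16-token, 4096-wide output buffer on CPU (shapes are arbitrary here).
out = alloc_attn_output(16, 4096, torch.float16, torch.device("cpu"))
```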
Signed-off-by: Boyuan Feng <boyuan@meta.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Boyuan Feng <fby.1994@gmail.com>
cc @WoosukKwon I think you added the zero-init attention?
I think it was my mistake that we zero-initialize the buffer for every forward pass. My intent was to do it only when the whole attention operation is skipped, like in the profiling run. I think we can move this (vllm/v1/attention/backends/flash_attn.py, lines 488 to 490 at 314285d).
I think it's still important to keep the buffers free of NaNs, because some kernels can fail on them. I previously ran into this with a custom MoE kernel whose top-k and routing kernels don't handle NaNs well.
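As a concrete, if contrived, illustration of that concern: torch.empty leaves the buffer's contents undefined, and any NaN in it poisons downstream reductions, which is the failure mode described for the routing kernel. The snippet below plants the NaN explicitly to make the effect deterministic; it is not vLLM code:

```python
import torch

# torch.empty gives uninitialized memory; its contents can be any bit pattern,
# including NaN. Plant one explicitly so the effect is reproducible.
router_logits = torch.zeros(2, 8)
router_logits[0, 3] = float("nan")

# A single NaN poisons the whole softmax row, so the routing weights for that
# token become garbage while other tokens look fine.
weights = torch.softmax(router_logits, dim=-1)
print(weights[0])  # all nan
print(weights[1])  # well-defined: uniform 0.125
```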
Signed-off-by: Boyuan Feng <boyuan@meta.com>
@WoosukKwon thanks for the info! I added the init_attn_out option so backends that need a zero-initialized output buffer keep the old behavior.
Can you just check the attention fusion pass? I think it might rely on the fact that we use zeros for the attention output.
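For context on why the fusion pass might care: under torch.compile the allocation shows up as a node in the captured graph, so a pass that pattern-matches on a zeros op would stop matching once the allocation becomes empty. A small, vLLM-agnostic FX trace that shows the recorded op changing (illustrative only, not the actual pass):

```python
import torch
from torch.fx import symbolic_trace

def alloc_with_zeros(x: torch.Tensor) -> torch.Tensor:
    out = torch.zeros_like(x)   # the node a zeros-based pattern would match
    return out + x

def alloc_with_empty(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)   # same shape, but a different node in the graph
    return out.copy_(x)

for fn in (alloc_with_zeros, alloc_with_empty):
    nodes = [(n.op, str(n.target)) for n in symbolic_trace(fn).graph.nodes]
    print(fn.__name__, nodes)
```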
Signed-off-by: Boyuan Feng <boyuan@meta.com> Signed-off-by: Boyuan Feng <fby.1994@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Jonah Bernard <jb2528@cornell.edu>
Signed-off-by: Boyuan Feng <boyuan@meta.com> Signed-off-by: Boyuan Feng <fby.1994@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: bbartels <benjamin@bartels.dev>
Signed-off-by: Boyuan Feng <boyuan@meta.com> Signed-off-by: Boyuan Feng <fby.1994@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Boyuan Feng <boyuan@meta.com> Signed-off-by: Boyuan Feng <fby.1994@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Signed-off-by: Boyuan Feng <boyuan@meta.com> Signed-off-by: Boyuan Feng <fby.1994@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
Before this PR, the attention output is allocated, zero-initialized (via torch.zeros), and viewed into shape before the output tensor is used by any other op. This shows up as a triton kernel of ~1 us latency, on par with a rope/layer-norm kernel (~1.6 us).

This PR switches the allocation to torch.empty, which only allocates the tensor and does not initialize it. The allocation itself is absorbed by cudagraph, so it is effectively free. As a result, this PR removes the attn_out_view kernel at the end of this qwen3-0.6b trace.

See #26682 (comment) for the perf win.
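A minimal before/after sketch of what the description above means in code; shapes, dtype, and device are arbitrary placeholders:

```python
import torch

num_tokens, num_heads, head_size = 256, 16, 128
dtype, device = torch.float16, torch.device("cpu")

# Before: zeros + view. The zero-fill (fused with the view under torch.compile)
# costs a ~1 us kernel per layer even though attention overwrites the buffer anyway.
out = torch.zeros(num_tokens, num_heads * head_size, dtype=dtype, device=device)
out = out.view(num_tokens, num_heads, head_size)

# After: empty + view. Only memory is reserved; with cudagraphs the allocation is
# captured once, so no per-step kernel is left for the output buffer.
out = torch.empty(num_tokens, num_heads * head_size, dtype=dtype, device=device)
out = out.view(num_tokens, num_heads, head_size)
```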