[Bugfix][Qwen3-Next] fixes the varlen issue in qwen3-next's MTP implementation. #24957
Conversation
Code Review
This pull request fixes an issue with variable-length sequences in Qwen3-Next's multi-token prediction implementation, particularly for speculative decoding rollbacks. The changes span the causal convolution Triton kernel, the Qwen3-Next model file, and the GatedDeltaNet attention backend. While the core varlen handling looks correct, I've identified a critical issue in the attention backend related to the CUDA graph batch size calculation, and a high-severity issue in the Triton kernel where constexpr parameters receive runtime values, which could cause severe performance degradation or compilation errors.
Changing self.decode_cudagraph_max_bs to a token count (by multiplying by self.num_spec + 1) is incorrect, as this variable is used as a sequence count (batch size) for tensor allocations. For example, self.spec_state_indices_tensor is allocated with it as its first dimension (line 80), and that dimension is indexed by sequence, not by token. This change will lead to incorrect tensor allocations (either too large, wasting memory, or too small, causing out-of-bounds errors) and likely runtime failures.
To fix this correctly, decode_cudagraph_max_bs should remain a sequence count. A new variable should be introduced for the maximum token count if needed for the check at line 221.
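To make the suggestion concrete, here is a minimal sketch of splitting the sequence bound from a token bound. Apart from decode_cudagraph_max_bs, num_spec, and spec_state_indices_tensor, the names and values below (e.g. decode_cudagraph_max_num_tokens, max_num_seqs, the eligibility check) are hypothetical and not taken from the vLLM source:

```python
import torch

# Hypothetical stand-ins for the backend's configuration.
max_num_seqs = 256   # scheduler batch-size limit, in sequences
num_spec = 2         # speculative (draft) tokens per sequence

# Keep the CUDA graph limit as a *sequence* count; per-request buffers
# such as spec_state_indices_tensor are indexed by sequence, not token.
decode_cudagraph_max_bs = max_num_seqs
spec_state_indices_tensor = torch.empty(
    (decode_cudagraph_max_bs, num_spec + 1), dtype=torch.int32
)

# Separate *token* bound for checks against the number of scheduled
# tokens: each decode sequence contributes up to num_spec + 1 tokens.
decode_cudagraph_max_num_tokens = decode_cudagraph_max_bs * (num_spec + 1)

# Sketch of a graph-eligibility check using both bounds.
num_decodes = 8
num_decode_tokens = num_decodes * (num_spec + 1)
use_cudagraph = (num_decodes <= decode_cudagraph_max_bs
                 and num_decode_tokens <= decode_cudagraph_max_num_tokens)
```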
The kernel parameters state_len and seqlen are declared as tl.constexpr in the function signature (lines 635 and 634 respectively), but they are being reassigned here. constexpr values are meant to be compile-time constants and should not be modified at runtime. Passing runtime values to tl.constexpr parameters causes Triton to recompile the kernel for each unique value, which can lead to significant performance degradation and long compilation times. This reassignment is also confusing and can lead to unexpected behavior.
To fix this, change their type hints in the kernel signature to int. Additionally, for clarity and to avoid mutating input parameters, assign the updated values to new local variables instead of overwriting state_len and seqlen.
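A standalone sketch of that pattern (this is not the causal_conv1d kernel; the kernel name, body, and launch values are illustrative only): runtime-varying lengths are passed as plain integers, the genuine block size stays tl.constexpr, and the adjusted length goes into a new local name rather than overwriting a parameter.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _copy_tail_kernel(
    x_ptr,
    out_ptr,
    seqlen,                # runtime value: plain integer, not tl.constexpr
    state_len,             # runtime value: plain integer, not tl.constexpr
    BLOCK: tl.constexpr,   # genuine compile-time constant
):
    pid = tl.program_id(0)
    # Derive the adjusted length into a *new* local name rather than
    # reassigning the kernel parameter.
    effective_state_len = tl.minimum(state_len, seqlen)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < effective_state_len
    # Copy the last `effective_state_len` elements of x into out.
    vals = tl.load(x_ptr + seqlen - effective_state_len + offs, mask=mask)
    tl.store(out_ptr + offs, vals, mask=mask)


x = torch.arange(10, device="cuda", dtype=torch.float32)
out = torch.empty(4, device="cuda", dtype=torch.float32)
_copy_tail_kernel[(1,)](x, out, seqlen=10, state_len=4, BLOCK=16)
```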
…mentation. Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Force-pushed from 50380ad to 61bb8b8
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com> (cherry picked from commit 8b83d23259ac24ec1f3e5e012da0c997a90031d8)
…mentation

Fixes CUDA illegal memory access errors during Qwen3-Next speculative decoding by implementing proper varlen sequence handling and CUDA graph batch size fixes.

Key changes from upstream PR vllm-project#24957:
- Enhanced GDNAttentionMetadata with num_actual_tokens field
- Fixed CUDA graph batch size calculation for speculative decoding scenarios
- Added varlen sequence support to causal_conv1d operations
- Improved token accounting across MTP verification paths

Resolves issues with:
- Multi-token prediction verification with unaligned speculative tokens
- Variable-length sequence processing in continuous batching
- CUDA memory allocation errors in graph capture

Co-authored-by: upstream contributors from PR vllm-project#24957
Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com>
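For readers following along, a rough sketch of the metadata change mentioned above; only GDNAttentionMetadata and num_actual_tokens come from the commit message, and the remaining fields are assumptions that do not reflect the exact vLLM definition:

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class GDNAttentionMetadata:
    # Per-category request counts (field names here are assumed).
    num_prefills: int
    num_decodes: int
    num_spec_decodes: int
    # New field from this fix: the number of real (non-padding) tokens in
    # the batch, so varlen paths can ignore CUDA graph padding tokens.
    num_actual_tokens: int
    # Cumulative query offsets for variable-length sequences (assumed).
    query_start_loc: Optional[torch.Tensor] = None
```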
Nice!!!! @sighingnow
@chaunceyjiang Thanks for the help to verify and the feedback!
thanks for the fix!
…mentation. (vllm-project#24957) Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
…mentation. (vllm-project#24957) Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com> Signed-off-by: charlifu <charlifu@amd.com>
…mentation. (vllm-project#24957) Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
This PR fixes the corner cases where the guided decoding backend rolls back draft tokens, causing unaligned verify batches.
Fixes #24730.
Fixes #24881.
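To illustrate the failure mode being fixed (the numbers and construction below are illustrative, not taken from the vLLM code): once some requests have draft tokens rolled back, the verify batch no longer contains a uniform num_spec + 1 tokens per request, so the kernels must consume per-request cumulative offsets instead of assuming a fixed stride.

```python
import torch

num_spec = 2
# Tokens each request contributes to the verify batch after rollback:
# request 0 kept both drafts (3 tokens), request 1 lost one draft
# (2 tokens), request 2 lost both drafts (1 token).
tokens_per_req = torch.tensor([3, 2, 1], dtype=torch.int32)

# cu_seqlens / query_start_loc-style boundaries that varlen kernels use
# instead of a fixed stride of num_spec + 1 tokens per request.
query_start_loc = torch.zeros(tokens_per_req.numel() + 1, dtype=torch.int32)
query_start_loc[1:] = torch.cumsum(tokens_per_req, dim=0)
print(query_start_loc.tolist())  # [0, 3, 5, 6]
```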