[Bugfix][Qwen3-Next] fixes the varlen issue in qwen3-next's MTP implementation. #24957
Conversation
Code Review
This pull request fixes an issue with variable-length sequences in Qwen3-Next's multi-token prediction implementation, particularly for speculative decoding rollbacks. The changes span the causal convolution Triton kernel, the Qwen3-Next model file, and the GatedDeltaNet attention backend. While the core varlen handling looks correct, I've identified a critical issue in the attention backend related to the CUDA graph batch size calculation, and a high-severity issue in the Triton kernel where constexpr parameters receive runtime values, which could cause severe performance degradation or compilation errors.
Changing self.decode_cudagraph_max_bs to a token count (by multiplying by self.num_spec + 1) is incorrect, as this variable is used as a sequence count (batch size) for tensor allocations. For example, self.spec_state_indices_tensor is allocated with it as its first dimension (line 80), and that dimension is indexed by sequence, not by token. This change will lead to incorrect tensor allocations (either too large, wasting memory, or too small, causing out-of-bounds errors) and likely runtime failures.
To fix this correctly, decode_cudagraph_max_bs should remain a sequence count. A new variable should be introduced for the maximum token count if needed for the check at line 221.
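To make the suggestion concrete, here is a minimal sketch of splitting the sequence bound from a token bound. Apart from decode_cudagraph_max_bs, num_spec, and spec_state_indices_tensor, the names and values below (e.g. decode_cudagraph_max_num_tokens, max_num_seqs, the eligibility check) are hypothetical and not taken from the vLLM source:

```python
import torch

# Hypothetical stand-ins for the backend's configuration.
max_num_seqs = 256   # scheduler batch-size limit, in sequences
num_spec = 2         # speculative (draft) tokens per sequence

# Keep the CUDA graph limit as a *sequence* count; per-request buffers
# such as spec_state_indices_tensor are indexed by sequence, not token.
decode_cudagraph_max_bs = max_num_seqs
spec_state_indices_tensor = torch.empty(
    (decode_cudagraph_max_bs, num_spec + 1), dtype=torch.int32
)

# Separate *token* bound for checks against the number of scheduled
# tokens: each decode sequence contributes up to num_spec + 1 tokens.
decode_cudagraph_max_num_tokens = decode_cudagraph_max_bs * (num_spec + 1)

# Sketch of a graph-eligibility check using both bounds.
num_decodes = 8
num_decode_tokens = num_decodes * (num_spec + 1)
use_cudagraph = (num_decodes <= decode_cudagraph_max_bs
                 and num_decode_tokens <= decode_cudagraph_max_num_tokens)
```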
The kernel parameters state_len and seqlen are declared as tl.constexpr in the function signature (lines 635 and 634 respectively), but they are being reassigned here. constexpr values are meant to be compile-time constants and should not be modified at runtime. Passing runtime values to tl.constexpr parameters causes Triton to recompile the kernel for each unique value, which can lead to significant performance degradation and long compilation times. This reassignment is also confusing and can lead to unexpected behavior.
To fix this, change their type hints in the kernel signature to int. Additionally, for clarity and to avoid mutating input parameters, assign the updated values to new local variables instead of overwriting state_len and seqlen.
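A standalone sketch of that pattern (this is not the causal_conv1d kernel; the kernel name, body, and launch values are illustrative only): runtime-varying lengths are passed as plain integers, the genuine block size stays tl.constexpr, and the adjusted length goes into a new local name rather than overwriting a parameter.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _copy_tail_kernel(
    x_ptr,
    out_ptr,
    seqlen,                # runtime value: plain integer, not tl.constexpr
    state_len,             # runtime value: plain integer, not tl.constexpr
    BLOCK: tl.constexpr,   # genuine compile-time constant
):
    pid = tl.program_id(0)
    # Derive the adjusted length into a *new* local name rather than
    # reassigning the kernel parameter.
    effective_state_len = tl.minimum(state_len, seqlen)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < effective_state_len
    # Copy the last `effective_state_len` elements of x into out.
    vals = tl.load(x_ptr + seqlen - effective_state_len + offs, mask=mask)
    tl.store(out_ptr + offs, vals, mask=mask)


x = torch.arange(10, device="cuda", dtype=torch.float32)
out = torch.empty(4, device="cuda", dtype=torch.float32)
_copy_tail_kernel[(1,)](x, out, seqlen=10, state_len=4, BLOCK=16)
```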
…mentation. Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Force-pushed from 50380ad to 61bb8b8
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com> (cherry picked from commit 8b83d23259ac24ec1f3e5e012da0c997a90031d8)
…mentation

Fixes CUDA illegal memory access errors during Qwen3-Next speculative decoding by implementing proper varlen sequence handling and CUDA graph batch size fixes.

Key changes from upstream PR vllm-project#24957:
- Enhanced GDNAttentionMetadata with num_actual_tokens field
- Fixed CUDA graph batch size calculation for speculative decoding scenarios
- Added varlen sequence support to causal_conv1d operations
- Improved token accounting across MTP verification paths

Resolves issues with:
- Multi-token prediction verification with unaligned speculative tokens
- Variable-length sequence processing in continuous batching
- CUDA memory allocation errors in graph capture

Co-authored-by: upstream contributors from PR vllm-project#24957
Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com>
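For readers following along, a rough sketch of the metadata change mentioned above; only GDNAttentionMetadata and num_actual_tokens come from the commit message, and the remaining fields are assumptions that do not reflect the exact vLLM definition:

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class GDNAttentionMetadata:
    # Per-category request counts (field names here are assumed).
    num_prefills: int
    num_decodes: int
    num_spec_decodes: int
    # New field from this fix: the number of real (non-padding) tokens in
    # the batch, so varlen paths can ignore CUDA graph padding tokens.
    num_actual_tokens: int
    # Cumulative query offsets for variable-length sequences (assumed).
    query_start_loc: Optional[torch.Tensor] = None
```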
Nice!!!! @sighingnow
@chaunceyjiang Thanks for the help to verify and the feedback!
thanks for the fix!
…mentation. (vllm-project#24957) Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
…mentation. (vllm-project#24957) Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com> Signed-off-by: charlifu <charlifu@amd.com>
…mentation. (vllm-project#24957) Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
This PR fixes the corner cases where the guided decoding backend rolls back draft tokens, causing unaligned verify batches.
Fixes #24730.
Fixes #24881.
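To illustrate the failure mode being fixed (the numbers and construction below are illustrative, not taken from the vLLM code): once some requests have draft tokens rolled back, the verify batch no longer contains a uniform num_spec + 1 tokens per request, so the kernels must consume per-request cumulative offsets instead of assuming a fixed stride.

```python
import torch

num_spec = 2
# Tokens each request contributes to the verify batch after rollback:
# request 0 kept both drafts (3 tokens), request 1 lost one draft
# (2 tokens), request 2 lost both drafts (1 token).
tokens_per_req = torch.tensor([3, 2, 1], dtype=torch.int32)

# cu_seqlens / query_start_loc-style boundaries that varlen kernels use
# instead of a fixed stride of num_spec + 1 tokens per request.
query_start_loc = torch.zeros(tokens_per_req.numel() + 1, dtype=torch.int32)
query_start_loc[1:] = torch.cumsum(tokens_per_req, dim=0)
print(query_start_loc.tolist())  # [0, 3, 5, 6]
```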