
Conversation

@ZJY0516 (Contributor) commented Nov 6, 2025

Purpose

Partially fixes #27571.

In the decode phase with CUDA graphs, the batch is padded up to the nearest pre-captured CUDA graph size. As a result, batch no longer equals attn_metadata.num_decodes, which triggers an assertion error in causal_conv1d_update:

mixed_qkv_non_spec = causal_conv1d_update(
    mixed_qkv_non_spec,
    conv_state,
    conv_weights,
    self.conv1d.bias,
    self.activation,
    conv_state_indices=non_spec_state_indices_tensor[
        : attn_metadata.num_decodes
    ],
    validate_data=True,
)

# inside causal_conv1d_update
if conv_state_indices is None:
    assert conv_state.size(0) >= batch
else:
    assert (batch,) == conv_state_indices.shape
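
The mismatch can be reproduced in isolation. A minimal sketch, assuming a real decode batch of 3 padded to a pre-captured graph size of 4 (the helper mirrors the shape check above and is illustrative, not vLLM's actual code):

```python
import torch

def passes_validate_data(batch: int, conv_state_indices: torch.Tensor) -> bool:
    # Mirrors the shape check inside causal_conv1d_update shown above:
    # one state index per row of the (padded) batch.
    return (batch,) == conv_state_indices.shape

num_decodes = 3   # real decode requests this step
padded_batch = 4  # batch after padding to the pre-captured CUDA graph size
state_indices = torch.arange(padded_batch)

# Slicing by num_decodes gives shape (3,), but the kernel sees batch == 4,
# so the assertion fires.
assert not passes_validate_data(padded_batch, state_indices[:num_decodes])
```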

Test Plan

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --enable-expert-parallel -tp 4 -dp 2
vllm bench serve \
--model Qwen/Qwen3-Next-80B-A3B-Instruct \
--dataset-name random \
--tokenizer Qwen/Qwen3-Next-80B-A3B-Instruct \
--num-prompts 512 \
--random-input-len 2048 \
--random-output-len 1024 --request-rate 30
lm_eval --model local-chat-completions \
  --model_args model=Qwen/Qwen3-Next-80B-A3B-Instruct,base_url=http://localhost:8000/v1/chat/completions,num_concurrent=280 \
  --tasks gsm8k --apply_chat_template --num_fewshot 5

Tasks  Version  Filter            n-shot  Metric       Value   Stderr
gsm8k  3        flexible-extract  5       exact_match  0.5967  ± 0.0135
                strict-match      5       exact_match  0.4170  ± 0.0136

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --enable-expert-parallel -tp 4
Tasks  Version  Filter            n-shot  Metric       Value   Stderr
gsm8k  3        flexible-extract  5       exact_match  0.7839  ± 0.0113
                strict-match      5       exact_match  0.6611  ± 0.0130


Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
@ZJY0516 ZJY0516 requested a review from sighingnow as a code owner November 6, 2025 09:23
@mergify mergify bot added the qwen Related to Qwen models label Nov 6, 2025
@gemini-code-assist (bot) left a comment:
Code Review

This pull request addresses a bug in the qwen3-next model's Qwen3NextGatedDeltaNet layer. The change correctly adjusts the slicing of non_spec_state_indices_tensor by using attn_metadata.num_actual_tokens instead of attn_metadata.num_decodes. This is a critical fix for scenarios involving CUDA graph capture, where tensors are padded to a fixed size. The original code could lead to shape mismatches and assertion failures, while the new code ensures the tensor size is correct, preventing potential crashes. The fix is accurate and necessary for robust model execution.
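
The fix described above amounts to slicing the state-indices tensor by the padded token count so its length matches the batch the captured kernel actually runs. A minimal sketch (the helper name is illustrative, not vLLM's actual code):

```python
import torch

def select_state_indices(indices: torch.Tensor, num_actual_tokens: int) -> torch.Tensor:
    # Slice by the (possibly padded) token count so the result has one
    # entry per row of the batch seen by the CUDA-graph-captured kernel.
    return indices[:num_actual_tokens]

indices = torch.arange(8)  # state indices, one per padded batch slot
padded_batch = 8           # batch after CUDA graph padding
sliced = select_state_indices(indices, padded_batch)
assert sliced.shape == (padded_batch,)  # satisfies the kernel's shape check
```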

@ZJY0516 ZJY0516 changed the title [Bugfix] fix qwen3-next ima [Bugfix] fix qwen3-next crash Nov 6, 2025
@ZJY0516 (Contributor, author) commented Nov 6, 2025

It seems this PR has an accuracy issue:

lm_eval --model local-completions --model_args model=Qwen/Qwen3-Next-80B-A3B-Instruct,base_url=http://localhost:8000/v1/completions -t gsm8k --num_fewshot 5 --batch_size 250
Tasks  Version  Filter            n-shot  Metric       Value   Stderr
gsm8k  3        flexible-extract  5       exact_match  0.3480  ± 0.0131
                strict-match      5       exact_match  0.2782  ± 0.0123

@vadiklyutiy vadiklyutiy self-requested a review November 6, 2025 11:48
@vadiklyutiy (Collaborator) commented:

at first glance it looks like the right change ...

@vadiklyutiy (Collaborator) commented:

@ZJY0516 could you describe what happens in #27571? Why does it cause an illegal memory access?

@ZJY0516 (Contributor, author) commented Nov 6, 2025

Sometimes it crashes with an illegal memory access and sometimes with an assertion error.

I think the root cause is the assert (batch,) == conv_state_indices.shape check. @vadiklyutiy

@vadiklyutiy (Collaborator) commented:

Could you check lm_eval with this PR's changes but without -dp?

@ZJY0516 (Contributor, author) commented Nov 6, 2025

Could you check lm_eval with this PR's changes but without -dp?

Much worse:

Tasks  Version  Filter            n-shot  Metric       Value  Stderr
gsm8k  3        flexible-extract  5       exact_match  0      ± 0
                strict-match      5       exact_match  0      ± 0

This is most likely due to the Triton kernel cache.

@ZJY0516 (Contributor, author) commented Nov 7, 2025

I tested it on 2 H200 GPUs and there is no problem now. Could you please help test this PR on your machine? @vadiklyutiy

vllm serve /data/datasets/models-hf/Qwen3-Next-80B-A3B-Instruct --served-model-name Qwen/Qwen3-Next-80B-A3B-Instruct -tp 2 --enable-expert-parallel --compilation-config '{"cudagraph_mode": "NONE"}'

lm_eval --model local-chat-completions --model_args model=Qwen/Qwen3-Next-80B-A3B-Instruct,base_url=http://localhost:8000/v1/chat/completions,num_concurrent=280 --tasks gsm8k --apply_chat_template --num_fewshot 5
Tasks  Version  Filter            n-shot  Metric       Value   Stderr
gsm8k  3        flexible-extract  5       exact_match  0.7892  ± 0.0112
                strict-match      5       exact_match  0.6649  ± 0.0130

@vadiklyutiy (Collaborator) commented:

--no-enable-prefix-caching is missing.

@ZJY0516 (Contributor, author) commented Nov 10, 2025

After merging from main

lm_eval --model local-chat-completions --model_args model=Qwen/Qwen3-Next-80B-A3B-Instruct,base_url=http://localhost:8000/v1/chat/completions,num_concurrent=280 --tasks gsm8k --apply_chat_template --num_fewshot 5

With this PR:

Tasks  Version  Filter            n-shot  Metric       Value   Stderr
gsm8k  3        flexible-extract  5       exact_match  0.5967  ± 0.0135
                strict-match      5       exact_match  0.4170  ± 0.0136

@heheda12345 (Collaborator) left a comment:
Nice fix!

@heheda12345 heheda12345 enabled auto-merge (squash) November 11, 2025 04:13
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 11, 2025
@heheda12345 heheda12345 merged commit f0359ff into vllm-project:main Nov 11, 2025
60 checks passed
@ZJY0516 ZJY0516 deleted the fix-q3n branch November 13, 2025 09:39
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Nov 13, 2025
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>

Labels

qwen (Related to Qwen models), ready (ONLY add when PR is ready to merge/full CI is needed)


Development

Successfully merging this pull request may close these issues.

[Bug]: qwen3-next failed with CUDA error: an illegal memory access was encountered
