Your current environment
The output of `python collect_env.py`
CPU:
Architecture: aarch64
CPU op-mode(s): 64-bit
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==27.0.0
[pip3] torch==2.5.1
[pip3] torch-npu==2.5.1.post1.dev20250528
[pip3] torchvision==0.20.1
[pip3] transformers==4.52.4
[conda] Could not collect
vLLM Version: 0.9.1
vLLM Ascend Version: 0.9.0rc3.dev33+gdb2f630 (git sha: db2f630)
HDK: 24.1.0.3
CANN:
package_name=Ascend-cann-toolkit
version=8.1.RC1
innerversion=V100R001C21SPC001B238
compatible_version=[V100R001C15],[V100R001C18],[V100R001C19],[V100R001C20],[V100R001C21]
arch=aarch64
os=linux
path=/usr/local/Ascend/ascend-toolkit/8.1.RC1/aarch64-linux
🐛 Describe the bug
Apply this PR: https://github.com/vllm-project/vllm-ascend/pull/1273/files
and run this command:
```shell
python examples/offline_data_parallel.py \
    --model="Qwen3-30B-A3B" \
    --dp-size=2 \
    --tp-size=2 \
    --enforce-eager
```
There is a precision issue with the results on DP rank 0:
DP rank 1, Prompt: 'Hello, my name is', Generated text: ' Shin-ji, and I am a 1st grade student. My teacher is a 5'
DP rank 1, Prompt: 'The president of the United States is', Generated text: ' a woman. ( ) A. Correct B. Incorrect C. Cannot be determined D. None'
DP rank 1, Prompt: 'The capital of France is', Generated text: ' the city where the main office of the French government is located. What is the capital of France?\n\n'
DP rank 1, Prompt: 'The future of AI is', Generated text: ' bright, but not without challenges. The evolution of AI will be shaped by the ethical and legal frameworks'
DP rank 1, Prompt: 'Hello, my name is', Generated text: ' Phoe, and I am a student. It is nice to meet you. (This is a'
Processed prompts: 100%|██████| 200/200 [00:09<00:00, 21.74it/s, est. speed input: 119.58 toks/s, output: 347.87 toks/s]
DP rank 0, Prompt: 'Hello, my name is', Generated text: ', and, and, and, and, and, and, and, and'
DP rank 0, Prompt: 'The president of the United States is', Generated text: ' the the the the the the the the the the the the the the the the'
DP rank 0, Prompt: 'The capital of France is', Generated text: ' the capital of the capital of the capital of the capital of the capital of the'
DP rank 0, Prompt: 'The future of AI is', Generated text: ' now is the the the the the the the the the the the the the the'
DP rank 0, Prompt: 'Hello, my name is', Generated text: ', and, and, and, and, and, and, and, and'
Executing the command `python examples/offline_data_parallel.py --model="Qwen3-30B-A3B" --dp-size=2 --tp-size=2 --enable-expert-parallel --enforce-eager` (i.e., with expert parallelism enabled) leads to similar issues. Graph mode exhibits the same issue.
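For reference, the degenerate outputs on DP rank 0 are easy to detect automatically, which may help when checking a fix. The sketch below is plain Python with no vLLM dependency; the `looks_degenerate` helper and its 0.5 repetition threshold are arbitrary assumptions, not part of the example script:

```python
def looks_degenerate(text: str, threshold: float = 0.5) -> bool:
    """Heuristic: flag a generation in which a single token accounts for
    more than `threshold` of all whitespace-separated tokens, which is
    how this precision issue manifests (e.g. ' the the the ...')."""
    tokens = text.split()
    if len(tokens) < 4:
        return False
    most_common = max(tokens.count(t) for t in set(tokens))
    return most_common / len(tokens) > threshold

# Sample outputs observed on DP rank 0 vs. DP rank 1 in this report:
bad = " the the the the the the the the the the the the the the the the"
good = " bright, but not without challenges. The evolution of AI will be shaped"
print(looks_degenerate(bad))   # True
print(looks_degenerate(good))  # False
```

Running this check over all DP rank 0 generations in the log above flags every one of them, while every DP rank 1 generation passes.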