
[Bug]: qwen3-next failed with CUDA error: an illegal memory access was encountered #27571

@ZJY0516

Description

Your current environment

The output of python collect_env.py
Collecting environment information...
==============================
        System Info
==============================
OS                           : Alibaba Cloud Linux release 3 (Soaring Falcon)  (x86_64)
GCC version                  : (GCC) 10.2.1 20200825 (Alibaba 10.2.1-3 2.32)
Clang version                : 15.0.7 ( 15.0.7-1.0.3.al8)
CMake version                : version 4.1.2
Libc version                 : glibc-2.32

==============================
       PyTorch Info
==============================
PyTorch version              : 2.9.0+cu129
Is debug build               : False
CUDA used to build PyTorch   : 12.9
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.11 (main, Oct  7 2025, 15:34:39) [Clang 20.1.4 ] (64-bit runtime)
Python platform              : Linux-5.10.134-16.3.al8.x86_64-x86_64-with-glibc2.32

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.8.61
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : 
GPU 0: NVIDIA H20-3e
GPU 1: NVIDIA H20-3e
GPU 2: NVIDIA H20-3e
GPU 3: NVIDIA H20-3e
GPU 4: NVIDIA H20-3e
GPU 5: NVIDIA H20-3e
GPU 6: NVIDIA H20-3e
GPU 7: NVIDIA H20-3e

Nvidia driver version        : 570.133.20
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              192
On-line CPU(s) list: 0-191
Thread(s) per core:  2
Core(s) per socket:  48
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               207
Model name:          INTEL(R) XEON(R) PLATINUM 8575C
Stepping:            2
CPU MHz:             3184.397
CPU max MHz:         4000.0000
CPU min MHz:         800.0000
BogoMIPS:            5600.00
Virtualization:      VT-x
L1d cache:           48K
L1i cache:           32K
L2 cache:            2048K
L3 cache:            327680K
NUMA node0 CPU(s):   0-47,96-143
NUMA node1 CPU(s):   48-95,144-191
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm uintr md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities

==============================
Versions of relevant libraries
==============================
[pip3] efficientnet_pytorch==0.7.1
[pip3] flashinfer-python==0.4.1
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.9.1.4
[pip3] nvidia-cuda-cupti-cu12==12.9.79
[pip3] nvidia-cuda-nvrtc-cu12==12.9.86
[pip3] nvidia-cuda-runtime-cu12==12.9.79
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.15.0
[pip3] nvidia-cufft-cu12==11.4.1.4
[pip3] nvidia-cufile-cu12==1.14.1.1
[pip3] nvidia-curand-cu12==10.3.10.19
[pip3] nvidia-cusolver-cu12==11.7.5.82
[pip3] nvidia-cusparse-cu12==12.5.10.65
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.2.1
[pip3] nvidia-ml-py==13.580.82
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.9.86
[pip3] nvidia-nvshmem-cu12==3.3.20
[pip3] nvidia-nvtx-cu12==12.9.79
[pip3] open_clip_torch==2.32.0
[pip3] pytorch-lightning==2.5.2
[pip3] pyzmq==27.1.0
[pip3] segmentation_models_pytorch==0.4.0
[pip3] sentence-transformers==3.2.1
[pip3] terratorch==1.0.2
[pip3] torch==2.9.0+cu129
[pip3] torchaudio==2.9.0+cu129
[pip3] torchgeo==0.7.0
[pip3] torchmetrics==1.7.4
[pip3] torchvision==0.24.0+cu129
[pip3] transformers==4.56.2
[pip3] transformers-stream-generator==0.0.5
[pip3] triton==3.5.0
[pip3] triton_kernels==1.0.0
[pip3] tritonclient==2.51.0
[pip3] vector-quantize-pytorch==1.21.2
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.11.1rc4.dev17+g361a7463d.d20251027 (git sha: 361a7463d, date: 20251027)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	NIC1	NIC2	NIC3	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV18	NV18	NV18	NV18	NV18	NV18	NV18	NODE	NODE	SYS	SYS	0-47,96-143	0	N/A
GPU1	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NV18	PIX	NODE	SYS	SYS	0-47,96-143	0	N/A
GPU2	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NV18	NODE	NODE	SYS	SYS	0-47,96-143	0	N/A
GPU3	NV18	NV18	NV18	 X 	NV18	NV18	NV18	NV18	NODE	PIX	SYS	SYS	0-47,96-143	0	N/A
GPU4	NV18	NV18	NV18	NV18	 X 	NV18	NV18	NV18	SYS	SYS	PIX	NODE	48-95,144-191	1	N/A
GPU5	NV18	NV18	NV18	NV18	NV18	 X 	NV18	NV18	SYS	SYS	NODE	NODE	48-95,144-191	1	N/A
GPU6	NV18	NV18	NV18	NV18	NV18	NV18	 X 	NV18	SYS	SYS	NODE	PIX	48-95,144-191	1	N/A
GPU7	NV18	NV18	NV18	NV18	NV18	NV18	NV18	 X 	SYS	SYS	NODE	NODE	48-95,144-191	1	N/A
NIC0	NODE	PIX	NODE	NODE	SYS	SYS	SYS	SYS	 X 	NODE	SYS	SYS				
NIC1	NODE	NODE	NODE	PIX	SYS	SYS	SYS	SYS	NODE	 X 	SYS	SYS				
NIC2	SYS	SYS	SYS	SYS	PIX	NODE	NODE	NODE	SYS	SYS	 X 	NODE				
NIC3	SYS	SYS	SYS	SYS	NODE	NODE	PIX	NODE	SYS	SYS	NODE	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_bond_0
  NIC1: mlx5_bond_1
  NIC2: mlx5_bond_2
  NIC3: mlx5_bond_3

==============================
     Environment Variables
==============================
LD_LIBRARY_PATH=/usr/local/cuda/lib64
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1

🐛 Describe the bug

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --enable-expert-parallel -tp 4 -dp 2
vllm bench serve \
--model Qwen/Qwen3-Next-80B-A3B-Instruct \
--dataset-name random \
--tokenizer Qwen/Qwen3-Next-80B-A3B-Instruct \
--num-prompts 512 \
--random-input-len 2048 \
--random-output-len 1024 --request-rate 30

Error log:
(APIServer pid=1533589) INFO:     127.0.0.1:57968 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=1533589) INFO:     127.0.0.1:57976 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=1533589) INFO:     127.0.0.1:57990 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=1533589) INFO:     127.0.0.1:58004 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=1533589) INFO:     127.0.0.1:58020 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=1533589) INFO:     127.0.0.1:58024 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=1533589) INFO:     127.0.0.1:58034 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=1533589) INFO:     127.0.0.1:58042 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=1533589) INFO:     127.0.0.1:58056 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=1533589) INFO:     127.0.0.1:58060 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=1533589) INFO:     127.0.0.1:58068 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=1533589) INFO 10-27 20:04:49 [loggers.py:208] Engine 000: Avg prompt throughput: 204.8 tokens/s, Avg generation throughput: 36.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1533589) INFO 10-27 20:04:59 [loggers.py:208] Engine 000: Avg prompt throughput: 12902.1 tokens/s, Avg generation throughput: 60.8 tokens/s, Running: 64 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.2%, Prefix cache hit rate: 0.0%
(APIServer pid=1533589) INFO 10-27 20:04:59 [loggers.py:208] Engine 001: Avg prompt throughput: 13107.0 tokens/s, Avg generation throughput: 54.5 tokens/s, Running: 64 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.2%, Prefix cache hit rate: 0.0%
(APIServer pid=1533589) INFO 10-27 20:05:09 [loggers.py:208] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2802.8 tokens/s, Running: 57 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.3%, Prefix cache hit rate: 0.0%
(APIServer pid=1533589) INFO 10-27 20:05:09 [loggers.py:208] Engine 001: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2778.1 tokens/s, Running: 58 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.3%, Prefix cache hit rate: 0.0%
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699] WorkerProc hit an exception.
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699] Traceback (most recent call last):
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]   File "/home/zjy/code/vllm-src/vllm/v1/executor/multiproc_executor.py", line 694, in worker_busy_loop
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]     output = func(*args, **kwargs)
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]              ^^^^^^^^^^^^^^^^^^^^^
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]   File "/home/zjy/code/vllm-src/vllm/v1/worker/worker_base.py", line 353, in execute_model
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]     return self.worker.execute_model(scheduler_output, *args, **kwargs)
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]   File "/home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]     return func(*args, **kwargs)
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]   File "/home/zjy/code/vllm-src/vllm/v1/worker/gpu_worker.py", line 491, in execute_model
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]     output = self.model_runner.execute_model(scheduler_output, intermediate_tensors)
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]   File "/home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]     return func(*args, **kwargs)
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]   File "/home/zjy/code/vllm-src/vllm/v1/worker/gpu_model_runner.py", line 2632, in execute_model
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]     ) = self._bookkeeping_sync(
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]         ^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]   File "/home/zjy/code/vllm-src/vllm/v1/worker/gpu_model_runner.py", line 2274, in _bookkeeping_sync
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]     valid_sampled_token_ids = self._to_list(sampled_token_ids)
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]   File "/home/zjy/code/vllm-src/vllm/v1/worker/gpu_model_runner.py", line 4660, in _to_list
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]     self.transfer_event.synchronize()
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]   File "/home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/cuda/streams.py", line 231, in synchronize
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]     super().synchronize()
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699] torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699] Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699] 
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699] Traceback (most recent call last):
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]   File "/home/zjy/code/vllm-src/vllm/v1/executor/multiproc_executor.py", line 694, in worker_busy_loop
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]     output = func(*args, **kwargs)
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]              ^^^^^^^^^^^^^^^^^^^^^
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]   File "/home/zjy/code/vllm-src/vllm/v1/worker/worker_base.py", line 353, in execute_model
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]     return self.worker.execute_model(scheduler_output, *args, **kwargs)
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]   File "/home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]     return func(*args, **kwargs)
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]   File "/home/zjy/code/vllm-src/vllm/v1/worker/gpu_worker.py", line 491, in execute_model
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]     output = self.model_runner.execute_model(scheduler_output, intermediate_tensors)
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]   File "/home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]     return func(*args, **kwargs)
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]   File "/home/zjy/code/vllm-src/vllm/v1/worker/gpu_model_runner.py", line 2632, in execute_model
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]     ) = self._bookkeeping_sync(
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]         ^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]   File "/home/zjy/code/vllm-src/vllm/v1/worker/gpu_model_runner.py", line 2274, in _bookkeeping_sync
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]     valid_sampled_token_ids = self._to_list(sampled_token_ids)
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]   File "/home/zjy/code/vllm-src/vllm/v1/worker/gpu_model_runner.py", line 4660, in _to_list
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]     self.transfer_event.synchronize()
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]   File "/home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/cuda/streams.py", line 231, in synchronize
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699]     super().synchronize()
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699] torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699] Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699] 
(Worker_DP0_TP0_EP0 pid=1539019) ERROR 10-27 20:05:10 [multiproc_executor.py:699] 
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.11.1rc4.dev17+g361a7463d.d20251027) with config: model='Qwen/Qwen3-Next-80B-A3B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen3-Next-80B-A3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-Next-80B-A3B-Instruct, enable_prefix_caching=False, chunked_prefill_enabled=True, pooler_config=None, compilation_config={'level': None, 'mode': 3, 'debug_dump_path': None, 'cache_dir': '', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention', 'vllm::sparse_attn_indexer'], 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'use_cudagraph': True, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'full_cuda_graph': True, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 512, 'local_cache_dir': None}, 
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['cmpl-benchmark-serving0-0', 'cmpl-benchmark-serving2-0', 'cmpl-benchmark-serving4-0', 'cmpl-benchmark-serving6-0', 'cmpl-benchmark-serving8-0', 'cmpl-benchmark-serving10-0', 'cmpl-benchmark-serving14-0', 'cmpl-benchmark-serving16-0', 'cmpl-benchmark-serving18-0', 'cmpl-benchmark-serving20-0', 'cmpl-benchmark-serving22-0', 'cmpl-benchmark-serving24-0', 'cmpl-benchmark-serving26-0', 'cmpl-benchmark-serving29-0', 'cmpl-benchmark-serving30-0', 'cmpl-benchmark-serving32-0', 'cmpl-benchmark-serving36-0', 'cmpl-benchmark-serving37-0', 'cmpl-benchmark-serving40-0', 'cmpl-benchmark-serving44-0', 'cmpl-benchmark-serving46-0', 'cmpl-benchmark-serving48-0', 'cmpl-benchmark-serving51-0', 'cmpl-benchmark-serving52-0', 'cmpl-benchmark-serving54-0', 'cmpl-benchmark-serving58-0', 'cmpl-benchmark-serving59-0', 'cmpl-benchmark-serving62-0', 'cmpl-benchmark-serving64-0', 'cmpl-benchmark-serving66-0', 'cmpl-benchmark-serving68-0', 'cmpl-benchmark-serving69-0', 'cmpl-benchmark-serving72-0', 'cmpl-benchmark-serving74-0', 'cmpl-benchmark-serving76-0', 'cmpl-benchmark-serving80-0', 'cmpl-benchmark-serving84-0', 'cmpl-benchmark-serving86-0', 'cmpl-benchmark-serving87-0', 'cmpl-benchmark-serving90-0', 'cmpl-benchmark-serving92-0', 'cmpl-benchmark-serving94-0', 'cmpl-benchmark-serving97-0', 'cmpl-benchmark-serving99-0', 'cmpl-benchmark-serving101-0', 'cmpl-benchmark-serving103-0', 'cmpl-benchmark-serving105-0', 'cmpl-benchmark-serving107-0', 'cmpl-benchmark-serving111-0', 'cmpl-benchmark-serving115-0', 'cmpl-benchmark-serving117-0', 'cmpl-benchmark-serving119-0', 'cmpl-benchmark-serving123-0', 'cmpl-benchmark-serving125-0', 'cmpl-benchmark-serving127-0'], resumed_from_preemption=[false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false], new_token_ids=[], resumed_req_token_ids=[null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null], new_block_ids=[null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null], num_computed_tokens=[2558, 2557, 2557, 2557, 2556, 2556, 2556, 2555, 2555, 2555, 2555, 2554, 2554, 2554, 2554, 2553, 2553, 2553, 2552, 2552, 2552, 2551, 2551, 2551, 2551, 2550, 2550, 2550, 2549, 2549, 2549, 2549, 2548, 2548, 2548, 2547, 2547, 2547, 2546, 2546, 2546, 2546, 2545, 2545, 2545, 2545, 2544, 2544, 2544, 2543, 2543, 2543, 2542, 2542, 2542], num_output_tokens=[511, 510, 510, 510, 509, 509, 509, 508, 508, 508, 508, 507, 507, 507, 507, 506, 506, 506, 505, 505, 505, 504, 504, 504, 504, 503, 503, 503, 502, 502, 502, 502, 501, 501, 501, 500, 500, 500, 499, 
499, 499, 499, 498, 498, 498, 498, 497, 497, 497, 496, 496, 496, 495, 495, 495]), num_scheduled_tokens={cmpl-benchmark-serving6-0: 1, cmpl-benchmark-serving74-0: 1, cmpl-benchmark-serving8-0: 1, cmpl-benchmark-serving107-0: 1, cmpl-benchmark-serving127-0: 1, cmpl-benchmark-serving52-0: 1, cmpl-benchmark-serving54-0: 1, cmpl-benchmark-serving101-0: 1, cmpl-benchmark-serving46-0: 1, cmpl-benchmark-serving44-0: 1, cmpl-benchmark-serving30-0: 1, cmpl-benchmark-serving24-0: 1, cmpl-benchmark-serving92-0: 1, cmpl-benchmark-serving62-0: 1, cmpl-benchmark-serving90-0: 1, cmpl-benchmark-serving99-0: 1, cmpl-benchmark-serving119-0: 1, cmpl-benchmark-serving76-0: 1, cmpl-benchmark-serving115-0: 1, cmpl-benchmark-serving37-0: 1, cmpl-benchmark-serving36-0: 1, cmpl-benchmark-serving111-0: 1, cmpl-benchmark-serving18-0: 1, cmpl-benchmark-serving14-0: 1, cmpl-benchmark-serving4-0: 1, cmpl-benchmark-serving16-0: 1, cmpl-benchmark-serving59-0: 1, cmpl-benchmark-serving0-0: 1, cmpl-benchmark-serving58-0: 1, cmpl-benchmark-serving64-0: 1, cmpl-benchmark-serving22-0: 1, cmpl-benchmark-serving97-0: 1, cmpl-benchmark-serving103-0: 1, cmpl-benchmark-serving87-0: 1, cmpl-benchmark-serving117-0: 1, cmpl-benchmark-serving40-0: 1, cmpl-benchmark-serving48-0: 1, cmpl-benchmark-serving123-0: 1, cmpl-benchmark-serving32-0: 1, cmpl-benchmark-serving2-0: 1, cmpl-benchmark-serving51-0: 1, cmpl-benchmark-serving69-0: 1, cmpl-benchmark-serving80-0: 1, cmpl-benchmark-serving105-0: 1, cmpl-benchmark-serving86-0: 1, cmpl-benchmark-serving68-0: 1, cmpl-benchmark-serving66-0: 1, cmpl-benchmark-serving10-0: 1, cmpl-benchmark-serving20-0: 1, cmpl-benchmark-serving125-0: 1, cmpl-benchmark-serving84-0: 1, cmpl-benchmark-serving29-0: 1, cmpl-benchmark-serving72-0: 1, cmpl-benchmark-serving26-0: 1, cmpl-benchmark-serving94-0: 1}, total_num_scheduled_tokens=55, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0], finished_req_ids=['cmpl-benchmark-serving34-0'], free_encoder_mm_hashes=[], structured_output_request_ids=[], grammar_bitmask=null, kv_connector_metadata=null)
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=55, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.022279695874361183, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, spec_decoding_stats=None, kv_connector_stats=None, num_corrupted_reqs=0)
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781] Traceback (most recent call last):
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781]   File "/home/zjy/code/vllm-src/vllm/v1/engine/core.py", line 772, in run_engine_core
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781]   File "/home/zjy/code/vllm-src/vllm/v1/engine/core.py", line 1149, in run_busy_loop
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781]     executed = self._process_engine_step()
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781]   File "/home/zjy/code/vllm-src/vllm/v1/engine/core.py", line 828, in _process_engine_step
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781]   File "/home/zjy/code/vllm-src/vllm/v1/engine/core.py", line 318, in step
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781]     model_output = self.model_executor.execute_model(scheduler_output)
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781]   File "/home/zjy/code/vllm-src/vllm/v1/executor/multiproc_executor.py", line 185, in execute_model
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781]     (output,) = self.collective_rpc(
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781]                 ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781]   File "/home/zjy/code/vllm-src/vllm/v1/executor/multiproc_executor.py", line 283, in collective_rpc
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781]     result = get_response(w, dequeue_timeout, self.shutdown_event)
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781]   File "/home/zjy/code/vllm-src/vllm/v1/executor/multiproc_executor.py", line 264, in get_response
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781]     raise RuntimeError(
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781] RuntimeError: Worker failed with error 'CUDA error: an illegal memory access was encountered
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781] Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_DP0 pid=1537238) ERROR 10-27 20:05:10 [core.py:781] ', please check the stack trace above for the root cause
(Worker_DP0_TP0_EP0 pid=1539019) INFO 10-27 20:05:10 [multiproc_executor.py:588] Parent process exited, terminating worker
(Worker_DP0_TP0_EP0 pid=1539019) INFO 10-27 20:05:10 [multiproc_executor.py:629] WorkerProc shutting down.
(APIServer pid=1533589) ERROR 10-27 20:05:10 [async_llm.py:534] AsyncLLM output_handler failed.
(APIServer pid=1533589) ERROR 10-27 20:05:10 [async_llm.py:534] Traceback (most recent call last):
(APIServer pid=1533589) ERROR 10-27 20:05:10 [async_llm.py:534]   File "/home/zjy/code/vllm-src/vllm/v1/engine/async_llm.py", line 488, in output_handler
(APIServer pid=1533589) ERROR 10-27 20:05:10 [async_llm.py:534]     outputs = await engine_core.get_output_async()
(APIServer pid=1533589) ERROR 10-27 20:05:10 [async_llm.py:534]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1533589) ERROR 10-27 20:05:10 [async_llm.py:534]   File "/home/zjy/code/vllm-src/vllm/v1/engine/core_client.py", line 882, in get_output_async
(APIServer pid=1533589) ERROR 10-27 20:05:10 [async_llm.py:534]     raise self._format_exception(outputs) from None
(APIServer pid=1533589) ERROR 10-27 20:05:10 [async_llm.py:534] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(Worker_DP0_TP1_EP1 pid=1539020) INFO 10-27 20:05:10 [multiproc_executor.py:588] Parent process exited, terminating worker
(Worker_DP0_TP2_EP2 pid=1539021) INFO 10-27 20:05:10 [multiproc_executor.py:588] Parent process exited, terminating worker
(Worker_DP0_TP3_EP3 pid=1539022) INFO 10-27 20:05:10 [multiproc_executor.py:588] Parent process exited, terminating worker
[rank1]:[E1027 20:05:10.760395695 ProcessGroupNCCL.cpp:2057] [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f2aaa57cb80 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x11fb7 (0x7f2b24d66fb7 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f2aab47cf30 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f2aab48c4b8 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x969 (0x7f2aab4906b9 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0xdf (0x7f2aab49262f in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xd8b74 (0x7f2b1d6e8b74 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x93fb (0x7f2b331e73fb in /lib64/libpthread.so.0)
frame #8: clone + 0x43 (0x7f2b32ee2e83 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f2aaa57cb80 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x11fb7 (0x7f2b24d66fb7 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f2aab47cf30 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f2aab48c4b8 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x969 (0x7f2aab4906b9 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0xdf (0x7f2aab49262f in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xd8b74 (0x7f2b1d6e8b74 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x93fb (0x7f2b331e73fb in /lib64/libpthread.so.0)
frame #8: clone + 0x43 (0x7f2b32ee2e83 in /lib64/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2063 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f2aaa57cb80 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe68aa1 (0x7f2aab468aa1 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x95124f (0x7f2aaaf5124f in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xd8b74 (0x7f2b1d6e8b74 in /lib64/libstdc++.so.6)
frame #4: <unknown function> + 0x93fb (0x7f2b331e73fb in /lib64/libpthread.so.0)
frame #5: clone + 0x43 (0x7f2b32ee2e83 in /lib64/libc.so.6)

[rank2]:[E1027 20:05:10.784995018 ProcessGroupNCCL.cpp:2057] [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f733777cb80 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x11fb7 (0x7f73b1f66fb7 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f733867cf30 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f733868c4b8 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x969 (0x7f73386906b9 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0xdf (0x7f733869262f in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xd8b74 (0x7f73aa8e8b74 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x93fb (0x7f73c037a3fb in /lib64/libpthread.so.0)
frame #8: clone + 0x43 (0x7f73c0075e83 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f733777cb80 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x11fb7 (0x7f73b1f66fb7 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f733867cf30 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f733868c4b8 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x969 (0x7f73386906b9 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0xdf (0x7f733869262f in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xd8b74 (0x7f73aa8e8b74 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x93fb (0x7f73c037a3fb in /lib64/libpthread.so.0)
frame #8: clone + 0x43 (0x7f73c0075e83 in /lib64/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2063 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f733777cb80 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe68aa1 (0x7f7338668aa1 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x95124f (0x7f733815124f in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xd8b74 (0x7f73aa8e8b74 in /lib64/libstdc++.so.6)
frame #4: <unknown function> + 0x93fb (0x7f73c037a3fb in /lib64/libpthread.so.0)
frame #5: clone + 0x43 (0x7f73c0075e83 in /lib64/libc.so.6)

[rank3]:[E1027 20:05:10.819632115 ProcessGroupNCCL.cpp:2057] [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f6cbcd7cb80 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x11fb7 (0x7f6d37566fb7 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f6cbdc7cf30 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f6cbdc8c4b8 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x969 (0x7f6cbdc906b9 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0xdf (0x7f6cbdc9262f in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xd8b74 (0x7f6d2fee8b74 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x93fb (0x7f6d4596e3fb in /lib64/libpthread.so.0)
frame #8: clone + 0x43 (0x7f6d45669e83 in /lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f6cbcd7cb80 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x11fb7 (0x7f6d37566fb7 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f6cbdc7cf30 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f6cbdc8c4b8 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x969 (0x7f6cbdc906b9 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0xdf (0x7f6cbdc9262f in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xd8b74 (0x7f6d2fee8b74 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x93fb (0x7f6d4596e3fb in /lib64/libpthread.so.0)
frame #8: clone + 0x43 (0x7f6d45669e83 in /lib64/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2063 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f6cbcd7cb80 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe68aa1 (0x7f6cbdc68aa1 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x95124f (0x7f6cbd75124f in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xd8b74 (0x7f6d2fee8b74 in /lib64/libstdc++.so.6)
frame #4: <unknown function> + 0x93fb (0x7f6d4596e3fb in /lib64/libpthread.so.0)
frame #5: clone + 0x43 (0x7f6d45669e83 in /lib64/libc.so.6)

(APIServer pid=1533589) INFO:     Shutting down
(APIServer pid=1533589) INFO:     Waiting for application shutdown.
(APIServer pid=1533589) INFO:     Application shutdown complete.
(APIServer pid=1533589) INFO:     Finished server process [1533589]
(Worker_DP1_TP0_EP4 pid=1539005) INFO 10-27 20:05:14 [multiproc_executor.py:588] Parent process exited, terminating worker
(Worker_DP1_TP1_EP5 pid=1539006) INFO 10-27 20:05:14 [multiproc_executor.py:588] Parent process exited, terminating worker
(Worker_DP1_TP2_EP6 pid=1539007) INFO 10-27 20:05:14 [multiproc_executor.py:588] Parent process exited, terminating worker
(Worker_DP1_TP3_EP7 pid=1539008) INFO 10-27 20:05:14 [multiproc_executor.py:588] Parent process exited, terminating worker
[rank6]:[W1027 20:05:14.938690995 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=103, addr=[::ffff:127.0.0.1]:38414, remote=[::ffff:127.0.0.1]:43859): Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:697 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f726cf7cb80 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ffd551 (0x7f72ca3fd551 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5ffe94d (0x7f72ca3fe94d in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5fff4fa (0x7f72ca3ff4fa in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x31e (0x7f72ca3fa21e in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c8 (0x7f726de89a88 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xd8b74 (0x7f72e00e8b74 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x93fb (0x7f72f5b7c3fb in /lib64/libpthread.so.0)
frame #8: clone + 0x43 (0x7f72f5877e83 in /lib64/libc.so.6)

[rank6]:[W1027 20:05:14.942663394 ProcessGroupNCCL.cpp:1771] [PG ID 0 PG GUID 0 Rank 6] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
[rank4]:[W1027 20:05:15.050015520 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=103, addr=[::ffff:127.0.0.1]:38418, remote=[::ffff:127.0.0.1]:43859): Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:697 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f0409f7cb80 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ffd551 (0x7f03ecbfd551 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5ffe94d (0x7f03ecbfe94d in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5fff4fa (0x7f03ecbff4fa in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x31e (0x7f03ecbfa21e in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c8 (0x7f0390689a88 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xd8b74 (0x7f04028e8b74 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x93fb (0x7f041847f3fb in /lib64/libpthread.so.0)
frame #8: clone + 0x43 (0x7f041817ae83 in /lib64/libc.so.6)

[rank4]:[W1027 20:05:15.053335874 ProcessGroupNCCL.cpp:1771] [PG ID 0 PG GUID 0 Rank 4] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
[rank7]:[W1027 20:05:15.052888213 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=103, addr=[::ffff:127.0.0.1]:38426, remote=[::ffff:127.0.0.1]:43859): Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:697 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f20acb7cb80 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ffd551 (0x7f208f7fd551 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5ffe94d (0x7f208f7fe94d in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5fff4fa (0x7f208f7ff4fa in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x31e (0x7f208f7fa21e in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c8 (0x7f2033289a88 in /home/zjy/code/vllm-src/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xd8b74 (0x7f20a54e8b74 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x93fb (0x7f20bb0783fb in /lib64/libpthread.so.0)
frame #8: clone + 0x43 (0x7f20bad73e83 in /lib64/libc.so.6)

Note: the same workload with --enable-expert-parallel -tp 8 does not fail.
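
For reference, a minimal sketch of how the failing configuration could be rerun with synchronous kernel launches, following the hint in the error message itself: with asynchronous launches, the reported frame (transfer_event.synchronize() in _to_list) is only where the error surfaces, not necessarily where the faulting kernel ran. CUDA_LAUNCH_BLOCKING and NCCL_DEBUG are standard CUDA/NCCL environment variables rather than anything specific to this report; this is a debugging sketch, not a fix.

# Debugging sketch (assumes the same model and flags as above; CUDA_LAUNCH_BLOCKING
# and NCCL_DEBUG are generic CUDA/NCCL knobs, not vLLM-specific settings).
# Synchronous launches make the Python stack point at the kernel that actually faults.
CUDA_LAUNCH_BLOCKING=1 NCCL_DEBUG=INFO \
  vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --enable-expert-parallel -tp 4 -dp 2

# Then replay the same benchmark that triggered the crash:
vllm bench serve \
  --model Qwen/Qwen3-Next-80B-A3B-Instruct \
  --dataset-name random \
  --tokenizer Qwen/Qwen3-Next-80B-A3B-Instruct \
  --num-prompts 512 \
  --random-input-len 2048 \
  --random-output-len 1024 --request-rate 30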
