Your current environment
PyTorch 2.7.0, vLLM main branch built from source.
🐛 Describe the bug
Repro:
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 --tensor-parallel-size 8 --max-num-batched-tokens 40000 --max-model-len 8192 --max-num-seqs 128 --gpu-memory-utilization 0.8
gives a CUDA illegal memory access, along with the following errors:
ERROR 06-13 15:32:09 [core.py:515] EngineCore failed to start.
ERROR 06-13 15:32:09 [core.py:515] Traceback (most recent call last):
ERROR 06-13 15:32:09 [core.py:515] File "/home/rzou/dev/stable0/vllm-stable0/vllm/v1/engine/core.py", line 506, in run_engine_core
ERROR 06-13 15:32:09 [core.py:515] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 06-13 15:32:09 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-13 15:32:09 [core.py:515] File "/home/rzou/dev/stable0/vllm-stable0/vllm/v1/engine/core.py", line 390, in __init__
ERROR 06-13 15:32:09 [core.py:515] super().__init__(vllm_config, executor_class, log_stats,
ERROR 06-13 15:32:09 [core.py:515] File "/home/rzou/dev/stable0/vllm-stable0/vllm/v1/engine/core.py", line 83, in __init__
ERROR 06-13 15:32:09 [core.py:515] self._initialize_kv_caches(vllm_config)
ERROR 06-13 15:32:09 [core.py:515] File "/home/rzou/dev/stable0/vllm-stable0/vllm/v1/engine/core.py", line 168, in _initialize_kv_caches
ERROR 06-13 15:32:09 [core.py:515] self.model_executor.initialize_from_config(kv_cache_configs)
ERROR 06-13 15:32:09 [core.py:515] File "/home/rzou/dev/stable0/vllm-stable0/vllm/v1/executor/abstract.py", line 66, in initialize_from_config
ERROR 06-13 15:32:09 [core.py:515] self.collective_rpc("compile_or_warm_up_model")
ERROR 06-13 15:32:09 [core.py:515] File "/home/rzou/dev/stable0/vllm-stable0/vllm/v1/executor/multiproc_executor.py", line 220, in collective_rpc
ERROR 06-13 15:32:09 [core.py:515] result = get_response(w, dequeue_timeout)
ERROR 06-13 15:32:09 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-13 15:32:09 [core.py:515] File "/home/rzou/dev/stable0/vllm-stable0/vllm/v1/executor/multiproc_executor.py", line 207, in get_response
ERROR 06-13 15:32:09 [core.py:515] raise RuntimeError(
ERROR 06-13 15:32:09 [core.py:515] RuntimeError: Worker failed with error 'Expected result >= 0 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)', please check the stack trace above for the root cause
(VllmWorker rank=1 pid=3350867) ERROR 06-13 15:32:09 [multiproc_executor.py:527] File "/home/rzou/dev/stable0/vllm-stable0/vllm/compilation/cuda_piece
wise_backend.py", line 156, in __call__
(VllmWorker rank=1 pid=3350867) ERROR 06-13 15:32:09 [multiproc_executor.py:527] return entry.runnable(*args)
(VllmWorker rank=2 pid=3350868) ERROR 06-13 15:32:09 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=3350867) ERROR 06-13 15:32:09 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=3350868) ERROR 06-13 15:32:09 [multiproc_executor.py:527] File "/home/rzou/.cache/vllm/torch_compile_cache/d98525c527/rank_2_0/
inductor_cache/rl/crl3f6qy7nm5k2qs65o6f44vppuehyqkkmjhxy6q5mty7zgba2kx.py", line 1282, in call
(VllmWorker rank=7 pid=3350875) ERROR 06-13 15:32:09 [multiproc_executor.py:527] File "/home/rzou/dev/stable0/vllm-stable0/vllm/compilation/cuda_piece
wise_backend.py", line 156, in __call__
(VllmWorker rank=1 pid=3350867) ERROR 06-13 15:32:09 [multiproc_executor.py:527] File "/home/rzou/dev/stable0/vllm-stable0/vllm/compilation/compiler_i
nterface.py", line 510, in compiled_graph
(VllmWorker rank=2 pid=3350868) ERROR 06-13 15:32:09 [multiproc_executor.py:527] buf52 = empty_strided_cuda(((-32768) + s0, ), (1, ), torch.int32)
(VllmWorker rank=5 pid=3350871) ERROR 06-13 15:32:09 [multiproc_executor.py:527] return self.current_callable(inputs)
(VllmWorker rank=7 pid=3350875) ERROR 06-13 15:32:09 [multiproc_executor.py:527] return entry.runnable(*args)
(VllmWorker rank=6 pid=3350873) ERROR 06-13 15:32:09 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
(VllmWorker rank=7 pid=3350875) Exception ignored in: <function CustomAllreduce.__del__ at 0x7efceedfe2a0>
(VllmWorker rank=7 pid=3350875) Traceback (most recent call last):
(VllmWorker rank=7 pid=3350875) File "/home/rzou/dev/stable0/vllm-stable0/vllm/distributed/device_communicators/custom_all_reduce.py", line 276, in __
del__
(VllmWorker rank=7 pid=3350875) self.close()
(VllmWorker rank=7 pid=3350875) File "/home/rzou/dev/stable0/vllm-stable0/vllm/distributed/device_communicators/custom_all_reduce.py", line 272, in cl
ose
(VllmWorker rank=7 pid=3350875) self.free_shared_buffer(self.meta_ptrs, rank=self.rank)
(VllmWorker rank=7 pid=3350875) File "/home/rzou/dev/stable0/vllm-stable0/vllm/distributed/device_communicators/custom_all_reduce.py", line 304, in fr
ee_shared_buffer
(VllmWorker rank=7 pid=3350875) ops.free_shared_buffer(pointers[rank])
(VllmWorker rank=7 pid=3350875) File "/home/rzou/dev/stable0/vllm-stable0/vllm/_custom_ops.py", line 1758, in free_shared_buffer
(VllmWorker rank=7 pid=3350875) torch.ops._C_custom_ar.free_shared_buffer(ptr)
(VllmWorker rank=7 pid=3350875) File "/home/rzou/dev/stable0/env/lib/python3.12/site-packages/torch/_ops.py", line 1158, in __call__
(VllmWorker rank=7 pid=3350875) return self._op(*args, **(kwargs or {}))
(VllmWorker rank=7 pid=3350875) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=7 pid=3350875) RuntimeError: CUDA error: an illegal memory access was encountered
(VllmWorker rank=7 pid=3350875) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorker rank=7 pid=3350875) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorker rank=7 pid=3350875) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorker rank=7 pid=3350875)
(VllmWorker rank=1 pid=3350867) ERROR 06-13 15:32:09 [multiproc_executor.py:527] graph_output = inductor_compiled_graph(list_args)
(VllmWorker rank=5 pid=3350871) ERROR 06-13 15:32:09 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=3350868) ERROR 06-13 15:32:09 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=7 pid=3350875) ERROR 06-13 15:32:09 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^
I think this started with #19168. After turning off the chunking optimization, the errors go away.
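For context, the Inductor-generated wrapper in rank 2's trace allocates `buf52` with the symbolic size expression `(-32768) + s0`, which goes negative whenever `s0 < 32768`. Below is a minimal standalone sketch of why that allocation pattern alone fails; it assumes `s0` corresponds to the number of batched tokens and that nothing upstream clamps the expression (both assumptions on my part; the helper name is made up and this is not vLLM code):

```python
import torch

def alloc_buf52(s0: int) -> torch.Tensor:
    # Same shape/stride expression as the generated
    # `empty_strided_cuda(((-32768) + s0, ), (1, ), torch.int32)`:
    # for s0 < 32768 the requested size is negative, which is already invalid.
    return torch.empty_strided(((-32768) + s0,), (1,),
                               dtype=torch.int32, device="cuda")

print(alloc_buf52(40000).shape)  # ok: torch.Size([7232])
alloc_buf52(8192)                # raises: negative dimension -24576
```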
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.