Your current environment
PyTorch 2.7.0, vLLM main branch built from source.
🐛 Describe the bug
Repro:
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 --tensor-parallel-size 8 --max-num-batched-tokens 40000 --max-model-len 8192 --max-num-seqs 128 --gpu-memory-utilization 0.8
gives a CUDA illegal memory access, along with the following errors:
ERROR 06-13 15:32:09 [core.py:515] EngineCore failed to start.
ERROR 06-13 15:32:09 [core.py:515] Traceback (most recent call last):
ERROR 06-13 15:32:09 [core.py:515] File "/home/rzou/dev/stable0/vllm-stable0/vllm/v1/engine/core.py", line 506, in run_engine_core
ERROR 06-13 15:32:09 [core.py:515] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 06-13 15:32:09 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-13 15:32:09 [core.py:515] File "/home/rzou/dev/stable0/vllm-stable0/vllm/v1/engine/core.py", line 390, in __init__
ERROR 06-13 15:32:09 [core.py:515] super().__init__(vllm_config, executor_class, log_stats,
ERROR 06-13 15:32:09 [core.py:515] File "/home/rzou/dev/stable0/vllm-stable0/vllm/v1/engine/core.py", line 83, in __init__
ERROR 06-13 15:32:09 [core.py:515] self._initialize_kv_caches(vllm_config)
ERROR 06-13 15:32:09 [core.py:515] File "/home/rzou/dev/stable0/vllm-stable0/vllm/v1/engine/core.py", line 168, in _initialize_kv_caches
ERROR 06-13 15:32:09 [core.py:515] self.model_executor.initialize_from_config(kv_cache_configs)
ERROR 06-13 15:32:09 [core.py:515] File "/home/rzou/dev/stable0/vllm-stable0/vllm/v1/executor/abstract.py", line 66, in initialize_from_config
ERROR 06-13 15:32:09 [core.py:515] self.collective_rpc("compile_or_warm_up_model")
ERROR 06-13 15:32:09 [core.py:515] File "/home/rzou/dev/stable0/vllm-stable0/vllm/v1/executor/multiproc_executor.py", line 220, in collective_rpc
ERROR 06-13 15:32:09 [core.py:515] result = get_response(w, dequeue_timeout)
ERROR 06-13 15:32:09 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-13 15:32:09 [core.py:515] File "/home/rzou/dev/stable0/vllm-stable0/vllm/v1/executor/multiproc_executor.py", line 207, in get_response
ERROR 06-13 15:32:09 [core.py:515] raise RuntimeError(
ERROR 06-13 15:32:09 [core.py:515] RuntimeError: Worker failed with error 'Expected result >= 0 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)', please check the stack trace above for the root cause
(VllmWorker rank=1 pid=3350867) ERROR 06-13 15:32:09 [multiproc_executor.py:527] File "/home/rzou/dev/stable0/vllm-stable0/vllm/compilation/cuda_piece
wise_backend.py", line 156, in __call__
(VllmWorker rank=1 pid=3350867) ERROR 06-13 15:32:09 [multiproc_executor.py:527] return entry.runnable(*args)
(VllmWorker rank=2 pid=3350868) ERROR 06-13 15:32:09 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=3350867) ERROR 06-13 15:32:09 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=3350868) ERROR 06-13 15:32:09 [multiproc_executor.py:527] File "/home/rzou/.cache/vllm/torch_compile_cache/d98525c527/rank_2_0/
inductor_cache/rl/crl3f6qy7nm5k2qs65o6f44vppuehyqkkmjhxy6q5mty7zgba2kx.py", line 1282, in call
(VllmWorker rank=7 pid=3350875) ERROR 06-13 15:32:09 [multiproc_executor.py:527] File "/home/rzou/dev/stable0/vllm-stable0/vllm/compilation/cuda_piece
wise_backend.py", line 156, in __call__
(VllmWorker rank=1 pid=3350867) ERROR 06-13 15:32:09 [multiproc_executor.py:527] File "/home/rzou/dev/stable0/vllm-stable0/vllm/compilation/compiler_i
nterface.py", line 510, in compiled_graph
(VllmWorker rank=2 pid=3350868) ERROR 06-13 15:32:09 [multiproc_executor.py:527] buf52 = empty_strided_cuda(((-32768) + s0, ), (1, ), torch.int32)
(VllmWorker rank=5 pid=3350871) ERROR 06-13 15:32:09 [multiproc_executor.py:527] return self.current_callable(inputs)
(VllmWorker rank=7 pid=3350875) ERROR 06-13 15:32:09 [multiproc_executor.py:527] return entry.runnable(*args)
(VllmWorker rank=6 pid=3350873) ERROR 06-13 15:32:09 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
(VllmWorker rank=7 pid=3350875) Exception ignored in: <function CustomAllreduce.__del__ at 0x7efceedfe2a0>
(VllmWorker rank=7 pid=3350875) Traceback (most recent call last):
(VllmWorker rank=7 pid=3350875) File "/home/rzou/dev/stable0/vllm-stable0/vllm/distributed/device_communicators/custom_all_reduce.py", line 276, in __
del__
(VllmWorker rank=7 pid=3350875) self.close()
(VllmWorker rank=7 pid=3350875) File "/home/rzou/dev/stable0/vllm-stable0/vllm/distributed/device_communicators/custom_all_reduce.py", line 272, in cl
ose
(VllmWorker rank=7 pid=3350875) self.free_shared_buffer(self.meta_ptrs, rank=self.rank)
(VllmWorker rank=7 pid=3350875) File "/home/rzou/dev/stable0/vllm-stable0/vllm/distributed/device_communicators/custom_all_reduce.py", line 304, in fr
ee_shared_buffer
(VllmWorker rank=7 pid=3350875) ops.free_shared_buffer(pointers[rank])
(VllmWorker rank=7 pid=3350875) File "/home/rzou/dev/stable0/vllm-stable0/vllm/_custom_ops.py", line 1758, in free_shared_buffer
(VllmWorker rank=7 pid=3350875) torch.ops._C_custom_ar.free_shared_buffer(ptr)
(VllmWorker rank=7 pid=3350875) File "/home/rzou/dev/stable0/env/lib/python3.12/site-packages/torch/_ops.py", line 1158, in __call__
(VllmWorker rank=7 pid=3350875) return self._op(*args, **(kwargs or {}))
(VllmWorker rank=7 pid=3350875) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=7 pid=3350875) RuntimeError: CUDA error: an illegal memory access was encountered
(VllmWorker rank=7 pid=3350875) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorker rank=7 pid=3350875) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorker rank=7 pid=3350875) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorker rank=7 pid=3350875)
(VllmWorker rank=1 pid=3350867) ERROR 06-13 15:32:09 [multiproc_executor.py:527] graph_output = inductor_compiled_graph(list_args)
(VllmWorker rank=5 pid=3350871) ERROR 06-13 15:32:09 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=3350868) ERROR 06-13 15:32:09 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=7 pid=3350875) ERROR 06-13 15:32:09 [multiproc_executor.py:527] ^^^^^^^^^^^^^^^^^^^^^
I think this started with #19168. After turning off the chunking optimization, the errors go away.
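For context, the Inductor-generated wrapper in rank 2's trace allocates `buf52` with the symbolic size expression `(-32768) + s0`, which goes negative whenever `s0 < 32768`. Below is a minimal standalone sketch of why that allocation pattern alone fails; it assumes `s0` corresponds to the number of batched tokens and that nothing upstream clamps the expression (both assumptions on my part; the helper name is made up and this is not vLLM code):

```python
import torch

def alloc_buf52(s0: int) -> torch.Tensor:
    # Same shape/stride expression as the generated
    # `empty_strided_cuda(((-32768) + s0, ), (1, ), torch.int32)`:
    # for s0 < 32768 the requested size is negative, which is already invalid.
    return torch.empty_strided(((-32768) + s0,), (1,),
                               dtype=torch.int32, device="cuda")

print(alloc_buf52(40000).shape)  # ok: torch.Size([7232])
alloc_buf52(8192)                # raises: negative dimension -24576
```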
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.