
[Bug]: Docker vLLM 0.9.1 CUDA error: an illegal memory access, sampled_token_ids.tolist() #19483

@andrePankraz

Description


Your current environment

Docker on 4 x A100 SXM.
For reference: vLLM 0.8.4 ran stably with the same setup.
0.9.0.1 was already unstable (it restarted a few times a day); 0.9.1 is even worse.

services:
  vllm-qwen25-72b:
    image: vllm/vllm-openai:v0.9.1
    container_name: vllm-qwen25-72b
    environment:
     ...
      - HF_TOKEN=$HF_TOKEN
      - VLLM_NO_USAGE_STATS=1
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0', '1', '2', '3']
              capabilities: [ gpu ]
    network_mode: host
    volumes:
      - /mnt/sda/huggingface:/root/.cache/huggingface
      - .:/opt/vllm
    command:
      - --port=8000
      - --disable-log-requests
      - --model=Qwen/Qwen2.5-72B-Instruct
      # - --served-model-name=Qwen/Qwen2.5-72B-Instruct
      # - --max-model-len=32768
      - --tensor-parallel-size=4
      - --gpu-memory-utilization=0.90
      - --swap-space=5
    restart: unless-stopped

🐛 Describe the bug

See the full log below.

vLLM 0.9.1 crashes frequently with Qwen 2.5 on 4x A100 SXM.

(0.9.0.1 also crashed with "CUDA error: an illegal memory access was encountered", but much less frequently and without a clear hint as to what went wrong. 0.8.4 ran stably.)

I have no example request that reproduces this; we serve a mix of normal and guided-JSON sampling requests.
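
Schematically, a guided-JSON call against the server looks like the following sketch (placeholder schema and prompt, not a known failing request; the "normal" requests simply omit extra_body):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Placeholder schema; the real schemas vary per request.
schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",
    messages=[{"role": "user", "content": "Answer as JSON."}],
    # vLLM's guided-decoding extension on the OpenAI-compatible API.
    extra_body={"guided_json": schema},
)
print(response.choices[0].message.content)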

This might be the main problem (see the note after the snippet):

(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]     valid_sampled_token_ids = sampled_token_ids.tolist()
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fa563f785e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] RuntimeError: CUDA error: an illegal memory access was encountered
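
Note that .tolist() itself is probably not the culprit: it copies the tensor to the CPU and therefore synchronizes the device, so it is just the first point where an illegal access from an earlier asynchronous kernel (possibly the structured-output / grammar-bitmask path, given the guided requests visible in the dumped scheduler output) gets reported. A minimal sketch of that synchronization behavior (stand-in tensor, not vLLM code):

import torch

# Stand-in for the sampler output; shape, dtype and value range are arbitrary.
sampled_token_ids = torch.randint(0, 152064, (5, 1), device="cuda")

# .tolist() forces a GPU->CPU copy and thus a device synchronization; any illegal
# memory access queued by an earlier asynchronous kernel is raised here, not where it occurred.
valid_sampled_token_ids = sampled_token_ids.tolist()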

Full log:

[rank0]:[E611 01:51:09.940883637 ProcessGroupNCCL.cpp:1896] [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fa563f785e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7fa563f0d4a2 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7fa564365422 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fa4f3c8b456 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7fa4f3c9b6f0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x782 (0x7fa4f3c9d282 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fa4f3c9ee8d in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7fa4e3fb3253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7fa564c42ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7fa564cd3a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] WorkerProc hit an exception.
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] Traceback (most recent call last):
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 522, in worker_busy_loop
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]     output = func(*args, **kwargs)
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]              ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]     return func(*args, **kwargs)
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 293, in execute_model
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]     output = self.model_runner.execute_model(scheduler_output,
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]     return func(*args, **kwargs)
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1374, in execute_model
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]     valid_sampled_token_ids = sampled_token_ids.tolist()
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] RuntimeError: CUDA error: an illegal memory access was encountered
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] 
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] Traceback (most recent call last):
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 522, in worker_busy_loop
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]     output = func(*args, **kwargs)
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]              ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]     return func(*args, **kwargs)
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 293, in execute_model
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]     output = self.model_runner.execute_model(scheduler_output,
  what():  [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]     return func(*args, **kwargs)
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]            ^^^^^^^^^^^^^^^^^^^^^
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1374, in execute_model

(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]     valid_sampled_token_ids = sampled_token_ids.tolist()
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fa563f785e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] RuntimeError: CUDA error: an illegal memory access was encountered
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7fa563f0d4a2 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7fa564365422 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fa4f3c8b456 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7fa4f3c9b6f0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x782 (0x7fa4f3c9d282 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fa4f3c9ee8d in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7fa4e3fb3253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7fa564c42ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7fa564cd3a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1902 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fa563f785e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xcc7a4e (0x7fa4f3c6da4e in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x9165ed (0x7fa4f38bc5ed in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xdc253 (0x7fa4e3fb3253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: <unknown function> + 0x94ac3 (0x7fa564c42ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #5: clone + 0x44 (0x7fa564cd3a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] 
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] 
ERROR 06-11 01:51:09 [dump_input.py:69] Dumping input data
ERROR 06-11 01:51:09 [dump_input.py:71] V1 LLM engine (v0.9.1) with config: model='/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-72B-Instruct/snapshots/d3d951150c1e5848237cd6a7ad11df4836aee842/', speculative_config=None, tokenizer='/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-72B-Instruct/snapshots/d3d951150c1e5848237cd6a7ad11df4836aee842/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen2.5-72B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}, 
ERROR 06-11 01:51:09 [dump_input.py:79] Dumping scheduler output for model execution:
ERROR 06-11 01:51:09 [dump_input.py:80] SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=[CachedRequestData(req_id='chatcmpl-f8adb07fdf2e41e69e9be99f4f9cc7eb', resumed_from_preemption=false, new_token_ids=[374], new_block_ids=[[]], num_computed_tokens=187), CachedRequestData(req_id='chatcmpl-a0b407784af14747b4a9af20d4d69829', resumed_from_preemption=false, new_token_ids=[330], new_block_ids=[[]], num_computed_tokens=2184), CachedRequestData(req_id='chatcmpl-d52cc47002544eaa97785872789929c8', resumed_from_preemption=false, new_token_ids=[330], new_block_ids=[[]], num_computed_tokens=9828), CachedRequestData(req_id='chatcmpl-6505bbb4a369474fb64b00f9e8e36de7', resumed_from_preemption=false, new_token_ids=[1008], new_block_ids=[[]], num_computed_tokens=66), CachedRequestData(req_id='chatcmpl-835ba60f60fe4171b7cc74141ca68a31', resumed_from_preemption=false, new_token_ids=[1008], new_block_ids=[[]], num_computed_tokens=66)], num_scheduled_tokens={chatcmpl-835ba60f60fe4171b7cc74141ca68a31: 1, chatcmpl-6505bbb4a369474fb64b00f9e8e36de7: 1, chatcmpl-a0b407784af14747b4a9af20d4d69829: 1, chatcmpl-f8adb07fdf2e41e69e9be99f4f9cc7eb: 1, chatcmpl-d52cc47002544eaa97785872789929c8: 1}, total_num_scheduled_tokens=5, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0], finished_req_ids=[], free_encoder_input_ids=[], structured_output_request_ids={chatcmpl-a0b407784af14747b4a9af20d4d69829: 1, chatcmpl-d52cc47002544eaa97785872789929c8: 2}, grammar_bitmask=array([[      0,       0,       2, ...,       0,       0,       0],
ERROR 06-11 01:51:09 [dump_input.py:80]        [      0, 1507336,       0, ...,       0,       0,       0]],
ERROR 06-11 01:51:09 [dump_input.py:80]       shape=(2, 4752), dtype=int32), kv_connector_metadata=null)
ERROR 06-11 01:51:09 [dump_input.py:82] SchedulerStats(num_running_reqs=5, num_waiting_reqs=0, gpu_cache_usage=0.026913812964708295, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0), spec_decoding_stats=None)
ERROR 06-11 01:51:09 [core.py:517] EngineCore encountered a fatal error.
ERROR 06-11 01:51:09 [core.py:517] Traceback (most recent call last):
ERROR 06-11 01:51:09 [core.py:517]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 508, in run_engine_core
ERROR 06-11 01:51:09 [core.py:517]     engine_core.run_busy_loop()
ERROR 06-11 01:51:09 [core.py:517]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 535, in run_busy_loop
ERROR 06-11 01:51:09 [core.py:517]     self._process_engine_step()
ERROR 06-11 01:51:09 [core.py:517]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 560, in _process_engine_step
ERROR 06-11 01:51:09 [core.py:517]     outputs, model_executed = self.step_fn()
ERROR 06-11 01:51:09 [core.py:517]                               ^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [core.py:517]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 231, in step
ERROR 06-11 01:51:09 [core.py:517]     model_output = self.execute_model(scheduler_output)
ERROR 06-11 01:51:09 [core.py:517]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [core.py:517]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 217, in execute_model
ERROR 06-11 01:51:09 [core.py:517]     raise err
ERROR 06-11 01:51:09 [core.py:517]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 211, in execute_model
ERROR 06-11 01:51:09 [core.py:517]     return self.model_executor.execute_model(scheduler_output)
ERROR 06-11 01:51:09 [core.py:517]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [core.py:517]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 163, in execute_model
ERROR 06-11 01:51:09 [core.py:517]     (output, ) = self.collective_rpc("execute_model",
ERROR 06-11 01:51:09 [core.py:517]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [core.py:517]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 220, in collective_rpc
ERROR 06-11 01:51:09 [core.py:517]     result = get_response(w, dequeue_timeout)
ERROR 06-11 01:51:09 [core.py:517]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [core.py:517]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 207, in get_response
ERROR 06-11 01:51:09 [core.py:517]     raise RuntimeError(
ERROR 06-11 01:51:09 [core.py:517] RuntimeError: Worker failed with error 'CUDA error: an illegal memory access was encountered
ERROR 06-11 01:51:09 [core.py:517] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 06-11 01:51:09 [core.py:517] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 06-11 01:51:09 [core.py:517] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 06-11 01:51:09 [core.py:517] ', please check the stack trace above for the root cause
ERROR 06-11 01:51:09 [async_llm.py:420] AsyncLLM output_handler failed.
ERROR 06-11 01:51:09 [async_llm.py:420] Traceback (most recent call last):
ERROR 06-11 01:51:09 [async_llm.py:420]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 379, in output_handler
ERROR 06-11 01:51:09 [async_llm.py:420]     outputs = await engine_core.get_output_async()
ERROR 06-11 01:51:09 [async_llm.py:420]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [async_llm.py:420]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 790, in get_output_async
ERROR 06-11 01:51:09 [async_llm.py:420]     raise self._format_exception(outputs) from None
ERROR 06-11 01:51:09 [async_llm.py:420] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
ERROR 06-11 01:51:09 [serving_chat.py:911] Error in chat completion stream generator.
ERROR 06-11 01:51:09 [serving_chat.py:911] Traceback (most recent call last):
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 481, in chat_completion_stream_generator
ERROR 06-11 01:51:09 [serving_chat.py:911]     async for res in result_generator:
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 327, in generate
ERROR 06-11 01:51:09 [serving_chat.py:911]     out = q.get_nowait() or await q.get()
ERROR 06-11 01:51:09 [serving_chat.py:911]                             ^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 52, in get
ERROR 06-11 01:51:09 [serving_chat.py:911]     raise output
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/utils.py", line 100, in wrapper
ERROR 06-11 01:51:09 [serving_chat.py:911]     return await func(*args, **kwargs)
ERROR 06-11 01:51:09 [serving_chat.py:911]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 554, in create_chat_completion
ERROR 06-11 01:51:09 [serving_chat.py:911]     generator = await handler.create_chat_completion(request, raw_request)
ERROR 06-11 01:51:09 [serving_chat.py:911]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 268, in create_chat_completion
ERROR 06-11 01:51:09 [serving_chat.py:911]     return await self.chat_completion_full_generator(
ERROR 06-11 01:51:09 [serving_chat.py:911]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 932, in chat_completion_full_generator
ERROR 06-11 01:51:09 [serving_chat.py:911]     async for res in result_generator:
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 327, in generate
ERROR 06-11 01:51:09 [serving_chat.py:911]     out = q.get_nowait() or await q.get()
ERROR 06-11 01:51:09 [serving_chat.py:911]                             ^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 52, in get
ERROR 06-11 01:51:09 [serving_chat.py:911]     raise output
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 379, in output_handler
ERROR 06-11 01:51:09 [serving_chat.py:911]     outputs = await engine_core.get_output_async()
ERROR 06-11 01:51:09 [serving_chat.py:911]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 790, in get_output_async
ERROR 06-11 01:51:09 [serving_chat.py:911]     raise self._format_exception(outputs) from None
ERROR 06-11 01:51:09 [serving_chat.py:911] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
ERROR 06-11 01:51:09 [serving_chat.py:911] Error in chat completion stream generator.
ERROR 06-11 01:51:09 [serving_chat.py:911] Traceback (most recent call last):
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 481, in chat_completion_stream_generator
ERROR 06-11 01:51:09 [serving_chat.py:911]     async for res in result_generator:
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 327, in generate
ERROR 06-11 01:51:09 [serving_chat.py:911]     out = q.get_nowait() or await q.get()
ERROR 06-11 01:51:09 [serving_chat.py:911]                             ^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 52, in get
ERROR 06-11 01:51:09 [serving_chat.py:911]     raise output
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 481, in chat_completion_stream_generator
ERROR 06-11 01:51:09 [serving_chat.py:911]     async for res in result_generator:
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 327, in generate
ERROR 06-11 01:51:09 [serving_chat.py:911]     out = q.get_nowait() or await q.get()
ERROR 06-11 01:51:09 [serving_chat.py:911]                             ^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 52, in get
ERROR 06-11 01:51:09 [serving_chat.py:911]     raise output
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/utils.py", line 100, in wrapper
ERROR 06-11 01:51:09 [serving_chat.py:911]     return await func(*args, **kwargs)
ERROR 06-11 01:51:09 [serving_chat.py:911]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 554, in create_chat_completion
ERROR 06-11 01:51:09 [serving_chat.py:911]     generator = await handler.create_chat_completion(request, raw_request)
ERROR 06-11 01:51:09 [serving_chat.py:911]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 268, in create_chat_completion
ERROR 06-11 01:51:09 [serving_chat.py:911]     return await self.chat_completion_full_generator(
ERROR 06-11 01:51:09 [serving_chat.py:911]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 932, in chat_completion_full_generator
ERROR 06-11 01:51:09 [serving_chat.py:911]     async for res in result_generator:
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 327, in generate
ERROR 06-11 01:51:09 [serving_chat.py:911]     out = q.get_nowait() or await q.get()
ERROR 06-11 01:51:09 [serving_chat.py:911]                             ^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 52, in get
ERROR 06-11 01:51:09 [serving_chat.py:911]     raise output
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 379, in output_handler
ERROR 06-11 01:51:09 [serving_chat.py:911]     outputs = await engine_core.get_output_async()
ERROR 06-11 01:51:09 [serving_chat.py:911]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 790, in get_output_async
ERROR 06-11 01:51:09 [serving_chat.py:911]     raise self._format_exception(outputs) from None
ERROR 06-11 01:51:09 [serving_chat.py:911] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
INFO:     127.0.0.1:46320 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO:     172.19.103.111:36678 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO:     172.19.103.111:57278 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [1]
[rank2]:[W611 01:51:09.369133081 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=86, addr=[localhost]:37972, remote=[localhost]:59835): failed to recv, got 0 bytes
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:678 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f86bc1785e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8afe (0x7f86a023cafe in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5baae40 (0x7f86a023ee40 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5bab74a (0x7f86a023f74a in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x2a9 (0x7f86a02391a9 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::heartbeatMonitor() + 0x379 (0x7f864be99989 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7f863c1b3253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7f86bce6cac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #8: clone + 0x44 (0x7f86bcefda04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[W611 01:51:09.374069347 ProcessGroupNCCL.cpp:1659] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: failed to recv, got 0 bytes
[rank3]:[W611 01:51:09.424776342 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=86, addr=[localhost]:37988, remote=[localhost]:59835): failed to recv, got 0 bytes
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:678 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fc675f785e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8afe (0x7fc65a03cafe in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5baae40 (0x7fc65a03ee40 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5bab74a (0x7fc65a03f74a in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x2a9 (0x7fc65a0391a9 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::heartbeatMonitor() + 0x379 (0x7fc605c99989 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7fc5f5fb3253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7fc676ce9ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #8: clone + 0x44 (0x7fc676d7aa04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank3]:[W611 01:51:09.429163290 ProcessGroupNCCL.cpp:1659] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: failed to recv, got 0 bytes
[rank1]:[W611 01:51:09.436189340 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=86, addr=[localhost]:38002, remote=[localhost]:59835): failed to recv, got 0 bytes
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:678 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7efce171e5e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8afe (0x7efd3683cafe in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5baae40 (0x7efd3683ee40 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5bab74a (0x7efd3683f74a in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x2a9 (0x7efd368391a9 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::heartbeatMonitor() + 0x379 (0x7efce2499989 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7efcd27b3253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7efd53341ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #8: clone + 0x44 (0x7efd533d2a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[W611 01:51:09.440752408 ProcessGroupNCCL.cpp:1659] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: failed to recv, got 0 bytes
nanobind: leaked 4 instances!
 - leaked instance 0x7efc44398258 of type "xgrammar.xgrammar_bindings.GrammarMatcher"
 - leaked instance 0x7efc442df2e8 of type "xgrammar.xgrammar_bindings.CompiledGrammar"
 - leaked instance 0x7efc4438b798 of type "xgrammar.xgrammar_bindings.CompiledGrammar"
 - leaked instance 0x7efc44396718 of type "xgrammar.xgrammar_bindings.GrammarMatcher"
nanobind: leaked 2 types!
 - leaked type "xgrammar.xgrammar_bindings.GrammarMatcher"
 - leaked type "xgrammar.xgrammar_bindings.CompiledGrammar"
nanobind: leaked 13 functions!
 - leaked function "fill_next_token_bitmask"
 - leaked function "rollback"
 - leaked function "__init__"
 - leaked function ""
 - leaked function ""
 - leaked function ""
 - leaked function ""
 - leaked function ""
 - leaked function "find_jump_forward_string"
 - leaked function "reset"
 - leaked function "_debug_accept_string"
 - leaked function "is_terminated"
 - leaked function "accept_token"
nanobind: this is likely caused by a reference counting issue in the binding code.
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

