
[Bug]: Docker vLLM 0.9.1 CUDA error: an illegal memory access, sampled_token_ids.tolist() #19483

@andrePankraz

Description


Your current environment

Docker on 4 x A100 SXM.
For reference: vLLM 0.8.4 ran stably with the same setup.
0.9.0.1 was already unstable (it restarted a few times a day); 0.9.1 is even worse.

services:
  vllm-qwen25-72b:
    image: vllm/vllm-openai:v0.9.1
    container_name: vllm-qwen25-72b
    environment:
     ...
      - HF_TOKEN=$HF_TOKEN
      - VLLM_NO_USAGE_STATS=1
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0', '1', '2', '3']
              capabilities: [ gpu ]
    network_mode: host
    volumes:
      - /mnt/sda/huggingface:/root/.cache/huggingface
      - .:/opt/vllm
    command:
      - --port=8000
      - --disable-log-requests
      - --model=Qwen/Qwen2.5-72B-Instruct
      # - --served-model-name=Qwen/Qwen2.5-72B-Instruct
      # - --max-model-len=32768
      - --tensor-parallel-size=4
      - --gpu-memory-utilization=0.90
      - --swap-space=5
    restart: unless-stopped

🐛 Describe the bug

See the full log below.

vLLM 0.9.1 crashes frequently with Qwen 2.5 on 4x A100 SXM.

(0.9.0.1 also crashed with "CUDA error: an illegal memory access was encountered", but much less frequently and without a clear hint as to what went wrong. 0.8.4 ran stably.)

I have no example request that reproduces this; we serve a mix of normal and guided-JSON sampling requests.
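
Schematically, a guided-JSON call against the server looks like the following sketch (placeholder schema and prompt, not a known failing request; the "normal" requests simply omit extra_body):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Placeholder schema; the real schemas vary per request.
schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",
    messages=[{"role": "user", "content": "Answer as JSON."}],
    # vLLM's guided-decoding extension on the OpenAI-compatible API.
    extra_body={"guided_json": schema},
)
print(response.choices[0].message.content)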

This might be the main problem (see the note after the snippet):

(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]     valid_sampled_token_ids = sampled_token_ids.tolist()
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fa563f785e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] RuntimeError: CUDA error: an illegal memory access was encountered
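
Note that .tolist() itself is probably not the culprit: it copies the tensor to the CPU and therefore synchronizes the device, so it is just the first point where an illegal access from an earlier asynchronous kernel (possibly the structured-output / grammar-bitmask path, given the guided requests visible in the dumped scheduler output) gets reported. A minimal sketch of that synchronization behavior (stand-in tensor, not vLLM code):

import torch

# Stand-in for the sampler output; shape, dtype and value range are arbitrary.
sampled_token_ids = torch.randint(0, 152064, (5, 1), device="cuda")

# .tolist() forces a GPU->CPU copy and thus a device synchronization; any illegal
# memory access queued by an earlier asynchronous kernel is raised here, not where it occurred.
valid_sampled_token_ids = sampled_token_ids.tolist()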

Full log:

[rank0]:[E611 01:51:09.940883637 ProcessGroupNCCL.cpp:1896] [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fa563f785e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7fa563f0d4a2 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7fa564365422 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fa4f3c8b456 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7fa4f3c9b6f0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x782 (0x7fa4f3c9d282 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fa4f3c9ee8d in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7fa4e3fb3253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7fa564c42ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7fa564cd3a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] WorkerProc hit an exception.
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] Traceback (most recent call last):
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 522, in worker_busy_loop
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]     output = func(*args, **kwargs)
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]              ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]     return func(*args, **kwargs)
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 293, in execute_model
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]     output = self.model_runner.execute_model(scheduler_output,
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]     return func(*args, **kwargs)
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1374, in execute_model
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]     valid_sampled_token_ids = sampled_token_ids.tolist()
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] RuntimeError: CUDA error: an illegal memory access was encountered
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] 
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] Traceback (most recent call last):
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 522, in worker_busy_loop
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]     output = func(*args, **kwargs)
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]              ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]     return func(*args, **kwargs)
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 293, in execute_model
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]     output = self.model_runner.execute_model(scheduler_output,
  what():  [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]     return func(*args, **kwargs)
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]            ^^^^^^^^^^^^^^^^^^^^^
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1374, in execute_model

(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]     valid_sampled_token_ids = sampled_token_ids.tolist()
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fa563f785e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] RuntimeError: CUDA error: an illegal memory access was encountered
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7fa563f0d4a2 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7fa564365422 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fa4f3c8b456 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7fa4f3c9b6f0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x782 (0x7fa4f3c9d282 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fa4f3c9ee8d in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7fa4e3fb3253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7fa564c42ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7fa564cd3a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1902 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fa563f785e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xcc7a4e (0x7fa4f3c6da4e in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x9165ed (0x7fa4f38bc5ed in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xdc253 (0x7fa4e3fb3253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: <unknown function> + 0x94ac3 (0x7fa564c42ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #5: clone + 0x44 (0x7fa564cd3a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] 
(VllmWorker rank=0 pid=226) ERROR 06-11 01:51:09 [multiproc_executor.py:527] 
ERROR 06-11 01:51:09 [dump_input.py:69] Dumping input data
ERROR 06-11 01:51:09 [dump_input.py:71] V1 LLM engine (v0.9.1) with config: model='/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-72B-Instruct/snapshots/d3d951150c1e5848237cd6a7ad11df4836aee842/', speculative_config=None, tokenizer='/root/.cache/huggingface/hub/models--Qwen--Qwen2.5-72B-Instruct/snapshots/d3d951150c1e5848237cd6a7ad11df4836aee842/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen2.5-72B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}, 
ERROR 06-11 01:51:09 [dump_input.py:79] Dumping scheduler output for model execution:
ERROR 06-11 01:51:09 [dump_input.py:80] SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=[CachedRequestData(req_id='chatcmpl-f8adb07fdf2e41e69e9be99f4f9cc7eb', resumed_from_preemption=false, new_token_ids=[374], new_block_ids=[[]], num_computed_tokens=187), CachedRequestData(req_id='chatcmpl-a0b407784af14747b4a9af20d4d69829', resumed_from_preemption=false, new_token_ids=[330], new_block_ids=[[]], num_computed_tokens=2184), CachedRequestData(req_id='chatcmpl-d52cc47002544eaa97785872789929c8', resumed_from_preemption=false, new_token_ids=[330], new_block_ids=[[]], num_computed_tokens=9828), CachedRequestData(req_id='chatcmpl-6505bbb4a369474fb64b00f9e8e36de7', resumed_from_preemption=false, new_token_ids=[1008], new_block_ids=[[]], num_computed_tokens=66), CachedRequestData(req_id='chatcmpl-835ba60f60fe4171b7cc74141ca68a31', resumed_from_preemption=false, new_token_ids=[1008], new_block_ids=[[]], num_computed_tokens=66)], num_scheduled_tokens={chatcmpl-835ba60f60fe4171b7cc74141ca68a31: 1, chatcmpl-6505bbb4a369474fb64b00f9e8e36de7: 1, chatcmpl-a0b407784af14747b4a9af20d4d69829: 1, chatcmpl-f8adb07fdf2e41e69e9be99f4f9cc7eb: 1, chatcmpl-d52cc47002544eaa97785872789929c8: 1}, total_num_scheduled_tokens=5, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0], finished_req_ids=[], free_encoder_input_ids=[], structured_output_request_ids={chatcmpl-a0b407784af14747b4a9af20d4d69829: 1, chatcmpl-d52cc47002544eaa97785872789929c8: 2}, grammar_bitmask=array([[      0,       0,       2, ...,       0,       0,       0],
ERROR 06-11 01:51:09 [dump_input.py:80]        [      0, 1507336,       0, ...,       0,       0,       0]],
ERROR 06-11 01:51:09 [dump_input.py:80]       shape=(2, 4752), dtype=int32), kv_connector_metadata=null)
ERROR 06-11 01:51:09 [dump_input.py:82] SchedulerStats(num_running_reqs=5, num_waiting_reqs=0, gpu_cache_usage=0.026913812964708295, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0), spec_decoding_stats=None)
ERROR 06-11 01:51:09 [core.py:517] EngineCore encountered a fatal error.
ERROR 06-11 01:51:09 [core.py:517] Traceback (most recent call last):
ERROR 06-11 01:51:09 [core.py:517]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 508, in run_engine_core
ERROR 06-11 01:51:09 [core.py:517]     engine_core.run_busy_loop()
ERROR 06-11 01:51:09 [core.py:517]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 535, in run_busy_loop
ERROR 06-11 01:51:09 [core.py:517]     self._process_engine_step()
ERROR 06-11 01:51:09 [core.py:517]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 560, in _process_engine_step
ERROR 06-11 01:51:09 [core.py:517]     outputs, model_executed = self.step_fn()
ERROR 06-11 01:51:09 [core.py:517]                               ^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [core.py:517]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 231, in step
ERROR 06-11 01:51:09 [core.py:517]     model_output = self.execute_model(scheduler_output)
ERROR 06-11 01:51:09 [core.py:517]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [core.py:517]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 217, in execute_model
ERROR 06-11 01:51:09 [core.py:517]     raise err
ERROR 06-11 01:51:09 [core.py:517]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 211, in execute_model
ERROR 06-11 01:51:09 [core.py:517]     return self.model_executor.execute_model(scheduler_output)
ERROR 06-11 01:51:09 [core.py:517]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [core.py:517]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 163, in execute_model
ERROR 06-11 01:51:09 [core.py:517]     (output, ) = self.collective_rpc("execute_model",
ERROR 06-11 01:51:09 [core.py:517]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [core.py:517]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 220, in collective_rpc
ERROR 06-11 01:51:09 [core.py:517]     result = get_response(w, dequeue_timeout)
ERROR 06-11 01:51:09 [core.py:517]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [core.py:517]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 207, in get_response
ERROR 06-11 01:51:09 [core.py:517]     raise RuntimeError(
ERROR 06-11 01:51:09 [core.py:517] RuntimeError: Worker failed with error 'CUDA error: an illegal memory access was encountered
ERROR 06-11 01:51:09 [core.py:517] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 06-11 01:51:09 [core.py:517] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 06-11 01:51:09 [core.py:517] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 06-11 01:51:09 [core.py:517] ', please check the stack trace above for the root cause
ERROR 06-11 01:51:09 [async_llm.py:420] AsyncLLM output_handler failed.
ERROR 06-11 01:51:09 [async_llm.py:420] Traceback (most recent call last):
ERROR 06-11 01:51:09 [async_llm.py:420]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 379, in output_handler
ERROR 06-11 01:51:09 [async_llm.py:420]     outputs = await engine_core.get_output_async()
ERROR 06-11 01:51:09 [async_llm.py:420]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [async_llm.py:420]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 790, in get_output_async
ERROR 06-11 01:51:09 [async_llm.py:420]     raise self._format_exception(outputs) from None
ERROR 06-11 01:51:09 [async_llm.py:420] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
ERROR 06-11 01:51:09 [serving_chat.py:911] Error in chat completion stream generator.
ERROR 06-11 01:51:09 [serving_chat.py:911] Traceback (most recent call last):
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 481, in chat_completion_stream_generator
ERROR 06-11 01:51:09 [serving_chat.py:911]     async for res in result_generator:
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 327, in generate
ERROR 06-11 01:51:09 [serving_chat.py:911]     out = q.get_nowait() or await q.get()
ERROR 06-11 01:51:09 [serving_chat.py:911]                             ^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 52, in get
ERROR 06-11 01:51:09 [serving_chat.py:911]     raise output
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/utils.py", line 100, in wrapper
ERROR 06-11 01:51:09 [serving_chat.py:911]     return await func(*args, **kwargs)
ERROR 06-11 01:51:09 [serving_chat.py:911]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 554, in create_chat_completion
ERROR 06-11 01:51:09 [serving_chat.py:911]     generator = await handler.create_chat_completion(request, raw_request)
ERROR 06-11 01:51:09 [serving_chat.py:911]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 268, in create_chat_completion
ERROR 06-11 01:51:09 [serving_chat.py:911]     return await self.chat_completion_full_generator(
ERROR 06-11 01:51:09 [serving_chat.py:911]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 932, in chat_completion_full_generator
ERROR 06-11 01:51:09 [serving_chat.py:911]     async for res in result_generator:
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 327, in generate
ERROR 06-11 01:51:09 [serving_chat.py:911]     out = q.get_nowait() or await q.get()
ERROR 06-11 01:51:09 [serving_chat.py:911]                             ^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 52, in get
ERROR 06-11 01:51:09 [serving_chat.py:911]     raise output
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 379, in output_handler
ERROR 06-11 01:51:09 [serving_chat.py:911]     outputs = await engine_core.get_output_async()
ERROR 06-11 01:51:09 [serving_chat.py:911]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 790, in get_output_async
ERROR 06-11 01:51:09 [serving_chat.py:911]     raise self._format_exception(outputs) from None
ERROR 06-11 01:51:09 [serving_chat.py:911] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
ERROR 06-11 01:51:09 [serving_chat.py:911] Error in chat completion stream generator.
ERROR 06-11 01:51:09 [serving_chat.py:911] Traceback (most recent call last):
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 481, in chat_completion_stream_generator
ERROR 06-11 01:51:09 [serving_chat.py:911]     async for res in result_generator:
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 327, in generate
ERROR 06-11 01:51:09 [serving_chat.py:911]     out = q.get_nowait() or await q.get()
ERROR 06-11 01:51:09 [serving_chat.py:911]                             ^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 52, in get
ERROR 06-11 01:51:09 [serving_chat.py:911]     raise output
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 481, in chat_completion_stream_generator
ERROR 06-11 01:51:09 [serving_chat.py:911]     async for res in result_generator:
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 327, in generate
ERROR 06-11 01:51:09 [serving_chat.py:911]     out = q.get_nowait() or await q.get()
ERROR 06-11 01:51:09 [serving_chat.py:911]                             ^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 52, in get
ERROR 06-11 01:51:09 [serving_chat.py:911]     raise output
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/utils.py", line 100, in wrapper
ERROR 06-11 01:51:09 [serving_chat.py:911]     return await func(*args, **kwargs)
ERROR 06-11 01:51:09 [serving_chat.py:911]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 554, in create_chat_completion
ERROR 06-11 01:51:09 [serving_chat.py:911]     generator = await handler.create_chat_completion(request, raw_request)
ERROR 06-11 01:51:09 [serving_chat.py:911]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 268, in create_chat_completion
ERROR 06-11 01:51:09 [serving_chat.py:911]     return await self.chat_completion_full_generator(
ERROR 06-11 01:51:09 [serving_chat.py:911]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 932, in chat_completion_full_generator
ERROR 06-11 01:51:09 [serving_chat.py:911]     async for res in result_generator:
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 327, in generate
ERROR 06-11 01:51:09 [serving_chat.py:911]     out = q.get_nowait() or await q.get()
ERROR 06-11 01:51:09 [serving_chat.py:911]                             ^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 52, in get
ERROR 06-11 01:51:09 [serving_chat.py:911]     raise output
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 379, in output_handler
ERROR 06-11 01:51:09 [serving_chat.py:911]     outputs = await engine_core.get_output_async()
ERROR 06-11 01:51:09 [serving_chat.py:911]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-11 01:51:09 [serving_chat.py:911]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 790, in get_output_async
ERROR 06-11 01:51:09 [serving_chat.py:911]     raise self._format_exception(outputs) from None
ERROR 06-11 01:51:09 [serving_chat.py:911] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
INFO:     127.0.0.1:46320 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO:     172.19.103.111:36678 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO:     172.19.103.111:57278 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [1]
[rank2]:[W611 01:51:09.369133081 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=86, addr=[localhost]:37972, remote=[localhost]:59835): failed to recv, got 0 bytes
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:678 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f86bc1785e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8afe (0x7f86a023cafe in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5baae40 (0x7f86a023ee40 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5bab74a (0x7f86a023f74a in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x2a9 (0x7f86a02391a9 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::heartbeatMonitor() + 0x379 (0x7f864be99989 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7f863c1b3253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7f86bce6cac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #8: clone + 0x44 (0x7f86bcefda04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[W611 01:51:09.374069347 ProcessGroupNCCL.cpp:1659] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: failed to recv, got 0 bytes
[rank3]:[W611 01:51:09.424776342 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=86, addr=[localhost]:37988, remote=[localhost]:59835): failed to recv, got 0 bytes
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:678 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fc675f785e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8afe (0x7fc65a03cafe in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5baae40 (0x7fc65a03ee40 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5bab74a (0x7fc65a03f74a in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x2a9 (0x7fc65a0391a9 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::heartbeatMonitor() + 0x379 (0x7fc605c99989 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7fc5f5fb3253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7fc676ce9ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #8: clone + 0x44 (0x7fc676d7aa04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank3]:[W611 01:51:09.429163290 ProcessGroupNCCL.cpp:1659] [PG ID 0 PG GUID 0 Rank 3] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: failed to recv, got 0 bytes
[rank1]:[W611 01:51:09.436189340 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=86, addr=[localhost]:38002, remote=[localhost]:59835): failed to recv, got 0 bytes
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:678 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7efce171e5e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8afe (0x7efd3683cafe in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5baae40 (0x7efd3683ee40 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5bab74a (0x7efd3683f74a in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x2a9 (0x7efd368391a9 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::heartbeatMonitor() + 0x379 (0x7efce2499989 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7efcd27b3253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7efd53341ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #8: clone + 0x44 (0x7efd533d2a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[W611 01:51:09.440752408 ProcessGroupNCCL.cpp:1659] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: failed to recv, got 0 bytes
nanobind: leaked 4 instances!
 - leaked instance 0x7efc44398258 of type "xgrammar.xgrammar_bindings.GrammarMatcher"
 - leaked instance 0x7efc442df2e8 of type "xgrammar.xgrammar_bindings.CompiledGrammar"
 - leaked instance 0x7efc4438b798 of type "xgrammar.xgrammar_bindings.CompiledGrammar"
 - leaked instance 0x7efc44396718 of type "xgrammar.xgrammar_bindings.GrammarMatcher"
nanobind: leaked 2 types!
 - leaked type "xgrammar.xgrammar_bindings.GrammarMatcher"
 - leaked type "xgrammar.xgrammar_bindings.CompiledGrammar"
nanobind: leaked 13 functions!
 - leaked function "fill_next_token_bitmask"
 - leaked function "rollback"
 - leaked function "__init__"
 - leaked function ""
 - leaked function ""
 - leaked function ""
 - leaked function ""
 - leaked function ""
 - leaked function "find_jump_forward_string"
 - leaked function "reset"
 - leaked function "_debug_accept_string"
 - leaked function "is_terminated"
 - leaked function "accept_token"
nanobind: this is likely caused by a reference counting issue in the binding code.
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

