### Your current environment
The output of `python collect_env.py` was not provided.
### 🐛 Describe the bug
I'm running the Qwen2.5-VL model with vLLM 0.7.3 and the V1 engine enabled. The engine initializes with the following config:

```text
Initializing a V1 LLM engine (v0.7.3) with config: model='/qwen2_5-vl-72b', speculative_config=None, tokenizer='/qwen2_5-vl-72b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32000, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/qwen2_5-vl-72b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}
```

I then replay a request trace against the engine. It works fine for a while (roughly 100 requests succeed), but then a CUDA kernel assertion fires and the workers crash. The crash is reproducible when I re-run the same trace.
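For context, here is a minimal offline sketch of how the engine is driven during the trace replay. The engine arguments mirror the config logged above; the sampling parameters and prompts are placeholders, since the real trace replays multimodal (image + text) requests with per-request sampling settings.

```python
from vllm import LLM, SamplingParams

# Engine arguments mirror the logged config; everything else below is a placeholder.
llm = LLM(
    model="/qwen2_5-vl-72b",
    trust_remote_code=True,
    tensor_parallel_size=4,
    quantization="fp8",
    max_model_len=32000,
    seed=0,
)

# Placeholder sampling parameters; the real trace uses per-request values.
sampling = SamplingParams(temperature=0.7, top_p=0.9, top_k=50, max_tokens=512)

# The real trace replays image + text requests; plain text prompts are used
# here only to keep the sketch self-contained.
prompts = ["Describe the image."] * 100
outputs = llm.generate(prompts, sampling)
```

The crash output from the workers follows.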
```text
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [1,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] WorkerProc hit an exception: %s
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] Traceback (most recent call last):
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] File "/usr/local/python-3.10.14/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 370, in worker_busy_loop
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] output = func(*args, **kwargs)
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] File "/usr/local/python-3.10.14/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] return func(*args, **kwargs)
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] File "/usr/local/python-3.10.14/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 227, in execute_model
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] output = self.model_runner.execute_model(scheduler_output)
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] File "/usr/local/python-3.10.14/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] return func(*args, **kwargs)
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] File "/usr/local/python-3.10.14/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 957, in execute_model
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] sampler_output = self.model.sample(
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] File "/usr/local/python-3.10.14/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_5_vl.py", line 1091, in sample
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] return self.language_model.sample(logits, sampling_metadata)
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] File "/usr/local/python-3.10.14/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 505, in sample
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] next_tokens = self.sampler(logits, sampling_metadata)
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] File "/usr/local/python-3.10.14/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] return self._call_impl(*args, **kwargs)
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] File "/usr/local/python-3.10.14/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] return forward_call(*args, **kwargs)
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] File "/usr/local/python-3.10.14/lib/python3.10/site-packages/vllm/v1/sample/sampler.py", line 55, in forward
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] sampled = self.sample(logits, sampling_metadata)
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] File "/usr/local/python-3.10.14/lib/python3.10/site-packages/vllm/v1/sample/sampler.py", line 111, in sample
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] random_sampled = self.topk_topp_sampler(
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] File "/usr/local/python-3.10.14/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] return self._call_impl(*args, **kwargs)
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] File "/usr/local/python-3.10.14/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] return forward_call(*args, **kwargs)
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] File "/usr/local/python-3.10.14/lib/python3.10/site-packages/vllm/v1/sample/ops/topk_topp_sampler.py", line 62, in forward_native
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] logits = apply_top_k_top_p(logits, k, p)
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] File "/usr/local/python-3.10.14/lib/python3.10/site-packages/vllm/v1/sample/ops/topk_topp_sampler.py", line 100, in apply_top_k_top_p
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] top_k_mask = logits_sort.gather(1, top_k_mask.unsqueeze(dim=1))
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] RuntimeError: CUDA error: device-side assert triggered
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorker rank=3 pid=271990) ERROR 03-04 01:21:00 multiproc_executor.py:374]
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [1,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [1,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [1,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f551c56c446 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f551c5166e4 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f551d0a5a18 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f54cdfc7726 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f54cdfcc3f0 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f54cdfd3b5a in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f54cdfd561d in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x145c0 (0x7f551d4c95c0 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x93fb (0x7f551ebd63fb in /usr/lib64/libpthread.so.0)
frame #9: clone + 0x43 (0x7f551e51be83 in /usr/lib64/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
[rank2]:[E304 01:21:01.867394363 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f9272b6c446 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f9272b166e4 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f92732a5a18 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f92241c7726 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f92241cc3f0 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f92241d3b5a in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f92241d561d in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x145c0 (0x7f9274e225c0 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x93fb (0x7f9274ebc3fb in /usr/lib64/libpthread.so.0)
frame #9: clone + 0x43 (0x7f927471be83 in /usr/lib64/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
[rank3]:[E304 01:21:01.867443735 ProcessGroupNCCL.cpp:1595] [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f782c96c446 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f782c9166e4 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f782d4a5a18 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f77de3c7726 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f77de3cc3f0 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f77de3d3b5a in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f77de3d561d in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x145c0 (0x7f782d86f5c0 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x93fb (0x7f782ebbb3fb in /usr/lib64/libpthread.so.0)
frame #9: clone + 0x43 (0x7f782e8c1e83 in /usr/lib64/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f551c56c446 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f551c5166e4 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f551d0a5a18 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f54cdfc7726 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f54cdfcc3f0 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f54cdfd3b5a in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f54cdfd561d in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x145c0 (0x7f551d4c95c0 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x93fb (0x7f551ebd63fb in /usr/lib64/libpthread.so.0)
frame #9: clone + 0x43 (0x7f551e51be83 in /usr/lib64/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f551c56c446 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7f54cdc4271b in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7f551d4c95c0 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x93fb (0x7f551ebd63fb in /usr/lib64/libpthread.so.0)
frame #4: clone + 0x43 (0x7f551e51be83 in /usr/lib64/libc.so.6)
what(): [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f9272b6c446 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f9272b166e4 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f92732a5a18 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f92241c7726 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f92241cc3f0 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f92241d3b5a in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f92241d561d in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x145c0 (0x7f9274e225c0 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x93fb (0x7f9274ebc3fb in /usr/lib64/libpthread.so.0)
frame #9: clone + 0x43 (0x7f927471be83 in /usr/lib64/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f9272b6c446 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7f9223e4271b in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7f9274e225c0 in /usr/local/python-3.10.14/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x93fb (0x7f9274ebc3fb in /usr/lib64/libpthread.so.0)
frame #4: clone + 0x43 (0x7f927471be83 in /usr/lib64/libc.so.6)
what(): [PG ID 2 PG GUID 3 Rank 3] Process group watchdog thread terminated with exception: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
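The failure point is the `gather` call in `apply_top_k_top_p` (`vllm/v1/sample/ops/topk_topp_sampler.py`), which suggests the index derived from the per-request top-k values ends up outside the valid range of the sorted logits. Below is a standalone sketch (hypothetical shapes and values, not taken from the actual run and not the vLLM code itself) that reproduces the same `ScatterGatherKernel.cu` device-side assert when a gather index is out of range:

```python
import torch

# Hypothetical shapes; the real tensors come from vLLM's sampler.
vocab_size = 8
logits_sort = torch.randn(2, vocab_size, device="cuda")

# An index tensor where the second entry is deliberately out of range,
# mimicking a corrupted or out-of-bounds top-k index.
bad_index = torch.tensor([3, vocab_size], device="cuda")  # vocab_size is invalid

# This gather triggers the same assertion seen in the log:
#   idx_dim >= 0 && idx_dim < index_size && "index out of bounds"
out = logits_sort.gather(1, bad_index.unsqueeze(dim=1))
torch.cuda.synchronize()  # the async device-side assert surfaces here
```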
### Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.