
[Bug]: RuntimeError: CUDA error: an illegal memory access was encountered #6976

Closed
chenchunhui97 opened this issue Jul 31, 2024 · 10 comments
Labels: bug (Something isn't working)

Comments


chenchunhui97 commented Jul 31, 2024

Your current environment

vLLM Docker image v0.5.0.post1
GPU: NVIDIA RTX 4090
CUDA driver version: 535.86.10

Model: Qwen1.5-14B-Chat-AWQ, with --enable-prefix-caching
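
For reference, a minimal offline-inference sketch of this configuration (the Hugging Face model id and prompt are assumptions, not the reporter's exact serving command) that exercises prefix caching on an AWQ model:

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen1.5-14B-Chat-AWQ",  # assumed model id
    quantization="awq",
    enable_prefix_caching=True,         # the option correlated with the crash
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)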

🐛 Describe the bug

ERROR 07-31 15:13:06 async_llm_engine.py:61] Engine background task failed
ERROR 07-31 15:13:06 async_llm_engine.py:61] Traceback (most recent call last):
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 51, in _log_task_completion
ERROR 07-31 15:13:06 async_llm_engine.py:61] return_value = task.result()
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 548, in run_engine_loop
ERROR 07-31 15:13:06 async_llm_engine.py:61] has_requests_in_progress = await asyncio.wait_for(
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
ERROR 07-31 15:13:06 async_llm_engine.py:61] return fut.result()
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 522, in engine_step
ERROR 07-31 15:13:06 async_llm_engine.py:61] request_outputs = await self.engine.step_async()
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 244, in step_async
ERROR 07-31 15:13:06 async_llm_engine.py:61] output = await self.model_executor.execute_model_async(
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
ERROR 07-31 15:13:06 async_llm_engine.py:61] output = await make_async(self.driver_worker.execute_model
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 07-31 15:13:06 async_llm_engine.py:61] result = self.fn(*self.args, **self.kwargs)
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 07-31 15:13:06 async_llm_engine.py:61] return func(*args, **kwargs)
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 280, in execute_model
ERROR 07-31 15:13:06 async_llm_engine.py:61] output = self.model_runner.execute_model(seq_group_metadata_list,
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 07-31 15:13:06 async_llm_engine.py:61] return func(*args, **kwargs)
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 749, in execute_model
ERROR 07-31 15:13:06 async_llm_engine.py:61] hidden_states = model_executable(
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 07-31 15:13:06 async_llm_engine.py:61] return self._call_impl(*args, **kwargs)
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 07-31 15:13:06 async_llm_engine.py:61] return forward_call(*args, **kwargs)
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 330, in forward
ERROR 07-31 15:13:06 async_llm_engine.py:61] hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 07-31 15:13:06 async_llm_engine.py:61] return self._call_impl(*args, **kwargs)
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 07-31 15:13:06 async_llm_engine.py:61] return forward_call(*args, **kwargs)
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 254, in forward
ERROR 07-31 15:13:06 async_llm_engine.py:61] hidden_states, residual = layer(
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 07-31 15:13:06 async_llm_engine.py:61] return self._call_impl(*args, **kwargs)
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 07-31 15:13:06 async_llm_engine.py:61] return forward_call(*args, **kwargs)
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 206, in forward
ERROR 07-31 15:13:06 async_llm_engine.py:61] hidden_states = self.self_attn(
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 07-31 15:13:06 async_llm_engine.py:61] return self._call_impl(*args, **kwargs)
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 07-31 15:13:06 async_llm_engine.py:61] return forward_call(*args, **kwargs)
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 153, in forward
ERROR 07-31 15:13:06 async_llm_engine.py:61] attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 07-31 15:13:06 async_llm_engine.py:61] return self._call_impl(*args, **kwargs)
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 07-31 15:13:06 async_llm_engine.py:61] return forward_call(*args, **kwargs)
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 89, in forward
ERROR 07-31 15:13:06 async_llm_engine.py:61] return self.impl.forward(query, key, value, kv_cache, attn_metadata,
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flash_attn.py", line 339, in forward
ERROR 07-31 15:13:06 async_llm_engine.py:61] output[:num_prefill_tokens] = flash_attn_varlen_func(
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 1099, in flash_attn_varlen_func
ERROR 07-31 15:13:06 async_llm_engine.py:61] return FlashAttnVarlenFunc.apply(
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 598, in apply
ERROR 07-31 15:13:06 async_llm_engine.py:61] return super().apply(*args, **kwargs) # type: ignore[misc]
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 596, in forward
ERROR 07-31 15:13:06 async_llm_engine.py:61] out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(
ERROR 07-31 15:13:06 async_llm_engine.py:61] File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 88, in _flash_attn_varlen_forward
ERROR 07-31 15:13:06 async_llm_engine.py:61] out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
ERROR 07-31 15:13:06 async_llm_engine.py:61] RuntimeError: CUDA error: an illegal memory access was encountered
ERROR 07-31 15:13:06 async_llm_engine.py:61] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
ERROR 07-31 15:13:06 async_llm_engine.py:61]
Exception in callback functools.partial(<function _log_task_completion at 0x7f47bdda4ca0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f47bb414580>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7f47bdda4ca0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f47bb414580>>)>
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 51, in _log_task_completion
return_value = task.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 548, in run_engine_loop
has_requests_in_progress = await asyncio.wait_for(
File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
return fut.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 522, in engine_step
request_outputs = await self.engine.step_async()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 244, in step_async
output = await self.model_executor.execute_model_async(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
output = await make_async(self.driver_worker.execute_model
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 280, in execute_model
output = self.model_runner.execute_model(seq_group_metadata_list,
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 749, in execute_model
hidden_states = model_executable(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
INFO 07-31 15:13:06 async_llm_engine.py:176] Aborted request cmpl-9ee39e0e594c4e7c817ce54f27d62a41.
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 330, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 254, in forward
hidden_states, residual = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 206, in forward
hidden_states = self.self_attn(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
INFO 07-31 15:13:06 async_llm_engine.py:176] Aborted request cmpl-8f691facb4ad41d08a2b1816d63b9a37.
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 153, in forward
attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/attention/layer.py", line 89, in forward
return self.impl.forward(query, key, value, kv_cache, attn_metadata,
File "/usr/local/lib/python3.10/dist-packages/vllm/attention/backends/flash_attn.py", line 339, in forward
output[:num_prefill_tokens] = flash_attn_varlen_func(
File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 1099, in flash_attn_varlen_func
return FlashAttnVarlenFunc.apply(
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 598, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 596, in forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(
File "/usr/local/lib/python3.10/dist-packages/vllm_flash_attn/flash_attn_interface.py", line 88, in _flash_attn_varlen_forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
RuntimeError: CUDA error: an illegal memory access was encountered
INFO 07-31 15:13:06 async_llm_engine.py:176] Aborted request cmpl-b479d70e16ba4daa8bf07a1d3c0bb295.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

chenchunhui97 added the bug label on Jul 31, 2024
@chenchunhui97
Author

When I disable prefix caching it works better; this error is no longer raised.


weimakeit commented Jul 31, 2024

Same issue here, using a GPTQ model with prefix caching enabled. I have tried vllm 0.5.0.post1, 0.5.2, and 0.5.3.post1, all with the gptq-marlin kernel working.
With the xformers backend it seems fine, but inference is slower than with the flash-attention backend.
My use case is offline batch inference.

@markovalexander

Same error with Qwen2-72B, Llama-2-70B, and Mixtral (both 8x7B and 8x22B). Using the xformers backend helps but significantly slows inference.

@markovalexander

The issue appears only under high load, though, when the server receives many parallel requests.

@JaheimLee

Same issue. Switching to the FlashInfer backend also works fine.
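
Several comments above work around the crash by forcing a non-FlashAttention backend. A minimal sketch of that workaround, assuming the environment variable is set before the engine is constructed (the model id is a placeholder):

import os

# vLLM reads this variable when selecting its attention backend; "XFORMERS" and
# "FLASHINFER" are the two backends reported above as not triggering the error.
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen1.5-14B-Chat-AWQ",  # placeholder model id
    quantization="awq",
    enable_prefix_caching=True,
)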

@chenchunhui97
Author

It seems to be solved in v0.5.4.

@TangJiakai

"It seems to be solved in v0.5.4"

No, I still hit this problem in version 0.6.1.post2.

@TragedyN

I hit this problem in version 0.6.1.post2, running with --num_scheduler_steps 8 and --enable_prefix_caching True:

2024-09-25 10:20:39,088 vllm.engine.async_llm_engine 2104 ERROR Engine background task failed
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 48, in _log_task_completion
return_value = task.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 733, in run_engine_loop
result = task.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 673, in engine_step
request_outputs = await self.engine.step_async(virtual_engine)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 340, in step_async
outputs = await self.model_executor.execute_model_async(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 185, in execute_model_async
output = await make_async(self.driver_worker.execute_model
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model
output = self.model_runner.execute_model(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/multi_step_model_runner.py", line 458, in execute_model
outputs = self._final_process_outputs(model_input,
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/multi_step_model_runner.py", line 312, in _final_process_outputs
output.pythonize(model_input, self._copy_stream,
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/multi_step_model_runner.py", line 87, in pythonize
self._pythonize_sampler_output(input_metadata, copy_stream,
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/multi_step_model_runner.py", line 115, in _pythonize_sampler_output
self.sampler_output_ready_event.synchronize()
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/streams.py", line 225, in synchronize
super().synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
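
As the error text itself suggests, making CUDA kernel launches synchronous gives a stack trace that points at the kernel that actually faulted rather than a later API call. A debugging-only sketch (this slows inference considerably and must run before CUDA is initialized in the process):

import os

# Set before importing/initializing vLLM or torch CUDA in this process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # synchronous launches: errors surface at the real call site
# Note: TORCH_USE_CUDA_DSA only helps if PyTorch was compiled with device-side
# assertions enabled, which prebuilt wheels typically are not.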


zoltan-fedor commented Sep 27, 2024

Still happening with v0.6.2 (it crashed 4 times in 20 minutes):

│ ERROR 09-27 06:46:46 engine.py:157] RuntimeError: CUDA error: an illegal memory access was encountered │
│ ERROR 09-27 06:46:46 engine.py:157] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. │
│ ERROR 09-27 06:46:46 engine.py:157] For debugging consider passing CUDA_LAUNCH_BLOCKING=1 │
│ ERROR 09-27 06:46:46 engine.py:157] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions. │
│ ERROR 09-27 06:46:46 engine.py:157] │
│ ERROR 09-27 06:46:46 engine.py:157] │
│ ERROR 09-27 06:46:46 engine.py:157] The above exception was the direct cause of the following exception: │
│ ERROR 09-27 06:46:46 engine.py:157] │
│ ERROR 09-27 06:46:46 engine.py:157] Traceback (most recent call last): │
│ ERROR 09-27 06:46:46 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 155, in start │
│ ERROR 09-27 06:46:46 engine.py:157] self.run_engine_loop() │
│ ERROR 09-27 06:46:46 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 218, in run_engine_loop │
│ ERROR 09-27 06:46:46 engine.py:157] request_outputs = self.engine_step() │
│ ERROR 09-27 06:46:46 engine.py:157] ^^^^^^^^^^^^^^^^^^ │
│ ERROR 09-27 06:46:46 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 236, in engine_step │
│ ERROR 09-27 06:46:46 engine.py:157] raise e │
│ ERROR 09-27 06:46:46 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 227, in engine_step │
│ ERROR 09-27 06:46:46 engine.py:157] return self.engine.step() │
│ ERROR 09-27 06:46:46 engine.py:157] ^^^^^^^^^^^^^^^^^^ │
│ ERROR 09-27 06:46:46 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 1264, in step │
│ ERROR 09-27 06:46:46 engine.py:157] outputs = self.model_executor.execute_model( │
│ ERROR 09-27 06:46:46 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ │
│ ERROR 09-27 06:46:46 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_gpu_executor.py", line 332, in execute_model │
│ ERROR 09-27 06:46:46 engine.py:157] return super().execute_model(execute_model_req) │
│ ERROR 09-27 06:46:46 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ │
│ ERROR 09-27 06:46:46 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 78, in execute_model │
│ ERROR 09-27 06:46:46 engine.py:157] driver_outputs = self._driver_execute_model(execute_model_req) │
│ ERROR 09-27 06:46:46 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ │
│ ERROR 09-27 06:46:46 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_gpu_executor.py", line 325, in _driver_execute_model │
│ ERROR 09-27 06:46:46 engine.py:157] return self.driver_worker.execute_method("execute_model", │
│ ERROR 09-27 06:46:46 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ │
│ ERROR 09-27 06:46:46 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 465, in execute_method │
│ ERROR 09-27 06:46:46 engine.py:157] raise e │
│ ERROR 09-27 06:46:46 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 456, in execute_method │
│ ERROR 09-27 06:46:46 engine.py:157] return executor(*args, **kwargs) │
│ ERROR 09-27 06:46:46 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^ │
│ ERROR 09-27 06:46:46 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 327, in execute_model │
│ ERROR 09-27 06:46:46 engine.py:157] output = self.model_runner.execute_model( │
│ ERROR 09-27 06:46:46 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ │
│ ERROR 09-27 06:46:46 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context │
│ ERROR 09-27 06:46:46 engine.py:157] return func(*args, **kwargs) │
│ ERROR 09-27 06:46:46 engine.py:157] ^^^^^^^^^^^^^^^^^^^^^ │
│ ERROR 09-27 06:46:46 engine.py:157] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 146, in _wrapper │
│ ERROR 09-27 06:46:46 engine.py:157] raise type(err)(f"Error in model execution: " │
│ ERROR 09-27 06:46:46 engine.py:157] RuntimeError: Error in model execution: CUDA error: an illegal memory access was encountered │
│ ERROR 09-27 06:46:46 engine.py:157] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. │
│ ERROR 09-27 06:46:46 engine.py:157] For debugging consider passing CUDA_LAUNCH_BLOCKING=1 │
│ ERROR 09-27 06:46:46 engine.py:157] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions. │
│ ERROR 09-27 06:46:46 engine.py:157] │
│ INFO: Shutting down │
│ [2024-09-27 06:46:46,666 E 61 3295] logging.cc:115: Stack trace: │
│ /usr/local/lib/python3.12/dist-packages/ray/_raylet.so(+0x10d0c0a) [0x7f841273bc0a] ray::operator<<() │
│ /usr/local/lib/python3.12/dist-packages/ray/_raylet.so(+0x10d3e92) [0x7f841273ee92] ray::TerminateHandler() │
│ /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa37c) [0x7f856604f37c] │
│ /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3e7) [0x7f856604f3e7] │
│ /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa36f) [0x7f856604f36f] │
│ /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so(+0xe5ab35) [0x7f8518560b35] c10d::ProcessGroupNCCL::ncclCommWatchdog() │
│ /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7f856607bdf4] │
│ /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f856729b609] start_thread │
│ /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f85673d5353] __clone │
│ │
│ *** SIGABRT received at time=1727444806 on cpu 40 *** │
│ PC: @ 0x7f85672f900b (unknown) raise │
│ @ 0x7f85672f9090 3216 (unknown) │
│ @ 0x7f856604f37c (unknown) (unknown) │
│ @ 0x7f856604f090 (unknown) (unknown) │
│ [2024-09-27 06:46:46,668 E 61 3295] logging.cc:440: *** SIGABRT received at time=1727444806 on cpu 40 *** │
│ [2024-09-27 06:46:46,668 E 61 3295] logging.cc:440: PC: @ 0x7f85672f900b (unknown) raise │
│ [2024-09-27 06:46:46,668 E 61 3295] logging.cc:440: @ 0x7f85672f9090 3216 (unknown) │
│ [2024-09-27 06:46:46,668 E 61 3295] logging.cc:440: @ 0x7f856604f37c (unknown) (unknown) │
│ [2024-09-27 06:46:46,669 E 61 3295] logging.cc:440: @ 0x7f856604f090 (unknown) (unknown) │
│ Fatal Python error: Aborted │

And also with the following, from one of the Ray workers:
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/communication_op.py", line 11, in tensor_model_parallel_all_reduce │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] return get_tp_group().all_reduce(input_) │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 352, in all_reduce │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] torch.ops.vllm.inplace_all_reduce(input_, │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1061, in __call__ │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] return self._op(*args, **(kwargs or {})) │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] File "/usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py", line 236, in backend_impl │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] result = self._backend_fns[device_type](*args, **kwargs) │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ │
│ (RayWorkerWrapper pid=1037) │
│ (RayWorkerWrapper pid=1037) │
│ (RayWorkerWrapper pid=1037) │
│ (RayWorkerWrapper pid=1037) │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 108, in inplace_all_reduce │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] group._all_reduce(tensor) │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 382, in all_reduce │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] torch.distributed.all_reduce(input_, group=self.device_group) │
│ (RayWorkerWrapper pid=1037) │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] msg_dict = _get_msg_dict(func.__name__, *args, **kwargs) │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 54, in _get_msg_dict │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] "args": f"{args}, {kwargs}", │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] ^^^^^^ │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] File "/usr/local/lib/python3.12/dist-packages/torch/_tensor.py", line 463, in __repr__ │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] return torch._tensor_str._str(self, tensor_contents=tensor_contents) │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] File "/usr/local/lib/python3.12/dist-packages/torch/_tensor_str.py", line 698, in _str │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] return _str_intern(self, tensor_contents=tensor_contents) │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] File "/usr/local/lib/python3.12/dist-packages/torch/_tensor_str.py", line 618, in _str_intern │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] tensor_str = _tensor_str(self, indent) │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^^ │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] File "/usr/local/lib/python3.12/dist-packages/torch/_tensor_str.py", line 332, in _tensor_str │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] self = self.float() │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] ^^^^^^^^^^^^ │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] RuntimeError: CUDA error: an illegal memory access was encountered │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] For debugging consider passing CUDA_LAUNCH_BLOCKING=1 │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions. │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] The above exception was the direct cause of the following exception: │
│ (RayWorkerWrapper pid=1037) ERROR 09-27 06:46:55 worker_base.py:464] │
│ (RayWorkerWrapper pid=1033) │
│ (RayWorkerWrapper pid=1033) │
│ (RayWorkerWrapper pid=1033) │
│ (RayWorkerWrapper pid=1033) │
│ (RayWorkerWrapper pid=1033) │
│ (RayWorkerWrapper pid=1033) [rank3]:[W927 06:46:55.489408656 CUDAGuardImpl.h:119] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent) │
│ [2024-09-27 06:46:55,647 E 61 3291] logging.cc:115: Stack trace: │
│ /usr/local/lib/python3.12/dist-packages/ray/_raylet.so(+0x10d0c0a) [0x7ff0e47d2c0a] ray::operator<<() │
│ /usr/local/lib/python3.12/dist-packages/ray/_raylet.so(+0x10d3e92) [0x7ff0e47d5e92] ray::TerminateHandler() │
│ /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa37c) [0x7ff2380e637c] │
│ /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3e7) [0x7ff2380e63e7] │
│ /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa36f) [0x7ff2380e636f] │
│ /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so(+0xe5ab35) [0x7ff1ea5f7b35] c10d::ProcessGroupNCCL::ncclCommWatchdog() │
│ /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7ff238112df4] │
│ /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7ff239332609] start_thread │
│ /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7ff23946c353] __clone │
│ │
│ *** SIGABRT received at time=1727444815 on cpu 43 *** │
│ PC: @ 0x7ff23939000b (unknown) raise │
│ @ 0x7ff239390090 3216 (unknown) │
│ @ 0x7ff2380e637c (unknown) (unknown) │
│ INFO: Shutting down │
│ @ 0x7ff2380e6090 (unknown) (unknown) │
│ [2024-09-27 06:46:55,649 E 61 3291] logging.cc:440: *** SIGABRT received at time=1727444815 on cpu 43 *** │
│ [2024-09-27 06:46:55,649 E 61 3291] logging.cc:440: PC: @ 0x7ff23939000b (unknown) raise │
│ [2024-09-27 06:46:55,649 E 61 3291] logging.cc:440: @ 0x7ff239390090 3216 (unknown) │
│ [2024-09-27 06:46:55,649 E 61 3291] logging.cc:440: @ 0x7ff2380e637c (unknown) (unknown) │
│ [2024-09-27 06:46:55,650 E 61 3291] logging.cc:440: @ 0x7ff2380e6090 (unknown) (unknown) │
│ Fatal Python error: Aborted

@Clint-chan

I still hit this problem in version 0.6.2:
INFO 10-12 11:34:09 logger.py:36] Received request chat-44505254559d4a72ad36a008ebbfbbdf: prompt: '<|im_start|>system\n你是一个专业且精确的语言判断和翻译工具,你的任务是判断用户输入的字符串是什么语言,并将它翻译为英语,仅需要输出翻译后的结果,不需要描述你的思路或补充性说明等。保持简洁的描述。\n\n输入类型:字符串,可能是任何语言,也可能是几种语言的混合,也有可能为空 \n输出类型:用户输入的字符串转化为英文后的结果,并用一对连续的英文的大括号包裹。如果用户输入为空,那么输出空值。不需要加入任何前缀后缀或说明性语句,例如“以下是翻译结果”,“Below you are handling the string: ”等,直接输出用大括号包裹后的结果即可。如果你无法理解用户发送的内容,或者用户发送的内容是无意义的字符串,乱码等,你可以直接返回一个用一对大括号包裹的原始字符串。\n\n---\n\n示例输入1 \nThe Seitai Shinpo Acupuncture Foundation is a non-profit organization whose mission is to foster the development and training of expert teachers and proficient practitioners so that they may perpetuate the wisdom of Seitai Shinpo.\n\n示例输出1 \n{{The Seitai Shinpo Acupuncture Foundation is a non-profit organization whose mission is to foster the development and training of expert teachers and proficient practitioners so that they may perpetuate the wisdom of Seitai Shinpo.}}\n\n示例输入2 \n沙特阿拉伯,Mecca的酒店在线预订。良好的可用性和优惠。便宜和安全,在酒店支付,不收预订费。\n\n示例输出2 \n{{Online hotel booking in Mecca, Saudi Arabia. Good availability and discounts. Affordable and safe, pay at the hotel, no booking fees.}}\n\n---\n\n注意:不需要输出任何描述性语句或解释性说明,仅仅输出解析后的字符串即可。<|im_end|>\n<|im_start|>user\n下面你要处理的字符串:กิจกรรมการบริการอื่น ๆ ส่วนบุคคลซึ่งมิได้จัดประเภทไว้ในที่อื่น<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=7760, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [151644, 8948, 198, 56568, 101909, 99878, 100136, 108639, 109824, 104317, 33108, 105395, 102011, 3837, 103929, 88802, 20412, 104317, 20002, 31196, 9370, 66558, 102021, 102064, 90395, 44063, 99652, 105395, 17714, 104105, 3837, 99373, 85106, 66017, 105395, 104813, 59151, 3837, 104689, 53481, 103929, 104337, 57191, 104361, 33071, 66394, 49567, 1773, 100662, 110485, 9370, 53481, 3407, 334, 31196, 31905, 334, 5122, 66558, 3837, 104560, 99885, 102064, 3837, 74763, 104560, 108464, 102064, 9370, 105063, 3837, 74763, 102410, 50647, 2303, 334, 66017, 31905, 334, 5122, 20002, 31196, 9370, 66558, 106474, 105205, 104813, 59151, 90395, 11622, 103219, 104005, 9370, 105205, 104197, 100139, 17992, 108232, 1773, 62244, 20002, 31196, 50647, 3837, 100624, 66017, 34794, 25511, 1773, 104689, 101963, 99885, 24562, 103630, 33447, 103630, 57191, 66394, 33071, 72881, 99700, 3837, 77557, 2073, 114566, 105395, 59151, 33590, 2073, 38214, 498, 525, 11589, 279, 914, 25, 18987, 49567, 3837, 101041, 66017, 11622, 26288, 100139, 17992, 108232, 104813, 59151, 104180, 1773, 102056, 101068, 101128, 20002, 72017, 104597, 3837, 100631, 20002, 72017, 104597, 20412, 42192, 100240, 9370, 66558, 3837, 100397, 16476, 49567, 3837, 105048, 101041, 31526, 46944, 11622, 103219, 26288, 100139, 17992, 108232, 9370, 105966, 66558, 3407, 44364, 334, 19793, 26355, 31196, 16, 334, 2303, 785, 96520, 2143, 34449, 5368, 6381, 68462, 5007, 374, 264, 2477, 27826, 7321, 6693, 8954, 374, 311, 29987, 279, 4401, 323, 4862, 315, 6203, 13336, 323, 68265, 42095, 773, 429, 807, 1231, 21585, 6292, 279, 23389, 315, 96520, 2143, 34449, 5368, 382, 334, 19793, 26355, 66017, 16, 334, 2303, 2979, 785, 96520, 2143, 34449, 5368, 6381, 68462, 5007, 374, 264, 2477, 
27826, 7321, 6693, 8954, 374, 311, 29987, 279, 4401, 323, 4862, 315, 6203, 13336, 323, 68265, 42095, 773, 429, 807, 1231, 21585, 6292, 279, 23389, 315, 96520, 2143, 34449, 5368, 13, 47449, 334, 19793, 26355, 31196, 17, 334, 2303, 111662, 111946, 3837, 7823, 24441, 9370, 101078, 99107, 109545, 1773, 104205, 107769, 105178, 102289, 1773, 104698, 33108, 99464, 96050, 101078, 68262, 3837, 16530, 50009, 109545, 80268, 3407, 334, 19793, 26355, 66017, 17, 334, 2303, 2979, 19598, 9500, 21857, 304, 2157, 24441, 11, 17904, 23061, 13, 7684, 18048, 323, 31062, 13, 42506, 323, 6092, 11, 2291, 518, 279, 9500, 11, 902, 21857, 12436, 13, 47449, 44364, 60533, 5122, 104689, 66017, 99885, 53481, 33071, 72881, 99700, 57191, 104136, 33071, 66394, 3837, 102630, 66017, 106637, 104813, 66558, 104180, 1773, 151645, 198, 151644, 872, 198, 100431, 105182, 54542, 9370, 66558, 5122, 25200, 30785, 60416, 124701, 93874, 125331, 30785, 93874, 22929, 64684, 20184, 128630, 129328, 124659, 36142, 47642, 40327, 124358, 123885, 123883, 18625, 30434, 26283, 30785, 127196, 19841, 60416, 124090, 132814, 125497, 19841, 124202, 35884, 47171, 22929, 64684, 20184, 151645, 198, 151644, 77091, 198], lora_request: None, prompt_adapter_request: None.
INFO: 116.247.118.146:42270 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 10-12 11:34:10 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241012-113410.pkl...
WARNING 10-12 11:34:10 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered
WARNING 10-12 11:34:10 model_runner_base.py:143] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
WARNING 10-12 11:34:10 model_runner_base.py:143] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
WARNING 10-12 11:34:10 model_runner_base.py:143] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
WARNING 10-12 11:34:10 model_runner_base.py:143]
[rank0]:[E1012 11:34:10.718322889 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f7fc4a4cf86 in /raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f7fc49fbd10 in /raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f7fc4b27f08 in /raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f7fc5d443e6 in /raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f7fc5d49600 in /raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f7fc5d502ba in /raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f7fc5d526fc in /raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7f8013500bf4 in /raid/demo/anaconda3/envs/vllm_latest/bin/../lib/libstdc++.so.6)
frame #8: + 0x8609 (0x7f8014f19609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f8014ce4353 in /lib/x86_64-linux-gnu/libc.so.6)

INFO: 61.171.72.231:17915 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO: 61.171.72.231:43655 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO: 61.171.72.231:32518 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO: 61.171.72.231:54509 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO: 61.171.72.231:32519 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/responses.py", line 257, in call
await wrap(partial(self.listen_for_disconnect, receive))
File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/responses.py", line 253, in wrap
await func()
File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/responses.py", line 230, in listen_for_disconnect
message = await receive()
^^^^^^^^^^^^^^^
File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/uvicorn/protocols/http/httptools_impl.py", line 555, in receive
await self.message_event.wait()
File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/asyncio/locks.py", line 212, in wait
await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f83c7304a40

During handling of the above exception, another exception occurred:

  • Exception Group Traceback (most recent call last):
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    | result = await app( # type: ignore[func-returns-value]
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in call
    | return await self.app(scope, receive, send)
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/fastapi/applications.py", line 1054, in call
    | await super().call(scope, receive, send)
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/applications.py", line 113, in call
    | await self.middleware_stack(scope, receive, send)
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/middleware/errors.py", line 187, in call
    | raise exc
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/middleware/errors.py", line 165, in call
    | await self.app(scope, receive, _send)
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/middleware/cors.py", line 85, in call
    | await self.app(scope, receive, send)
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/middleware/exceptions.py", line 62, in call
    | await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    | raise exc
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    | await app(scope, receive, sender)
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/routing.py", line 715, in call
    | await self.middleware_stack(scope, receive, send)
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/routing.py", line 735, in app
    | await route.handle(scope, receive, send)
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/routing.py", line 288, in handle
    | await self.app(scope, receive, send)
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/routing.py", line 76, in app
    | await wrap_app_handling_exceptions(app, request)(scope, receive, send)
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    | raise exc
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    | await app(scope, receive, sender)
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/routing.py", line 74, in app
    | await response(scope, receive, send)
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/responses.py", line 250, in call
    | async with anyio.create_task_group() as task_group:
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 736, in aexit
    | raise BaseExceptionGroup(
    | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
    +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/responses.py", line 253, in wrap
    | await func()
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/responses.py", line 242, in stream_response
    | async for chunk in self.body_iterator:
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/vllm/entrypoints/openai/serving_chat.py", line 309, in chat_completion_stream_generator
    | async for res in result_generator:
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/vllm/utils.py", line 452, in iterate_with_cancellation
    | item = await awaits[0]
    | ^^^^^^^^^^^^^^^
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/vllm/engine/multiprocessing/client.py", line 486, in _process_request
    | raise request_output
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/vllm/engine/multiprocessing/client.py", line 486, in _process_request
    | raise request_output
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/vllm/engine/multiprocessing/client.py", line 486, in _process_request
    | raise request_output
    | [Previous line repeated 11 more times]
    | RuntimeError: Error in model execution: CUDA error: an illegal memory access was encountered
    | CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    | For debugging consider passing CUDA_LAUNCH_BLOCKING=1
    | Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
    |
    +------------------------------------
    ERROR: Exception in ASGI application
    Traceback (most recent call last):
    File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/responses.py", line 257, in call
    await wrap(partial(self.listen_for_disconnect, receive))
    File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/responses.py", line 253, in wrap
    await func()
    File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/responses.py", line 230, in listen_for_disconnect
    message = await receive()
    ^^^^^^^^^^^^^^^
    File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/uvicorn/protocols/http/httptools_impl.py", line 555, in receive
    await self.message_event.wait()
    File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/asyncio/locks.py", line 212, in wait
    await fut
    asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f841c12a6c0

During handling of the above exception, another exception occurred:

  • Exception Group Traceback (most recent call last):
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    | result = await app( # type: ignore[func-returns-value]
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in call
    | return await self.app(scope, receive, send)
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/fastapi/applications.py", line 1054, in call
    | await super().call(scope, receive, send)
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/applications.py", line 113, in call
    | await self.middleware_stack(scope, receive, send)
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/middleware/errors.py", line 187, in call
    | raise exc
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/middleware/errors.py", line 165, in call
    | await self.app(scope, receive, _send)
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/middleware/cors.py", line 85, in call
    | await self.app(scope, receive, send)
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/middleware/exceptions.py", line 62, in call
    | await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    | raise exc
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    | await app(scope, receive, sender)
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/routing.py", line 715, in call
    | await self.middleware_stack(scope, receive, send)
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/routing.py", line 735, in app
    | await route.handle(scope, receive, send)
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/routing.py", line 288, in handle
    | await self.app(scope, receive, send)
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/routing.py", line 76, in app
    | await wrap_app_handling_exceptions(app, request)(scope, receive, send)
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    | raise exc
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    | await app(scope, receive, sender)
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/routing.py", line 74, in app
    | await response(scope, receive, send)
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/starlette/responses.py", line 250, in call
    | async with anyio.create_task_group() as task_group:
    | File "/raid/demo/anaconda3/envs/vllm_latest/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 736, in aexit
    | raise BaseExceptionGroup(
    | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
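
One more debugging note on the log above: vLLM reported writing the failing batch to /tmp/err_execute_model_input_20241012-113410.pkl before crashing (in this particular run the dump itself then failed because the CUDA context was already corrupted). When such a dump does get written, a minimal sketch for inspecting it is simply to unpickle it in an environment with vLLM installed; the exact object layout varies across vLLM versions, so treat the details as assumptions:

import pickle

# Path taken from the log above; vLLM must be importable so its classes unpickle.
with open("/tmp/err_execute_model_input_20241012-113410.pkl", "rb") as f:
    dumped_input = pickle.load(f)

# Printing the object (or vars(dumped_input)) shows the batch that was about to
# be executed: sequence lengths, block tables, sampling parameters, etc.
print(type(dumped_input))
print(dumped_input)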
