Description
Your current environment
Hardware: A800
Driver Version: 535.54.03, CUDA Version: 12.2
vLLM commit: d3a2451
Model: Qwen-72B
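For reference, the engine is driven through a custom streaming wrapper (wrapper_vllm.py in the traceback below). The sketch here is a minimal approximation of that call pattern, not the actual wrapper code: the model path is hypothetical, and tensor_parallel_size=4 is only inferred from the four ranks visible in the log.

```python
# Minimal sketch only -- not the actual wrapper_vllm.py code. The model path and
# tensor_parallel_size are assumptions; 4 ranks (rank 0 plus worker pids 198-200)
# are visible in the log below.
import asyncio

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="Qwen/Qwen-72B-Chat", tensor_parallel_size=4)
)

async def stream(prompt: str, request_id: str) -> None:
    # Mirrors the `async for request_output in results_generator` frame in the
    # traceback below; the engine timeout surfaces inside this generator.
    async for request_output in engine.generate(
        prompt, SamplingParams(max_tokens=256), request_id
    ):
        print(request_output.outputs[0].text)

asyncio.run(stream("Hello", "demo-request-0"))
```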
Model Input Dumps
No response
🐛 Describe the bug
While streaming output for a request against Qwen-72B served with tensor parallelism, the async engine loop times out ("Engine iteration timed out. This should never happen!"), the background task dies with AsyncEngineDeadError, and the in-flight request is aborted. The worker processes then repeatedly log "No available block found in 60 second." for roughly ten minutes, until the NCCL watchdog times out on a GATHER collective (600 s), tears the process group down, and one worker dies with exit code -6. Full log:
INFO 10-30 11:46:47 async_llm_engine.py:173] Added request 541ca4832eb9436180e721ef069baedb.
ERROR 10-30 11:47:32 async_llm_engine.py:656] Engine iteration timed out. This should never happen!
ERROR 10-30 11:47:32 async_llm_engine.py:56] Engine background task failed
ERROR 10-30 11:47:32 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 10-30 11:47:32 async_llm_engine.py:56] File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 629, in run_engine_loop
ERROR 10-30 11:47:32 async_llm_engine.py:56] done, _ = await asyncio.wait(
ERROR 10-30 11:47:32 async_llm_engine.py:56] File "/usr/local/lib/python3.9/asyncio/tasks.py", line 413, in wait
ERROR 10-30 11:47:32 async_llm_engine.py:56] return await _wait(fs, timeout, return_when, loop)
ERROR 10-30 11:47:32 async_llm_engine.py:56] File "/usr/local/lib/python3.9/asyncio/tasks.py", line 525, in _wait
ERROR 10-30 11:47:32 async_llm_engine.py:56] await waiter
ERROR 10-30 11:47:32 async_llm_engine.py:56] asyncio.exceptions.CancelledError
ERROR 10-30 11:47:32 async_llm_engine.py:56]
ERROR 10-30 11:47:32 async_llm_engine.py:56] During handling of the above exception, another exception occurred:
ERROR 10-30 11:47:32 async_llm_engine.py:56]
ERROR 10-30 11:47:32 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 10-30 11:47:32 async_llm_engine.py:56] File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
ERROR 10-30 11:47:32 async_llm_engine.py:56] return_value = task.result()
ERROR 10-30 11:47:32 async_llm_engine.py:56] File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 633, in run_engine_loop
ERROR 10-30 11:47:32 async_llm_engine.py:56] await asyncio.sleep(0)
ERROR 10-30 11:47:32 async_llm_engine.py:56] File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
ERROR 10-30 11:47:32 async_llm_engine.py:56] self._do_exit(exc_type)
ERROR 10-30 11:47:32 async_llm_engine.py:56] File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 10-30 11:47:32 async_llm_engine.py:56] raise asyncio.TimeoutError
ERROR 10-30 11:47:32 async_llm_engine.py:56] asyncio.exceptions.TimeoutError
2024-10-30 11:47:32,282 - asyncio:default_exception_handler:1753 - ERROR: Exception in callback _log_task_completion(error_callback=>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py:36
handle: <Handle _log_task_completion(error_callback=>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py:36>
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 629, in run_engine_loop
done, _ = await asyncio.wait(
File "/usr/local/lib/python3.9/asyncio/tasks.py", line 413, in wait
return await _wait(fs, timeout, return_when, loop)
File "/usr/local/lib/python3.9/asyncio/tasks.py", line 525, in _wait
await waiter
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
return_value = task.result()
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 633, in run_engine_loop
await asyncio.sleep(0)
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_timeout.py", line 95, in aexit
self._do_exit(exc_type)
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 58, in _log_task_completion
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 10-30 11:47:32 async_llm_engine.py:180] Aborted request 6838fbb7076948a7a1f8071d4095c740.
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 629, in run_engine_loop
done, _ = await asyncio.wait(
File "/usr/local/lib/python3.9/asyncio/tasks.py", line 413, in wait
return await _wait(fs, timeout, return_when, loop)
File "/usr/local/lib/python3.9/asyncio/tasks.py", line 525, in _wait
await waiter
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/ailab/inference_wrapper/huggingface/lora/nlp/wrapper_vllm.py", line 621, in _process_stream_infence
async for request_output in results_generator:
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 770, in generate
async for output in self._process_request(
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 886, in _process_request
raise e
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 882, in _process_request
async for request_output in stream:
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 93, in anext
raise result
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
return_value = task.result()
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 633, in run_engine_loop
await asyncio.sleep(0)
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_timeout.py", line 95, in aexit
self._do_exit(exc_type)
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
2024-10-30 11:47:32,282 - wrapper:_process_stream_infence:645 - ERROR: streaming inference exception, 6838fbb7076948a7a1f8071d4095c740
(VllmWorkerProcess pid=198) WARNING 10-30 11:47:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:47:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:47:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:48:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:48:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:48:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:49:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:49:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:49:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:50:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:50:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:50:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:51:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:51:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:51:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:52:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:52:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:52:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:53:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:53:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:53:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:54:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:54:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:54:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:55:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:55:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:55:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:56:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:56:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:56:32 shm_broadcast.py:386] No available block found in 60 second.
[rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600053 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600054 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 2] Timeout at NCCL work: 12471541, last enqueued NCCL work: 12471541, last completed NCCL work: 12471540.
[rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600053 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3095e6dc62 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3095e72a80 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3095e73dcc in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #5: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600054 milliseconds before timing out.
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 2 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600053 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3095e6dc62 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3095e72a80 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3095e73dcc in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #5: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: + 0xe32119 (0x7f3095af7119 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #3: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)
[rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 3] Timeout at NCCL work: 12471541, last enqueued NCCL work: 12471541, last completed NCCL work: 12471540.
[rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600054 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3095e6dc62 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3095e72a80 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3095e73dcc in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #5: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)
[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 1] Timeout at NCCL work: 12471541, last enqueued NCCL work: 12471541, last completed NCCL work: 12471540.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600054 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3095e6dc62 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3095e72a80 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3095e73dcc in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #5: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 2 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600054 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3095e6dc62 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3095e72a80 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3095e73dcc in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #5: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: + 0xe32119 (0x7f3095af7119 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #3: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 2 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600054 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3095e6dc62 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3095e72a80 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3095e73dcc in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #5: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: + 0xe32119 (0x7f3095af7119 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #3: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)
ERROR 10-30 11:56:48 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 198 died, exit code: -6
INFO 10-30 11:56:48 multiproc_worker_utils.py:123] Killing local vLLM worker processes
[rank0]:[E ProcessGroupNCCL.cpp:1316] [PG 2 Rank 0] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=9
[rank0]:[E ProcessGroupNCCL.cpp:1153] [PG 2 Rank 0] ProcessGroupNCCL preparing to dump debug info.
[rank0]:[F ProcessGroupNCCL.cpp:1169] [PG 2 Rank 0] [PG 2 Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 9
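As a stopgap (not a fix for the stuck GATHER collective), the watchdog message above names two environment variables that control the NCCL heartbeat monitor. Below is a hedged sketch of setting them before the engine and its workers start; the values are illustrative, and the VLLM_ENGINE_ITERATION_TIMEOUT_S name is my assumption about the knob behind the "Engine iteration timed out" error, so it should be verified against vllm/envs.py at this commit.

```python
# Workaround sketch based on the env vars quoted in the watchdog log above.
# Must be set before torch.distributed / the vLLM workers are initialized.
import os

# Either give the NCCL heartbeat monitor more time...
os.environ["TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC"] = "1800"   # illustrative value
# ...or disable the monitor entirely (alternative, per the log message):
# os.environ["TORCH_NCCL_ENABLE_MONITORING"] = "0"

# Assumed knob for the async engine's own iteration timeout (default 60 s);
# name should be double-checked in vllm/envs.py for commit d3a2451.
os.environ["VLLM_ENGINE_ITERATION_TIMEOUT_S"] = "180"
```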