Description
Your current environment
Hardware: A800
Driver Version: 535.54.03, CUDA Version: 12.2
vLLM commit: d3a2451
Model: Qwen-72B
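For reference, the engine is driven through a custom streaming wrapper (wrapper_vllm.py in the traceback below). The sketch here is a minimal approximation of that call pattern, not the actual wrapper code: the model path is hypothetical, and tensor_parallel_size=4 is only inferred from the four ranks visible in the log.

```python
# Minimal sketch only -- not the actual wrapper_vllm.py code. The model path and
# tensor_parallel_size are assumptions; 4 ranks (rank 0 plus worker pids 198-200)
# are visible in the log below.
import asyncio

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="Qwen/Qwen-72B-Chat", tensor_parallel_size=4)
)

async def stream(prompt: str, request_id: str) -> None:
    # Mirrors the `async for request_output in results_generator` frame in the
    # traceback below; the engine timeout surfaces inside this generator.
    async for request_output in engine.generate(
        prompt, SamplingParams(max_tokens=256), request_id
    ):
        print(request_output.outputs[0].text)

asyncio.run(stream("Hello", "demo-request-0"))
```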
Model Input Dumps
No response
🐛 Describe the bug
While streaming output for a request against Qwen-72B served with tensor parallelism, the async engine loop times out ("Engine iteration timed out. This should never happen!"), the background task dies with AsyncEngineDeadError, and the in-flight request is aborted. The worker processes then repeatedly log "No available block found in 60 second." for roughly ten minutes, until the NCCL watchdog times out on a GATHER collective (600 s), tears the process group down, and one worker dies with exit code -6. Full log:
INFO 10-30 11:46:47 async_llm_engine.py:173] Added request 541ca4832eb9436180e721ef069baedb.
ERROR 10-30 11:47:32 async_llm_engine.py:656] Engine iteration timed out. This should never happen!
ERROR 10-30 11:47:32 async_llm_engine.py:56] Engine background task failed
ERROR 10-30 11:47:32 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 10-30 11:47:32 async_llm_engine.py:56] File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 629, in run_engine_loop
ERROR 10-30 11:47:32 async_llm_engine.py:56] done, _ = await asyncio.wait(
ERROR 10-30 11:47:32 async_llm_engine.py:56] File "/usr/local/lib/python3.9/asyncio/tasks.py", line 413, in wait
ERROR 10-30 11:47:32 async_llm_engine.py:56] return await _wait(fs, timeout, return_when, loop)
ERROR 10-30 11:47:32 async_llm_engine.py:56] File "/usr/local/lib/python3.9/asyncio/tasks.py", line 525, in _wait
ERROR 10-30 11:47:32 async_llm_engine.py:56] await waiter
ERROR 10-30 11:47:32 async_llm_engine.py:56] asyncio.exceptions.CancelledError
ERROR 10-30 11:47:32 async_llm_engine.py:56]
ERROR 10-30 11:47:32 async_llm_engine.py:56] During handling of the above exception, another exception occurred:
ERROR 10-30 11:47:32 async_llm_engine.py:56]
ERROR 10-30 11:47:32 async_llm_engine.py:56] Traceback (most recent call last):
ERROR 10-30 11:47:32 async_llm_engine.py:56] File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
ERROR 10-30 11:47:32 async_llm_engine.py:56] return_value = task.result()
ERROR 10-30 11:47:32 async_llm_engine.py:56] File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 633, in run_engine_loop
ERROR 10-30 11:47:32 async_llm_engine.py:56] await asyncio.sleep(0)
ERROR 10-30 11:47:32 async_llm_engine.py:56] File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_timeout.py", line 95, in __aexit__
ERROR 10-30 11:47:32 async_llm_engine.py:56] self._do_exit(exc_type)
ERROR 10-30 11:47:32 async_llm_engine.py:56] File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 10-30 11:47:32 async_llm_engine.py:56] raise asyncio.TimeoutError
ERROR 10-30 11:47:32 async_llm_engine.py:56] asyncio.exceptions.TimeoutError
2024-10-30 11:47:32,282 - asyncio:default_exception_handler:1753 - ERROR: Exception in callback _log_task_completion(error_callback=>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py:36
handle: <Handle _log_task_completion(error_callback=>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py:36>
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 629, in run_engine_loop
done, _ = await asyncio.wait(
File "/usr/local/lib/python3.9/asyncio/tasks.py", line 413, in wait
return await _wait(fs, timeout, return_when, loop)
File "/usr/local/lib/python3.9/asyncio/tasks.py", line 525, in _wait
await waiter
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
return_value = task.result()
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 633, in run_engine_loop
await asyncio.sleep(0)
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_timeout.py", line 95, in aexit
self._do_exit(exc_type)
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 58, in _log_task_completion
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 10-30 11:47:32 async_llm_engine.py:180] Aborted request 6838fbb7076948a7a1f8071d4095c740.
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 629, in run_engine_loop
done, _ = await asyncio.wait(
File "/usr/local/lib/python3.9/asyncio/tasks.py", line 413, in wait
return await _wait(fs, timeout, return_when, loop)
File "/usr/local/lib/python3.9/asyncio/tasks.py", line 525, in _wait
await waiter
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/ailab/inference_wrapper/huggingface/lora/nlp/wrapper_vllm.py", line 621, in _process_stream_infence
async for request_output in results_generator:
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 770, in generate
async for output in self._process_request(
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 886, in _process_request
raise e
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 882, in _process_request
async for request_output in stream:
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 93, in anext
raise result
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 46, in _log_task_completion
return_value = task.result()
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 633, in run_engine_loop
await asyncio.sleep(0)
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_timeout.py", line 95, in aexit
self._do_exit(exc_type)
File "/usr/local/lib/python3.9/site-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
2024-10-30 11:47:32,282 - wrapper:_process_stream_infence:645 - ERROR: streaming inference exception, 6838fbb7076948a7a1f8071d4095c740
(VllmWorkerProcess pid=198) WARNING 10-30 11:47:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:47:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:47:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:48:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:48:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:48:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:49:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:49:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:49:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:50:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:50:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:50:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:51:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:51:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:51:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:52:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:52:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:52:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:53:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:53:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:53:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:54:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:54:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:54:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:55:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:55:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:55:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=199) WARNING 10-30 11:56:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=198) WARNING 10-30 11:56:32 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=200) WARNING 10-30 11:56:32 shm_broadcast.py:386] No available block found in 60 second.
[rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600053 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600054 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 2] Timeout at NCCL work: 12471541, last enqueued NCCL work: 12471541, last completed NCCL work: 12471540.
[rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600053 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3095e6dc62 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3095e72a80 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3095e73dcc in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #5: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600054 milliseconds before timing out.
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 2 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600053 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3095e6dc62 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3095e72a80 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3095e73dcc in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #5: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: + 0xe32119 (0x7f3095af7119 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #3: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)
[rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 3] Timeout at NCCL work: 12471541, last enqueued NCCL work: 12471541, last completed NCCL work: 12471540.
[rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600054 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3095e6dc62 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3095e72a80 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3095e73dcc in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #5: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)
[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 1] Timeout at NCCL work: 12471541, last enqueued NCCL work: 12471541, last completed NCCL work: 12471540.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600054 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3095e6dc62 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3095e72a80 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3095e73dcc in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #5: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 2 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600054 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3095e6dc62 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3095e72a80 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3095e73dcc in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #5: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: + 0xe32119 (0x7f3095af7119 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #3: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 2 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12471541, OpType=GATHER, NumelIn=1520640, NumelOut=0, Timeout(ms)=600000) ran for 600054 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f3095e6dc62 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f3095e72a80 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f3095e73dcc in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #5: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f318fecf897 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: + 0xe32119 (0x7f3095af7119 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x245c0 (0x7f3226ad05c0 in /home/aiges/library/libuds.so)
frame #3: + 0x94ac3 (0x7f322406bac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7f32240fca04 in /lib/x86_64-linux-gnu/libc.so.6)
ERROR 10-30 11:56:48 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 198 died, exit code: -6
INFO 10-30 11:56:48 multiproc_worker_utils.py:123] Killing local vLLM worker processes
[rank0]:[E ProcessGroupNCCL.cpp:1316] [PG 2 Rank 0] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=9
[rank0]:[E ProcessGroupNCCL.cpp:1153] [PG 2 Rank 0] ProcessGroupNCCL preparing to dump debug info.
[rank0]:[F ProcessGroupNCCL.cpp:1169] [PG 2 Rank 0] [PG 2 Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 9
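As a stopgap (not a fix for the stuck GATHER collective), the watchdog message above names two environment variables that control the NCCL heartbeat monitor. Below is a hedged sketch of setting them before the engine and its workers start; the values are illustrative, and the VLLM_ENGINE_ITERATION_TIMEOUT_S name is my assumption about the knob behind the "Engine iteration timed out" error, so it should be verified against vllm/envs.py at this commit.

```python
# Workaround sketch based on the env vars quoted in the watchdog log above.
# Must be set before torch.distributed / the vLLM workers are initialized.
import os

# Either give the NCCL heartbeat monitor more time...
os.environ["TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC"] = "1800"   # illustrative value
# ...or disable the monitor entirely (alternative, per the log message):
# os.environ["TORCH_NCCL_ENABLE_MONITORING"] = "0"

# Assumed knob for the async engine's own iteration timeout (default 60 s);
# name should be double-checked in vllm/envs.py for commit d3a2451.
os.environ["VLLM_ENGINE_ITERATION_TIMEOUT_S"] = "180"
```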