Description
Your current environment
The output of `python collect_env.py`
vllm 0.5.4
🐛 Describe the bug
We are currently running inference on 8 × A800 GPUs, serving the 40B FP8 model with vLLM at tensor parallelism tp=8. Inference works at first, but after a few cases the inter-GPU communication breaks down and the engine hangs (NCCL collective timeout, see the log below).
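For reproduction context, only tensor_parallel_size=8 and the use of an FP8 checkpoint are taken from the setup above; the sketch below is a minimal offline approximation of the same engine configuration, with the checkpoint path, prompts, and sampling settings as placeholders rather than the actual values from the failing deployment (which runs through the OpenAI-compatible API server):

```python
# Hypothetical reproduction sketch. Only tensor_parallel_size=8 and the FP8
# checkpoint reflect the setup described above; the path, prompts, and
# sampling settings are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/40b-fp8-checkpoint",  # placeholder path
    tensor_parallel_size=8,
    # quantization="fp8" may additionally be needed depending on how the
    # checkpoint was exported.
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)

# The hang only shows up after a few requests, so issue several in sequence.
for i in range(8):
    out = llm.generate([f"test prompt {i}"], sampling)
    print(out[0].outputs[0].text)
```

In the failing deployment the same hang appears through the API server; the relevant server log follows.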
```
INFO 08-22 06:23:22 metrics.py:406] Avg prompt throughput: 36.2 tokens/s, Avg generation throughput: 4.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 08-22 06:23:27 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 08-22 06:23:32 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 08-22 06:23:37 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:23:42 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:23:57 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:24:07 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:24:17 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:24:27 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:24:37 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
(VllmWorkerProcess pid=161) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=158) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=160) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=164) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=163) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=162) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=159) WARNING 08-22 06:24:42 shm_broadcast.py:386] No available block found in 60 second.
ERROR 08-22 06:24:42 async_llm_engine.py:663] Engine iteration timed out. This should never happen!
ERROR 08-22 06:24:42 async_llm_engine.py:57] Engine background task failed
ERROR 08-22 06:24:42 async_llm_engine.py:57] Traceback (most recent call last):
ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 636, in run_engine_loop
ERROR 08-22 06:24:42 async_llm_engine.py:57] done, _ = await asyncio.wait(
ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
ERROR 08-22 06:24:42 async_llm_engine.py:57] return await _wait(fs, timeout, return_when, loop)
ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
ERROR 08-22 06:24:42 async_llm_engine.py:57] await waiter
ERROR 08-22 06:24:42 async_llm_engine.py:57] asyncio.exceptions.CancelledError
ERROR 08-22 06:24:42 async_llm_engine.py:57]
ERROR 08-22 06:24:42 async_llm_engine.py:57] During handling of the above exception, another exception occurred:
ERROR 08-22 06:24:42 async_llm_engine.py:57]
ERROR 08-22 06:24:42 async_llm_engine.py:57] Traceback (most recent call last):
ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion
ERROR 08-22 06:24:42 async_llm_engine.py:57] return_value = task.result()
ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop
ERROR 08-22 06:24:42 async_llm_engine.py:57] async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 95, in aexit
ERROR 08-22 06:24:42 async_llm_engine.py:57] self._do_exit(exc_type)
ERROR 08-22 06:24:42 async_llm_engine.py:57] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
ERROR 08-22 06:24:42 async_llm_engine.py:57] raise asyncio.TimeoutError
ERROR 08-22 06:24:42 async_llm_engine.py:57] asyncio.exceptions.TimeoutError
ERROR:asyncio:Exception in callback _log_task_completion(error_callback=>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:37
handle: <Handle _log_task_completion(error_callback=>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py:37>
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 636, in run_engine_loop
done, _ = await asyncio.wait(
File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
return await _wait(fs, timeout, return_when, loop)
File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
await waiter
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion
return_value = task.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 635, in run_engine_loop
async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 95, in aexit
self._do_exit(exc_type)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_timeout.py", line 178, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 59, in _log_task_completion
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 08-22 06:24:42 async_llm_engine.py:181] Aborted request chat-0d19f933a5304a3890a8ffa2c2bc69e9.
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in call
await wrap(partial(self.listen_for_disconnect, receive))
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
await func()
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect
message = await receive()
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 553, in receive
await self.message_event.wait()
File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f45dbd9bf10
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in call
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in call
await super().call(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in call
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in call
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in call
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in call
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 75, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 258, in call
async with anyio.create_task_group() as task_group:
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 680, in aexit
raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
INFO 08-22 06:24:47 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:24:57 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:25:07 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:25:17 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:25:27 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:25:37 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
(VllmWorkerProcess pid=162) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=163) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=164) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=160) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=161) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=158) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=159) WARNING 08-22 06:25:42 shm_broadcast.py:386] No available block found in 60 second.
INFO 08-22 06:25:47 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:25:57 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:26:07 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:26:17 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:26:27 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:26:37 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
(VllmWorkerProcess pid=158) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=163) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=162) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=161) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=160) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=164) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=159) WARNING 08-22 06:26:42 shm_broadcast.py:386] No available block found in 60 second.
INFO 08-22 06:26:47 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:26:57 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:27:07 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:27:17 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:27:27 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 08-22 06:27:37 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
(VllmWorkerProcess pid=163) WARNING 08-22 06:27:42 shm_broadcast.py:386] No available block found in 60 second.
(VllmWorkerProcess pid=162) WARNING 08-22 06:27:42 shm_broadcast.py:386] No available block found in 60 second.
[rank1]:[E822 06:33:42.457707445 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4554, OpType=GATHER, NumelIn=16032, NumelOut=0, Timeout(ms)=600000) ran for 600073 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f468ed77f86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4640d588d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f4640d5f313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4640d616fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7f468e4b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f468fbdcac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f468fc6dbf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 3 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4554, OpType=GATHER, NumelIn=16032, NumelOut=0, Timeout(ms)=600000) ran for 600073 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f468ed77f86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4640d588d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f4640d5f313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4640d616fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7f468e4b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f468fbdcac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f468fc6dbf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f468ed77f86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0xe5aa84 (0x7f46409eaa84 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xdc253 (0x7f468e4b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x94ac3 (0x7f468fbdcac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7f468fc6dbf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[rank2]:[E822 06:33:42.472676771 ProcessGroupNCCL.cpp:607] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4554, OpType=GATHER, NumelIn=16032, NumelOut=0, Timeout(ms)=600000) ran for 600089 milliseconds before timing out.
[rank7]:[E822 06:33:42.472710001 ProcessGroupNCCL.cpp:607] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4554, OpType=GATHER, NumelIn=16032, NumelOut=0, Timeout(ms)=600000) ran for 600089 milliseconds before timing out.
[rank2]:[E822 06:33:42.472732829 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 2] Exception (either an error or timeout) detected by watchdog at work: 4554, last enqueued NCCL work: 4554, last completed NCCL work: 4553.
[rank2]:[E822 06:33:42.472743993 ProcessGroupNCCL.cpp:1709] [PG 3 Rank 2] Timeout at NCCL work: 4554, last enqueued NCCL work: 4554, last completed NCCL work: 4553.
[rank2]:[E822 06:33:42.472748107 ProcessGroupNCCL.cpp:621] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E822 06:33:42.472750861 ProcessGroupNCCL.cpp:627] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank7]:[E822 06:33:42.472758671 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 7] Exception (either an error or timeout) detected by watchdog at work: 4554, last enqueued NCCL work: 4554, last completed NCCL work: 4553.
[rank7]:[E822 06:33:42.472765960 ProcessGroupNCCL.cpp:1709] [PG 3 Rank 7] Timeout at NCCL work: 4554, last enqueued NCCL work: 4554, last completed NCCL work: 4553.
[rank7]:[E822 06:33:42.472770740 ProcessGroupNCCL.cpp:621] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank7]:[E822 06:33:42.472775007 ProcessGroupNCCL.cpp:627] [Rank 7] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E822 06:33:42.473328595 ProcessGroupNCCL.cpp:607] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4554, OpType=GATHER, NumelIn=16032, NumelOut=0, Timeout(ms)=600000) ran for 600090 milliseconds before timing out.
[rank6]:[E822 06:33:42.473378363 ProcessGroupNCCL.cpp:1664] [PG 3 Rank 6] Exception (either an error or timeout) detected by watchdog at work: 4554, last enqueued NCCL work: 4554, last completed NCCL work: 4553.
[rank6]:[E822 06:33:42.473386694 ProcessGroupNCCL.cpp:1709] [PG 3 Rank 6] Timeout at NCCL work: 4554, last enqueued NCCL work: 4554, last completed NCCL work: 4553.
[rank6]:[E822 06:33:42.473392166 ProcessGroupNCCL.cpp:621] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E822 06:33:42.473396865 ProcessGroupNCCL.cpp:627] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
[rank7]:[E822 06:33:42.473871273 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4554, OpType=GATHER, NumelIn=16032, NumelOut=0, Timeout(ms)=600000) ran for 600089 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f468ed77f86 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4640d588d2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f4640d5f313 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4640d616fc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7f468e4b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f468fbdcac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f468fc6dbf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
```