An error "Operation timed out!" is raised during GRPO training, as follows:
```
Exception ignored in: <finalize object at 0x7f7d7b9bc920; dead>
Traceback (most recent call last):
File "/root/miniconda3/envs/open-r1/lib/python3.12/weakref.py", line 585, in __call__
def __call__(self, _=None):
File "/root/miniconda3/envs/open-r1/lib/python3.12/site-packages/math_verify/utils.py", line 43, in handler
raise TimeoutError("Operation timed out!")
TimeoutError: Operation timed out!
[rank2]:[E516 06:09:16.295666043 ProcessGroupNCCL.cpp:629] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=49296, OpType=_ALLGATHER_BASE, NumelIn=192, NumelOut=576, Timeout(ms)=1800000) ran for 1800076 milliseconds before timing out.
[rank1]:[E516 06:09:16.296189442 ProcessGroupNCCL.cpp:629] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=49296, OpType=_ALLGATHER_BASE, NumelIn=192, NumelOut=576, Timeout(ms)=1800000) ran for 1800030 milliseconds before timing out.
[rank1]:[E516 06:09:16.297295298 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 1] failure detected by watchdog at work sequence id: 49296 PG status: last enqueued work: 49296, last completed work: 49295
[rank1]:[E516 06:09:16.297336852 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank2]:[E516 06:09:16.297331820 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 2] failure detected by watchdog at work sequence id: 49296 PG status: last enqueued work: 49296, last completed work: 49295
[rank2]:[E516 06:09:16.297375817 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank2]:[E516 06:09:16.678756367 ProcessGroupNCCL.cpp:681] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E516 06:09:16.678781139 ProcessGroupNCCL.cpp:695] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E516 06:09:16.682309480 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=49296, OpType=_ALLGATHER_BASE, NumelIn=192, NumelOut=576, Timeout(ms)=1800000) ran for 1800076 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f602a4581b6 in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f602b7a1c74 in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f602b7a37d0 in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f602b7a46ed in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f60748275c0 in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f609167f609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f609144a133 in /lib/x86_64-linux-gnu/libc.so.6)
[rank1]:[E516 06:09:17.984764325 ProcessGroupNCCL.cpp:681] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E516 06:09:17.984795568 ProcessGroupNCCL.cpp:695] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E516 06:09:17.987988525 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=49296, OpType=_ALLGATHER_BASE, NumelIn=192, NumelOut=576, Timeout(ms)=1800000) ran for 1800030 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f75077cb1b6 in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f7508b14c74 in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f7508b167d0 in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f7508b176ed in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f7551b9a5c0 in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f756e9f5609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f756e7c0133 in /lib/x86_64-linux-gnu/libc.so.6)
W0516 06:09:17.399000 3593512 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3593730 closing signal SIGTERM
W0516 06:09:17.402000 3593512 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3593731 closing signal SIGTERM
E0516 06:09:17.718000 3593512 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 2 (pid: 3593732) of binary: /root/miniconda3/envs/open-r1/bin/python3.12
Traceback (most recent call last):
File "/root/miniconda3/envs/open-r1/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/root/miniconda3/envs/open-r1/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/root/miniconda3/envs/open-r1/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1182, in launch_command
deepspeed_launcher(args)
File "/root/miniconda3/envs/open-r1/lib/python3.12/site-packages/accelerate/commands/launch.py", line 861, in deepspeed_launcher
distrib_run.run(args)
File "/root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
src/open_r1/grpo.py FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-05-16_06:09:17
host : 355be013b6cd
rank : 2 (local_rank: 2)
exitcode : -6 (pid: 3593732)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 3593732
========================================================
```
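For reference, the 1800000 ms watchdog limit in the log matches the default ddp_timeout of 1800 seconds in transformers.TrainingArguments, which TRL's GRPOConfig inherits, so one rank stalling in reward computation for over 30 minutes is enough to kill the _ALLGATHER_BASE on the others. A minimal sketch of widening that window (assuming the stock open-r1 GRPO setup; this only papers over the stall, it does not remove it):

```python
# Sketch only: raise the collective timeout so a slow reward pass on one rank
# does not trip the NCCL watchdog while the other ranks wait in
# _ALLGATHER_BASE. GRPOConfig inherits ddp_timeout from
# transformers.TrainingArguments; its default of 1800 s is exactly the
# Timeout(ms)=1800000 reported above.
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="grpo-run",  # hypothetical output path
    ddp_timeout=7200,       # seconds; widens the 30-minute watchdog window to 2 hours
)
```

Enabling FlightRecorder by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value before launching, as the watchdog message itself suggests, should at least capture the stack of the failed collective on the next occurrence.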
The crash comes roughly thirty minutes after the trl vllm-serve process last reported "Successfully reset prefix cache":
```
INFO: 127.0.0.1:55546 - "POST /update_named_param/ HTTP/1.1" 200 OK
INFO 05-16 05:38:52 [block_pool.py:264] Successfully reset prefix cache
INFO: 127.0.0.1:55546 - "POST /reset_prefix_cache/ HTTP/1.1" 200 OK
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████| 192/192 [00:20<00:00, 9.52it/s, est. speed input: 1573.88 toks/s, output: 10739.85 toks/s]
INFO: 127.0.0.1:55546 - "POST /generate/ HTTP/1.1" 200 OK
INFO: 127.0.0.1:34192 - "GET /health/ HTTP/1.1" 200 OK
INFO: 127.0.0.1:34200 - "GET /get_world_size/ HTTP/1.1" 200 OK
INFO: 127.0.0.1:34214 - "POST /init_communicator/ HTTP/1.1" 200 OK
```
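The initial TimeoutError comes from math_verify's signal-based timeout handler (math_verify/utils.py), and the "Exception ignored in: <finalize object ...>" line suggests the alarm fired inside a weakref finalizer rather than in the reward function proper. A defensive wrapper along the lines below (accuracy_reward is a hypothetical stand-in for the reward actually configured here, and it cannot catch an alarm that fires during finalization) at least keeps an ordinary parse/verify stall from dragging the rank out of the all-gather:

```python
# Sketch only: swallow math_verify's TimeoutError inside the GRPO reward
# function and hand back a neutral reward instead of letting one rank hang.
# Assumes completions and solutions arrive as plain strings; the real
# open-r1 reward may use the conversational message format instead.
from math_verify import parse, verify

def accuracy_reward(completions, solution, **kwargs):
    rewards = []
    for completion, sol in zip(completions, solution):
        try:
            gold = parse(sol)
            answer = parse(completion)
            rewards.append(1.0 if verify(gold, answer) else 0.0)
        except Exception:  # includes math_verify's TimeoutError("Operation timed out!")
            rewards.append(0.0)
    return rewards
```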