
GRPO training timeout after trl vllm-serve 'reset prefix cache' #658

Open
@mgcyung

Description


An error, "Operation timed out!", is reported during GRPO training, as follows:

Exception ignored in: <finalize object at 0x7f7d7b9bc920; dead>
Traceback (most recent call last):
  File "/root/miniconda3/envs/open-r1/lib/python3.12/weakref.py", line 585, in __call__
    def __call__(self, _=None):

  File "/root/miniconda3/envs/open-r1/lib/python3.12/site-packages/math_verify/utils.py", line 43, in handler
    raise TimeoutError("Operation timed out!")
TimeoutError: Operation timed out!
[rank2]:[E516 06:09:16.295666043 ProcessGroupNCCL.cpp:629] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=49296, OpType=_ALLGATHER_BASE, NumelIn=192, NumelOut=576, Timeout(ms)=1800000) ran for 1800076 milliseconds before timing out.
[rank1]:[E516 06:09:16.296189442 ProcessGroupNCCL.cpp:629] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=49296, OpType=_ALLGATHER_BASE, NumelIn=192, NumelOut=576, Timeout(ms)=1800000) ran for 1800030 milliseconds before timing out.
[rank1]:[E516 06:09:16.297295298 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 1]  failure detected by watchdog at work sequence id: 49296 PG status: last enqueued work: 49296, last completed work: 49295
[rank1]:[E516 06:09:16.297336852 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank2]:[E516 06:09:16.297331820 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 2]  failure detected by watchdog at work sequence id: 49296 PG status: last enqueued work: 49296, last completed work: 49295
[rank2]:[E516 06:09:16.297375817 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank2]:[E516 06:09:16.678756367 ProcessGroupNCCL.cpp:681] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E516 06:09:16.678781139 ProcessGroupNCCL.cpp:695] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E516 06:09:16.682309480 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=49296, OpType=_ALLGATHER_BASE, NumelIn=192, NumelOut=576, Timeout(ms)=1800000) ran for 1800076 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f602a4581b6 in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f602b7a1c74 in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f602b7a37d0 in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f602b7a46ed in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f60748275c0 in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f609167f609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f609144a133 in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E516 06:09:17.984764325 ProcessGroupNCCL.cpp:681] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E516 06:09:17.984795568 ProcessGroupNCCL.cpp:695] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E516 06:09:17.987988525 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=49296, OpType=_ALLGATHER_BASE, NumelIn=192, NumelOut=576, Timeout(ms)=1800000) ran for 1800030 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f75077cb1b6 in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f7508b14c74 in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f7508b167d0 in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f7508b176ed in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f7551b9a5c0 in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f756e9f5609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f756e7c0133 in /lib/x86_64-linux-gnu/libc.so.6)

W0516 06:09:17.399000 3593512 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3593730 closing signal SIGTERM
W0516 06:09:17.402000 3593512 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3593731 closing signal SIGTERM
E0516 06:09:17.718000 3593512 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 2 (pid: 3593732) of binary: /root/miniconda3/envs/open-r1/bin/python3.12
Traceback (most recent call last):
  File "/root/miniconda3/envs/open-r1/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/miniconda3/envs/open-r1/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/root/miniconda3/envs/open-r1/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1182, in launch_command
    deepspeed_launcher(args)
  File "/root/miniconda3/envs/open-r1/lib/python3.12/site-packages/accelerate/commands/launch.py", line 861, in deepspeed_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
src/open_r1/grpo.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-16_06:09:17
  host      : 355be013b6cd
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 3593732)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 3593732
========================================================

The timeout occurs about thirty minutes after trl vllm-serve logs 'reset prefix cache':

INFO:     127.0.0.1:55546 - "POST /update_named_param/ HTTP/1.1" 200 OK
INFO 05-16 05:38:52 [block_pool.py:264] Successfully reset prefix cache
INFO:     127.0.0.1:55546 - "POST /reset_prefix_cache/ HTTP/1.1" 200 OK
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████| 192/192 [00:20<00:00,  9.52it/s, est. speed input: 1573.88 toks/s, output: 10739.85 toks/s]
INFO:     127.0.0.1:55546 - "POST /generate/ HTTP/1.1" 200 OK
INFO:     127.0.0.1:34192 - "GET /health/ HTTP/1.1" 200 OK
INFO:     127.0.0.1:34200 - "GET /get_world_size/ HTTP/1.1" 200 OK
INFO:     127.0.0.1:34214 - "POST /init_communicator/ HTTP/1.1" 200 OK
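For context, the Timeout(ms)=1800000 in the watchdog message is the default 30-minute timeout of the training process group, which matches the roughly thirty-minute gap between the last vLLM log line and the NCCL abort. A minimal sketch of one way to widen that window while debugging, assuming the standard GRPOTrainer setup from trl (ddp_timeout is inherited from transformers' TrainingArguments; the model, dataset, and reward function below are placeholders, not the configuration used in this run), is:

```python
# Sketch only: raise the distributed collective timeout while debugging the hang.
# This does not fix the underlying stall, it only gives the collective more time.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")  # placeholder dataset

def reward_len(completions, **kwargs):
    # Placeholder reward: prefer completions close to 100 characters.
    return [-abs(100 - len(c)) for c in completions]

training_args = GRPOConfig(
    output_dir="grpo-debug",
    use_vllm=True,     # talk to the external `trl vllm-serve` instance
    ddp_timeout=7200,  # seconds; default is 1800 (30 min), matching Timeout(ms)=1800000 above
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```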
