
GRPO training timeout after trl vllm-serve 'reset prefix cache' #658

Open
@mgcyung

Description


An error, "Operation timed out!", is reported during GRPO training, as follows:

Exception ignored in: <finalize object at 0x7f7d7b9bc920; dead>
Traceback (most recent call last):
  File "/root/miniconda3/envs/open-r1/lib/python3.12/weakref.py", line 585, in __call__
    def __call__(self, _=None):

  File "/root/miniconda3/envs/open-r1/lib/python3.12/site-packages/math_verify/utils.py", line 43, in handler
    raise TimeoutError("Operation timed out!")
TimeoutError: Operation timed out!
[rank2]:[E516 06:09:16.295666043 ProcessGroupNCCL.cpp:629] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=49296, OpType=_ALLGATHER_BASE, NumelIn=192, NumelOut=576, Timeout(ms)=1800000) ran for 1800076 milliseconds before timing out.
[rank1]:[E516 06:09:16.296189442 ProcessGroupNCCL.cpp:629] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=49296, OpType=_ALLGATHER_BASE, NumelIn=192, NumelOut=576, Timeout(ms)=1800000) ran for 1800030 milliseconds before timing out.
[rank1]:[E516 06:09:16.297295298 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 1]  failure detected by watchdog at work sequence id: 49296 PG status: last enqueued work: 49296, last completed work: 49295
[rank1]:[E516 06:09:16.297336852 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank2]:[E516 06:09:16.297331820 ProcessGroupNCCL.cpp:2168] [PG ID 0 PG GUID 0(default_pg) Rank 2]  failure detected by watchdog at work sequence id: 49296 PG status: last enqueued work: 49296, last completed work: 49295
[rank2]:[E516 06:09:16.297375817 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank2]:[E516 06:09:16.678756367 ProcessGroupNCCL.cpp:681] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E516 06:09:16.678781139 ProcessGroupNCCL.cpp:695] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E516 06:09:16.682309480 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=49296, OpType=_ALLGATHER_BASE, NumelIn=192, NumelOut=576, Timeout(ms)=1800000) ran for 1800076 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f602a4581b6 in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f602b7a1c74 in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f602b7a37d0 in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f602b7a46ed in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f60748275c0 in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f609167f609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f609144a133 in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E516 06:09:17.984764325 ProcessGroupNCCL.cpp:681] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E516 06:09:17.984795568 ProcessGroupNCCL.cpp:695] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E516 06:09:17.987988525 ProcessGroupNCCL.cpp:1895] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=49296, OpType=_ALLGATHER_BASE, NumelIn=192, NumelOut=576, Timeout(ms)=1800000) ran for 1800030 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f75077cb1b6 in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f7508b14c74 in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f7508b167d0 in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f7508b176ed in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f7551b9a5c0 in /root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f756e9f5609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f756e7c0133 in /lib/x86_64-linux-gnu/libc.so.6)

W0516 06:09:17.399000 3593512 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3593730 closing signal SIGTERM
W0516 06:09:17.402000 3593512 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3593731 closing signal SIGTERM
E0516 06:09:17.718000 3593512 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 2 (pid: 3593732) of binary: /root/miniconda3/envs/open-r1/bin/python3.12
Traceback (most recent call last):
  File "/root/miniconda3/envs/open-r1/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/miniconda3/envs/open-r1/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/root/miniconda3/envs/open-r1/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1182, in launch_command
    deepspeed_launcher(args)
  File "/root/miniconda3/envs/open-r1/lib/python3.12/site-packages/accelerate/commands/launch.py", line 861, in deepspeed_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/open-r1/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
src/open_r1/grpo.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-16_06:09:17
  host      : 355be013b6cd
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 3593732)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 3593732
========================================================

The timeout occurs about thirty minutes after trl vllm-serve logs 'reset prefix cache':

INFO:     127.0.0.1:55546 - "POST /update_named_param/ HTTP/1.1" 200 OK
INFO 05-16 05:38:52 [block_pool.py:264] Successfully reset prefix cache
INFO:     127.0.0.1:55546 - "POST /reset_prefix_cache/ HTTP/1.1" 200 OK
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████| 192/192 [00:20<00:00,  9.52it/s, est. speed input: 1573.88 toks/s, output: 10739.85 toks/s]
INFO:     127.0.0.1:55546 - "POST /generate/ HTTP/1.1" 200 OK
INFO:     127.0.0.1:34192 - "GET /health/ HTTP/1.1" 200 OK
INFO:     127.0.0.1:34200 - "GET /get_world_size/ HTTP/1.1" 200 OK
INFO:     127.0.0.1:34214 - "POST /init_communicator/ HTTP/1.1" 200 OK
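For context, the Timeout(ms)=1800000 in the watchdog message is the default 30-minute timeout of the training process group, which matches the roughly thirty-minute gap between the last vLLM log line and the NCCL abort. A minimal sketch of one way to widen that window while debugging, assuming the standard GRPOTrainer setup from trl (ddp_timeout is inherited from transformers' TrainingArguments; the model, dataset, and reward function below are placeholders, not the configuration used in this run), is:

```python
# Sketch only: raise the distributed collective timeout while debugging the hang.
# This does not fix the underlying stall, it only gives the collective more time.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")  # placeholder dataset

def reward_len(completions, **kwargs):
    # Placeholder reward: prefer completions close to 100 characters.
    return [-abs(100 - len(c)) for c in completions]

training_args = GRPOConfig(
    output_dir="grpo-debug",
    use_vllm=True,     # talk to the external `trl vllm-serve` instance
    ddp_timeout=7200,  # seconds; default is 1800 (30 min), matching Timeout(ms)=1800000 above
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```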
