[BUG] Qwen3 MoE 30B-A3B training stuck #7461

@hijkzzz

Description

Related issue: OpenRLHF/OpenRLHF#1097

Reproduction script:

set -x

pip install openrlhf

deepspeed --module openrlhf.cli.train_sft \
    --max_len 2048 \
    --dataset Open-Orca/OpenOrca \
    --input_key question \
    --output_key response \
    --train_batch_size 128 \
    --micro_train_batch_size 1 \
    --max_samples 2048 \
    --pretrain Qwen/Qwen3-30B-A3B-Instruct-2507 \
    --save_path ./checkpoint/llama3-8b-sft \
    --save_steps -1 \
    --logging_steps 1 \
    --eval_steps -1 \
    --zero_stage 3 \
    --max_epochs 1 \
    --bf16 \
    --flash_attn \
    --learning_rate 5e-6 \
    --gradient_checkpointing \
    --packing_samples \
    --adam_offload \
    --ring_attn_size 1

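When re-running the script above for debugging, the FlightRecorder hint in the watchdog messages below can be followed by exporting a few environment variables before the launch. A minimal sketch; the variable names are stock PyTorch/NCCL settings, the buffer size is an arbitrary example, and nothing here is OpenRLHF-specific:

# Diagnostic sketch (assumptions noted above): run before the same deepspeed launch command.
export NCCL_DEBUG=INFO                     # verbose NCCL logging
export TORCH_NCCL_TRACE_BUFFER_SIZE=20000  # enable FlightRecorder, as the watchdog suggests
export TORCH_NCCL_DUMP_ON_TIMEOUT=1        # dump collective traces when the watchdog fires
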
Logs

Train epoch:   0%|          | 0/1 [00:00<?, ?it/s]
[rank0]:[E801 01:06:56.381097551 ProcessGroupNCCL.cpp:632] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=75813, OpType=_ALLGATHER_BASE, NumelIn=393216, NumelOut=1572864, Timeout(ms)=600000) ran for 600024 milliseconds before timing out.
[rank3]:[E801 01:06:56.382370929 ProcessGroupNCCL.cpp:632] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=75813, OpType=_ALLGATHER_BASE, NumelIn=512, NumelOut=2048, Timeout(ms)=600000) ran for 600025 milliseconds before timing out.
[rank3]:[E801 01:06:56.384721086 ProcessGroupNCCL.cpp:2271] [PG ID 0 PG GUID 0(default_pg) Rank 3]  failure detected by watchdog at work sequence id: 75813 PG status: last enqueued work: 75816, last completed work: 75812
[rank0]:[E801 01:06:56.384723360 ProcessGroupNCCL.cpp:2271] [PG ID 0 PG GUID 0(default_pg) Rank 0]  failure detected by watchdog at work sequence id: 75813 PG status: last enqueued work: 75815, last completed work: 75812
[rank3]:[E801 01:06:56.384969271 ProcessGroupNCCL.cpp:670] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank0]:[E801 01:06:56.384973319 ProcessGroupNCCL.cpp:670] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank0]:[E801 01:06:56.385519523 ProcessGroupNCCL.cpp:2106] [PG ID 0 PG GUID 0(default_pg) Rank 0] First PG on this rank to signal dumping.
[rank3]:[E801 01:06:56.385533529 ProcessGroupNCCL.cpp:2106] [PG ID 0 PG GUID 0(default_pg) Rank 3] First PG on this rank to signal dumping.
[rank2]:[E801 01:06:56.452565616 ProcessGroupNCCL.cpp:632] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=75813, OpType=_ALLGATHER_BASE, NumelIn=393216, NumelOut=1572864, Timeout(ms)=600000) ran for 600096 milliseconds before timing out.
[rank2]:[E801 01:06:56.452792080 ProcessGroupNCCL.cpp:2271] [PG ID 0 PG GUID 0(default_pg) Rank 2]  failure detected by watchdog at work sequence id: 75813 PG status: last enqueued work: 75815, last completed work: 75812
[rank2]:[E801 01:06:56.452823880 ProcessGroupNCCL.cpp:670] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank2]:[E801 01:06:56.452920351 ProcessGroupNCCL.cpp:2106] [PG ID 0 PG GUID 0(default_pg) Rank 2] First PG on this rank to signal dumping.
[rank1]:[E801 01:06:56.453133290 ProcessGroupNCCL.cpp:632] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=75813, OpType=_ALLGATHER_BASE, NumelIn=393216, NumelOut=1572864, Timeout(ms)=600000) ran for 600096 milliseconds before timing out.
[rank1]:[E801 01:06:56.453346210 ProcessGroupNCCL.cpp:2271] [PG ID 0 PG GUID 0(default_pg) Rank 1]  failure detected by watchdog at work sequence id: 75813 PG status: last enqueued work: 75815, last completed work: 75812
[rank1]:[E801 01:06:56.453378280 ProcessGroupNCCL.cpp:670] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank1]:[E801 01:06:56.453472476 ProcessGroupNCCL.cpp:2106] [PG ID 0 PG GUID 0(default_pg) Rank 1] First PG on this rank to signal dumping.
[rank3]:[E801 01:06:57.804422478 ProcessGroupNCCL.cpp:1746] [PG ID 0 PG GUID 0(default_pg) Rank 3] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the debug info. Last enqueued NCCL work: 75816, last completed NCCL work: 75812.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank3]:[E801 01:06:57.805042520 ProcessGroupNCCL.cpp:1536] [PG ID 0 PG GUID 0(default_pg) Rank 3] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank3]:[E801 01:06:57.814985064 ProcessGroupNCCL.cpp:684] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E801 01:06:57.815012877 ProcessGroupNCCL.cpp:698] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E801 01:06:57.820446145 ProcessGroupNCCL.cpp:1899] [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=75813, OpType=_ALLGATHER_BASE, NumelIn=512, NumelOut=2048, Timeout(ms)=600000) ran for 600025 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:635 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f20c6d785e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x23d (0x7f20717f8a6d in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0xc80 (0x7f20717fa7f0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f20717fbefd in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xecdb4 (0x7f2167864db4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x9caa4 (0x7f216b330aa4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: __clone + 0x44 (0x7f216b3bda34 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=75813, OpType=_ALLGATHER_BASE, NumelIn=512, NumelOut=2048, Timeout(ms)=600000) ran for 600025 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:635 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f20c6d785e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x23d (0x7f20717f8a6d in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0xc80 (0x7f20717fa7f0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f20717fbefd in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xecdb4 (0x7f2167864db4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x9caa4 (0x7f216b330aa4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: __clone + 0x44 (0x7f216b3bda34 in /usr/lib/x86_64-linux-gnu/libc.so.6)

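One detail worth flagging in the traces above: for the same failed collective (SeqNum=75813, OpType=_ALLGATHER_BASE), ranks 0, 1 and 2 report NumelIn=393216 while rank 3 reports NumelIn=512, which lines up with the watchdog's "wrong sizes used across ranks" hypothesis (plausibly a desynchronized ZeRO-3 parameter gather, though that part is a guess). One way to check this from the launch side is PyTorch's stock TORCH_DISTRIBUTED_DEBUG setting, which adds consistency checks on collective calls across ranks; a sketch, nothing OpenRLHF-specific, and the extra overhead makes it suitable only for a debug run:

# Sketch: ask torch.distributed to verify that all ranks issue matching collectives
# (operation type and tensor shapes); a mismatch should surface as an explicit error
# instead of a 10-minute watchdog timeout.
export TORCH_DISTRIBUTED_DEBUG=DETAIL
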
Labels: bug (Something isn't working), training
