
NCCL timeout #324

Closed

Description

@zsp1993

Hi, I encounter an NCCL timeout error at the end of each epoch during training.
Here is part of the error message:
Epoch 6: 100%|██████████████████████████████████████████████████████████████████████████| 13988/13988 [2:18:37<00:00, 1.68it/s, loss=4.75, v_num=0Epoch 6, global step 49999: val_ssim_fid100_f1_total_mean reached 0.91746 (best 0.91746), saving model to "/home/jovyan/zsp01/workplace/lama/experiments/root_2024-07-24_00-30-46_train_lama-fourier_/models/epoch=6-step=49999.ckpt" as top 5
Epoch 7: 0%| | 0/13988 [00:00<?, ?it/s, loss=4.75, v_num=0]
[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=100015, OpType=ALLREDUCE, NumelIn=12673, NumelOut=12673, Timeout(ms)=600000) ran for 600094 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 0] Timeout at NCCL work: 100015, last enqueued NCCL work: 100022, last completed NCCL work: 100014.
[rank0]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=100015, OpType=ALLREDUCE, NumelIn=12673, NumelOut=12673, Timeout(ms)=600000) ran for 600094 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f87cacd6897 in /home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f87cbfb11b2 in /home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f87cbfb5fd0 in /home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f87cbfb731c in /home/jovyan/zsp01/miniconda3/envs/lama-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7f8817a68bf4 in /home/jovyan/zsp01/miniconda3/envs/lama-py3.10/bin/../lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7f88199c4609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f881978f133 in /lib/x86_64-linux-gnu/libc.so.6)

scripts_zsp/train_gaoping.sh: line 6: 5139 Aborted (core dumped) CUDA_VISIBLE_DEVICES=0,1 python bin/train.py -cn lama-fourier location=gaoping data.batch_size=40 +trainer.kwargs.resume_from_checkpoint=/home/jovyan/zsp01/workplace/lama/experiments/root_2024-07-23_14-44-09_train_lama-fourier_/models/last.ckpt
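
The log shows the default 10-minute (600000 ms) NCCL collective timeout expiring on an ALLREDUCE right after validation, i.e. while one rank is still busy with end-of-epoch work (validation metrics and top-5 checkpoint saving). A common workaround is to raise the process-group timeout so that slow epoch-end work does not trip the watchdog. Below is a minimal sketch of the underlying PyTorch call, assuming a plain torch.distributed setup; with the PyTorch Lightning trainer driven by bin/train.py, the same timedelta would have to be passed through its DDP strategy/plugin rather than called directly.

# Hedged sketch: raise the NCCL collective timeout from the default 10 minutes.
# Assumes a plain torch.distributed launch; under PyTorch Lightning the same
# timedelta would be passed via the DDP strategy instead of calling this here.
from datetime import timedelta
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(hours=2),  # generous budget for validation + checkpoint saving at epoch end
)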
