
Some NCCL operations have failed or timed out. #47

Open
dbcSep03 opened this issue Apr 18, 2024 · 6 comments

Comments

@dbcSep03

[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=61, OpType=_ALLGATHER_BASE, NumelIn=7168, NumelOut=14336, Timeout(ms)=1800000) ran for 1800086 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7e2ae5781d87 in /home/dongbingcheng/anaconda3/envs/llmfinetuning/lib/python3.9/site-packages/torch/lib/libc10.so)

I am training on two GPUs, and the error seems to appear right after the first epoch finishes.
I am using the repo's train.py as-is.
Could it be that when evaluation starts, the other process has not finished yet?
Should I add an accelerator.wait_for_everyone()?
Thanks for any help!
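A minimal sketch of where the barrier suggested above could sit, assuming the training loop uses Hugging Face accelerate (which the traceback indicates); the loop shape and the `evaluate` helper are hypothetical, not the repo's actual train.py:

```python
from accelerate import Accelerator

accelerator = Accelerator()

def evaluate(model, eval_dataloader):
    # Hypothetical helper standing in for the repo's evaluation step.
    model.eval()
    # ... run the validation loop ...
    model.train()

def train(model, optimizer, train_dataloader, eval_dataloader, num_epochs):
    model, optimizer, train_dataloader = accelerator.prepare(
        model, optimizer, train_dataloader
    )
    for epoch in range(num_epochs):
        for batch in train_dataloader:
            loss = model(**batch).loss
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()
        # Barrier: every rank must finish the epoch before eval begins,
        # so no rank enters the next collective while another is still training.
        accelerator.wait_for_everyone()
        if accelerator.is_main_process:
            evaluate(model, eval_dataloader)
        accelerator.wait_for_everyone()
```

Note that the barrier itself is an NCCL collective subject to the same timeout, so if a single-rank eval runs past 30 minutes the other ranks can still time out while waiting, which matches the diagnosis later in this thread.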

@dbcSep03
Author

[wandb log screenshot]
This is from the wandb log; you can see that the two runs also differ by quite a bit at the last step. How can this be fixed?

@charent
Owner

charent commented Apr 20, 2024

Try shrinking the dataset to a few dozen samples and see whether a full epoch completes; I cannot reproduce this problem on my side. The causes I found by searching suggest CUDA OOM, communication failures, and the like.

Alternatively, set up checkpointing: if the last step cannot save, fall back to an earlier checkpoint.

Or remove the evaluation.
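A hedged sketch of the checkpoint fallback using accelerate's state helpers (`save_state`/`load_state` are real accelerate APIs; the loop shape and directory layout are illustrative):

```python
from accelerate import Accelerator

def train_with_checkpoints(model, optimizer, train_dataloader, num_epochs):
    accelerator = Accelerator()
    model, optimizer, train_dataloader = accelerator.prepare(
        model, optimizer, train_dataloader
    )
    for epoch in range(num_epochs):
        ...  # training steps for one epoch
        # Save a resumable checkpoint (model/optimizer/RNG state) per epoch.
        accelerator.save_state(f"checkpoints/epoch_{epoch}")

# To resume after a crash at the final step, restore the last good checkpoint:
# accelerator.load_state("checkpoints/epoch_3")  # path is illustrative
```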

@weisili2016

@dbcSep03 How did you end up resolving this? I am running into the same problem.

@dbcSep03
Author

I trained on a single GPU for six days. From what I read online, the cause is that eval runs on a single GPU; once it took longer than 30 minutes, the code stopped. You can set the wait timeout to be effectively unlimited.
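For reference, a sketch of raising that timeout with accelerate's InitProcessGroupKwargs (NCCL has no truly infinite timeout, but a very large timedelta has the same practical effect; the 24-hour value is illustrative):

```python
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Raise the collective timeout from the 30-minute default (the
# Timeout(ms)=1800000 visible in the log above) so a long single-rank
# eval no longer trips the NCCL watchdog.
kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=24))
accelerator = Accelerator(kwargs_handlers=[kwargs])
```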

@AlexGao-XDU

> I trained on a single GPU for six days. From what I read online, the cause is that eval runs on a single GPU; once it took longer than 30 minutes, the code stopped. You can set the wait timeout to be effectively unlimited.

Could you share whether there is a fix for two-GPU training, and where the wait timeout is set? Thanks!

@charent
Owner

charent commented Sep 22, 2024

> I trained on a single GPU for six days. From what I read online, the cause is that eval runs on a single GPU; once it took longer than 30 minutes, the code stopped. You can set the wait timeout to be effectively unlimited.
>
> Could you share whether there is a fix for two-GPU training, and where the wait timeout is set? Thanks!

Remove the code that calls eval.
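A sketch of that suggestion, assuming a loop like the repo's train.py; `do_eval`, `train_one_epoch`, and `evaluate` are hypothetical names:

```python
do_eval = False  # disable evaluation entirely

def run(model, train_dataloader, eval_dataloader, num_epochs):
    for epoch in range(num_epochs):
        train_one_epoch(model, train_dataloader)  # hypothetical helper
        if do_eval:
            # This is the call that blocked a single rank past the NCCL
            # timeout; skipping it keeps all ranks in lockstep.
            evaluate(model, eval_dataloader)  # hypothetical helper
```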
