Some NCCL operations have failed or timed out. #47
Try cutting the dataset down to a few dozen samples and see whether a full epoch completes; I can't reproduce this issue on my side. From what I've found, the likely causes are CUDA OOM, communication failures, and the like. Alternatively, set up checkpointing — if the last step can't be saved, resume from an earlier checkpoint. Or remove the evaluation step.
@dbcSep03 How did you resolve this in the end? I'm hitting the same problem.
I trained on a single GPU for 6 days. From what I read online, this happens because evaluation runs on a single GPU; once it exceeded 30 minutes, the code aborted. You can set the wait time to be effectively unlimited.
Could you share whether there is a fix for dual-GPU training, and where the wait time is set? Thanks.
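For reference, when the training loop is driven by Hugging Face Accelerate, the collective timeout can be raised through InitProcessGroupKwargs. This is a minimal sketch under the assumption that the script constructs its own Accelerator; the 12-hour value is an arbitrary illustration, not a setting taken from this repo.

```python
# Sketch: raise the NCCL collective timeout so a long single-process
# evaluation does not trip the default 30-minute watchdog.
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Allow collectives to wait up to 12 hours instead of the default 1800 s.
process_group_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=12))
accelerator = Accelerator(kwargs_handlers=[process_group_kwargs])
```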
Remove the code that calls eval for evaluation.
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=61, OpType=_ALLGATHER_BASE, NumelIn=7168, NumelOut=14336, Timeout(ms)=1800000) ran for 1800086 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7e2ae5781d87 in /home/dongbingcheng/anaconda3/envs/llmfinetuning/lib/python3.9/site-packages/torch/lib/libc10.so)
I'm training on two GPUs, and the error seems to appear right after the first epoch finishes.
I'm using the train.py file provided in this implementation.
I suspect that when evaluation starts, the earlier processes haven't finished yet.
Add an accelerator.wait_for_everyone() call.
Thanks for the answer!
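For illustration, a minimal sketch of where such a barrier could sit in a training loop. The toy model, data, and evaluation stub below are hypothetical stand-ins, not this repo's actual train.py:

```python
# Hypothetical sketch: synchronize all ranks before evaluation so that no rank
# is left waiting inside an NCCL collective while the main process runs a long eval.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

# Toy model and data, stand-ins for the real fine-tuning setup.
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))
loader = DataLoader(dataset, batch_size=8)

model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for epoch in range(2):
    model.train()
    for inputs, labels in loader:
        loss = torch.nn.functional.cross_entropy(model(inputs), labels)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

    # Barrier: every rank reaches this point before evaluation or saving starts,
    # so a slow single-process eval does not strand the other ranks mid-collective.
    accelerator.wait_for_everyone()
    if accelerator.is_main_process:
        model.eval()
        # ... run evaluation on the main process only ...
```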