
Some NCCL operations have failed or timed out. #47

Open
dbcSep03 opened this issue Apr 18, 2024 · 6 comments

Comments

@dbcSep03

[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=61, OpType=_ALLGATHER_BASE, NumelIn=7168, NumelOut=14336, Timeout(ms)=1800000) ran for 1800086 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7e2ae5781d87 in /home/dongbingcheng/anaconda3/envs/llmfinetuning/lib/python3.9/site-packages/torch/lib/libc10.so)

I am training on two GPUs, and the error seems to appear right after the first epoch finishes.
I am using the repo's train.py as-is.
Could it be that when evaluation starts, the other process has not finished yet?
Should I add an accelerator.wait_for_everyone()?
Thanks for any help!
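A minimal sketch of where the barrier suggested above could sit, assuming the training loop uses Hugging Face accelerate (which the traceback indicates); the loop shape and the `evaluate` helper are hypothetical, not the repo's actual train.py:

```python
from accelerate import Accelerator

accelerator = Accelerator()

def evaluate(model, eval_dataloader):
    # Hypothetical helper standing in for the repo's evaluation step.
    model.eval()
    # ... run the validation loop ...
    model.train()

def train(model, optimizer, train_dataloader, eval_dataloader, num_epochs):
    model, optimizer, train_dataloader = accelerator.prepare(
        model, optimizer, train_dataloader
    )
    for epoch in range(num_epochs):
        for batch in train_dataloader:
            loss = model(**batch).loss
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()
        # Barrier: every rank must finish the epoch before eval begins,
        # so no rank enters the next collective while another is still training.
        accelerator.wait_for_everyone()
        if accelerator.is_main_process:
            evaluate(model, eval_dataloader)
        accelerator.wait_for_everyone()
```

Note that the barrier itself is an NCCL collective subject to the same timeout, so if a single-rank eval runs past 30 minutes the other ranks can still time out while waiting, which matches the diagnosis later in this thread.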

@dbcSep03
Author

[wandb log screenshot]
This is from the wandb log; you can see that the two runs also differ by quite a bit at the last step. How can this be fixed?

@charent
Owner

charent commented Apr 20, 2024

Try shrinking the dataset to a few dozen samples and see whether a full epoch completes; I cannot reproduce this problem on my side. The causes I found by searching suggest CUDA OOM, communication failures, and the like.

Alternatively, set up checkpointing: if the last step cannot save, fall back to an earlier checkpoint.

Or remove the evaluation.
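A hedged sketch of the checkpoint fallback using accelerate's state helpers (`save_state`/`load_state` are real accelerate APIs; the loop shape and directory layout are illustrative):

```python
from accelerate import Accelerator

def train_with_checkpoints(model, optimizer, train_dataloader, num_epochs):
    accelerator = Accelerator()
    model, optimizer, train_dataloader = accelerator.prepare(
        model, optimizer, train_dataloader
    )
    for epoch in range(num_epochs):
        ...  # training steps for one epoch
        # Save a resumable checkpoint (model/optimizer/RNG state) per epoch.
        accelerator.save_state(f"checkpoints/epoch_{epoch}")

# To resume after a crash at the final step, restore the last good checkpoint:
# accelerator.load_state("checkpoints/epoch_3")  # path is illustrative
```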

@weisili2016

@dbcSep03 How did you end up resolving this? I am running into the same problem.

@dbcSep03
Author

I trained on a single GPU for six days. From what I read online, the cause is that eval runs on a single GPU; once it took longer than 30 minutes, the code stopped. You can set the wait timeout to be effectively unlimited.
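For reference, a sketch of raising that timeout with accelerate's InitProcessGroupKwargs (NCCL has no truly infinite timeout, but a very large timedelta has the same practical effect; the 24-hour value is illustrative):

```python
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Raise the collective timeout from the 30-minute default (the
# Timeout(ms)=1800000 visible in the log above) so a long single-rank
# eval no longer trips the NCCL watchdog.
kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=24))
accelerator = Accelerator(kwargs_handlers=[kwargs])
```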

@AlexGao-XDU

> I trained on a single GPU for six days. From what I read online, the cause is that eval runs on a single GPU; once it took longer than 30 minutes, the code stopped. You can set the wait timeout to be effectively unlimited.

Could you share whether there is a fix for two-GPU training, and where the wait timeout is set? Thanks!

@charent
Owner

charent commented Sep 22, 2024

> I trained on a single GPU for six days. From what I read online, the cause is that eval runs on a single GPU; once it took longer than 30 minutes, the code stopped. You can set the wait timeout to be effectively unlimited.
>
> Could you share whether there is a fix for two-GPU training, and where the wait timeout is set? Thanks!

Remove the code that calls eval.
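A sketch of that suggestion, assuming a loop like the repo's train.py; `do_eval`, `train_one_epoch`, and `evaluate` are hypothetical names:

```python
do_eval = False  # disable evaluation entirely

def run(model, train_dataloader, eval_dataloader, num_epochs):
    for epoch in range(num_epochs):
        train_one_epoch(model, train_dataloader)  # hypothetical helper
        if do_eval:
            # This is the call that blocked a single rank past the NCCL
            # timeout; skipping it keeps all ranks in lockstep.
            evaluate(model, eval_dataloader)  # hypothetical helper
```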
