evaluate hangs during SFT #194

Closed
jiangtann opened this issue Sep 4, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@jiangtann
Contributor

Describe the bug

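When fine-tuning with `sh run_sft.sh` and debugging with pdb, I found that execution hangs at https://github.com/shibing624/MedicalGPT/blob/main/supervised_finetuning.py#L913. The evaluation driven by evaluation_strategy works fine, though: the evaluate that runs every eval_steps steps completes normally.

While it is stuck, GPU 0 stays at 100% utilization, which looks like a hang in torch.distributed inter-process communication. Memory usage on the other GPUs is mostly 0.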
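For anyone trying to narrow down a hang like this, one option is to have every rank dump its Python stack when a call takes too long, which shows where each process is blocked. The snippet below is only a minimal sketch using the standard-library `faulthandler` module. The helper name `dump_stacks_if_stuck` and the timeout value are hypothetical and not part of this repository, and reading the `RANK` environment variable assumes a torchrun-style launcher.

```python
import faulthandler
import os
import sys
from contextlib import contextmanager


@contextmanager
def dump_stacks_if_stuck(timeout_s=300.0, file=None):
    """Dump all thread stacks if the wrapped block runs longer than timeout_s.

    Hypothetical helper for illustration; not part of MedicalGPT.
    """
    # Resolve the output stream at call time; it must have a real file descriptor.
    out = sys.stderr if file is None else file
    # Assumption: RANK is set by the launcher; fall back to "0" otherwise.
    rank = os.environ.get("RANK", "0")
    print(f"[rank {rank}] watchdog armed ({timeout_s}s)", file=out, flush=True)
    # Re-dump every timeout_s seconds until cancelled; never kill the process.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=out, exit=False)
    try:
        yield
    finally:
        # Disarm the watchdog once the block finishes or raises.
        faulthandler.cancel_dump_traceback_later()
```

In the training script this would wrap the call that hangs, for example `with dump_stacks_if_stuck(120): metrics = trainer.evaluate()`, assuming `trainer` is the object the script already builds. Comparing the dumps from each rank shows whether all ranks reached the same call or only some of them did.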

@jiangtann added the bug label Sep 4, 2023
@shibing624
Owner

I'm not familiar with multi-GPU debugging.

kinghuin added a commit to kinghuin/MedicalGPT that referenced this issue Sep 7, 2023
shibing624 added a commit that referenced this issue Sep 8, 2023

2 participants