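When I run `sh run_sft.sh` to fine-tune and step through it with pdb, execution hangs at https://github.com/shibing624/MedicalGPT/blob/main/supervised_finetuning.py#L913. However, evaluation_strategy works as expected: the evaluate call that runs every eval_steps steps completes normally.
While it is hung, GPU 0 stays at 100% utilization, which looks like the process is stuck in torch.distributed inter-process communication. Most of the other GPUs show 0 memory usage.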
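One way to confirm where each rank is actually stuck, without attaching pdb to every process, is a watchdog that dumps the Python stack after a timeout. The sketch below uses only the standard-library `faulthandler` module. The helper names (`arm_hang_watchdog`, `disarm_hang_watchdog`, `demo`) and the timeout values are hypothetical and not part of MedicalGPT; the `RANK` environment variable is assumed to be set by the launcher (as `torchrun` does), defaulting to 0 otherwise.

```python
import faulthandler
import os
import sys
import tempfile
import time


def arm_hang_watchdog(timeout_s: float, log_file=sys.stderr) -> None:
    """Dump all thread stacks to log_file if not disarmed within timeout_s seconds."""
    # RANK is assumed to come from the launcher; it is only used for labeling.
    rank = os.environ.get("RANK", "0")
    log_file.write(f"[rank {rank}] watchdog armed for {timeout_s}s\n")
    log_file.flush()
    # exit=False: only report the stacks, do not kill the process.
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=log_file, exit=False)


def disarm_hang_watchdog() -> None:
    """Cancel the pending dump once the guarded call has returned."""
    faulthandler.cancel_dump_traceback_later()


def demo() -> str:
    """Simulate a hanging call and return what the watchdog wrote."""
    with tempfile.TemporaryFile(mode="w+") as log:
        arm_hang_watchdog(0.2, log_file=log)
        time.sleep(0.6)  # stand-in for the call that hangs
        disarm_hang_watchdog()
        log.seek(0)
        return log.read()
```

In the training script, the arm and disarm calls would wrap the suspect line. If only rank 0 reports a stack inside a collective operation while the other ranks have already moved on or exited, that would support the inter-process communication theory.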
I'm not familiar with multi-GPU debugging.
fix similar to shibing624#194
daa496b
Merge pull request #200 from kinghuin/patch-1
4f3a051
fix similar to issue #194