evaluate hangs during SFT #194

jiangtann · 2023-09-04T20:19:59Z

Describe the bug

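When fine-tuning with sh run_sft.sh and debugging with pdb, I found that the run hangs at https://github.com/shibing624/MedicalGPT/blob/main/supervised_finetuning.py#L913. The evaluation triggered by evaluation_strategy is unaffected: the evaluate that runs every eval_steps steps completes normally.

While it hangs, GPU 0 stays at 100% utilization, which looks like it is stuck in torch.distributed inter-process communication. Memory usage on the other GPUs is mostly 0.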
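If the hang really is in torch.distributed communication, one way it can arise is that only some ranks reach a collective call while the others never enter it. The sketch below is a stdlib-only analogy, not code from supervised_finetuning.py: threading.Barrier stands in for a collective operation, and run_ranks, ranks_that_evaluate, and the timeout are hypothetical names introduced for illustration. A real collective would typically keep blocking instead of timing out.

```python
import threading


def run_ranks(world_size, ranks_that_evaluate, timeout=1.0):
    """Simulate ranks entering a collective; return {rank: status}.

    Status is "done", "skipped", or "hung". Hypothetical helper: a Barrier
    is used as a stand-in for a torch.distributed collective.
    """
    # The barrier releases only once all world_size participants have arrived.
    collective = threading.Barrier(world_size)
    status = {}

    def worker(rank):
        if rank not in ranks_that_evaluate:
            # This rank never enters the collective.
            status[rank] = "skipped"
            return
        try:
            collective.wait(timeout=timeout)
            status[rank] = "done"
        except threading.BrokenBarrierError:
            # Timed out waiting for peers; a real collective would keep blocking.
            status[rank] = "hung"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return status


if __name__ == "__main__":
    # Only rank 0 enters the collective, as if evaluate ran on one process.
    print(run_ranks(4, {0}))
    # Every rank enters, so the collective completes.
    print(run_ranks(4, {0, 1, 2, 3}))
```

In the first call rank 0 is reported as hung while the other ranks skip the call; in the second every rank finishes. Whether this is what happens at the linked line is an assumption that would need to be checked against how each rank reaches evaluate.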

shibing624 · 2023-09-05T06:53:46Z

I'm not sure about multi-GPU debugging.

fix similar to issue #194

jiangtann added the bug (Something isn't working) label Sep 4, 2023

shibing624 closed this as completed Sep 5, 2023

kinghuin added a commit to kinghuin/MedicalGPT that referenced this issue Sep 7, 2023

fix similar to shibing624#194

daa496b

shibing624 added a commit that referenced this issue Sep 8, 2023

Merge pull request #200 from kinghuin/patch-1

4f3a051

fix similar to issue #194
