After DDP train processes have different best val paths #4319
Labels
bug (Something isn't working)
distributed (Generic distributed-related topic)
help wanted (Open to be worked on)
priority: 0 (High priority task)
🐛 Bug
Tied to huggingface/transformers#7852
With DDP there is no synchronisation or communication between processes to ensure the model has finished saving before it is loaded. By contrast, ddp_spawn and ddp_cpu do communicate, so that each process has the same best_val_path stored in the model after saving.
Run below on multi-gpu:
Output:
Expected behavior
The assertion does not fail.
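The missing synchronisation described above could look roughly like the following. This is only a minimal sketch of the idea, not the fix adopted in the library: the helper name `sync_best_path` and the choice of rank 0 as the source are assumptions made for illustration. It uses `torch.distributed.barrier` to wait until every process has reached the same point (so the save on the source rank has finished) and `torch.distributed.broadcast_object_list` to give every process the same path.

```python
from typing import Optional

import torch.distributed as dist


def sync_best_path(best_path: Optional[str], src: int = 0) -> Optional[str]:
    """Return the checkpoint path held by rank `src` on every process.

    Hypothetical helper for illustration only; it is not part of any library.
    """
    # Outside a distributed run there is nothing to synchronise.
    if not (dist.is_available() and dist.is_initialized()):
        return best_path

    # Wait until all processes arrive here, so the checkpoint written
    # by the source rank before this call is complete.
    dist.barrier()

    # Only the source rank contributes its value; other ranks receive it.
    # Note: with the NCCL backend, each process is assumed to have already
    # selected its own GPU (e.g. via torch.cuda.set_device).
    payload = [best_path if dist.get_rank() == src else None]
    dist.broadcast_object_list(payload, src=src)
    return payload[0]
```

Each process would call this once after saving and before loading, for example `path = sync_best_path(local_path)`, and then load from `path` instead of its own local value. In a single-process run the function simply returns its argument.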