-
Notifications
You must be signed in to change notification settings - Fork 3.6k
Closed
Labels
bugSomething isn't workingSomething isn't workingdesignIncludes a design discussionIncludes a design discussiondiscussionIn a discussion stageIn a discussion stagehelp wantedOpen to be worked onOpen to be worked on
Milestone
Description
🐛 Bug
If any callback implements on_save_checkpoint
, then that function runs only in the rank zero worker. I think this is suboptimal as you might want to do some communication across workers before saving state.
The lineage of calls here is:
- model checkpoint callback's on_validation_end is decorated with rank_zero_only
- this calls the checkpoint callback's _save_model
- which calls it's save_function
- the save_function is trainer.save_checkpoint
- save_checkpoint in checkpoint_connector calls dump_checkpoint
- dump_checkpoint calls trainer.on_save_checkpoint()
- trainer on_save_checkpoint() calls any callback implementing on_save_checkpoint
I think this could be avoided with more judicious usage of rank_zero_only
. the main benefit of rank_zero_only
in the model checkpoint callback to avoid redundant file I/O. For saving the checkpoint, that is taken care of by this check.
Other file I/O in the model checkpoint callback could be similarly guarded, and we should remove the decorator from on_validation_end
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't workingdesignIncludes a design discussionIncludes a design discussiondiscussionIn a discussion stageIn a discussion stagehelp wantedOpen to be worked onOpen to be worked on