
on_save_checkpoint callbacks run in rank zero only #3545

@ananthsub

🐛 Bug

If any callback implements on_save_checkpoint, that function runs only on the rank zero worker. I think this is suboptimal, since you might want to communicate across workers before saving state.

The lineage of calls here is:

I think this could be avoided with more judicious use of rank_zero_only. The main benefit of rank_zero_only in the model checkpoint callback is to avoid redundant file I/O. For saving the checkpoint, that is already taken care of by this check.

Other file I/O in the model checkpoint callback could be guarded similarly, and we should remove the decorator from on_validation_end.
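
To make the difference concrete, here is a rough, library-free sketch of the two behaviors. Everything in it is a hypothetical stand-in (RecordingCallback, save_decorated, save_guarded_io, simulate, and the explicit rank argument); it does not use the actual Lightning code paths and only simulates ranks in a single process.

```python
from typing import Callable, Dict, List, Tuple


class RecordingCallback:
    """Hypothetical callback that records which ranks reach its hook."""

    def __init__(self) -> None:
        self.ranks_seen: List[int] = []

    def on_save_checkpoint(self, rank: int) -> Dict[str, int]:
        # In a real job, cross-worker communication would happen here.
        self.ranks_seen.append(rank)
        return {"rank": rank}


def save_decorated(rank: int, callbacks: list, written: list) -> None:
    """Behavior described in this issue: the whole save path is rank zero only."""
    if rank != 0:
        return  # non-zero ranks never reach the callback hooks
    state = [cb.on_save_checkpoint(rank) for cb in callbacks]
    written.append(state)  # stands in for the file write


def save_guarded_io(rank: int, callbacks: list, written: list) -> None:
    """Proposed behavior: hooks run on every rank, only the write is guarded."""
    state = [cb.on_save_checkpoint(rank) for cb in callbacks]
    if rank == 0:
        written.append(state)  # stands in for the file write


def simulate(save_fn: Callable, world_size: int = 4) -> Tuple[List[int], int]:
    """Call save_fn once per simulated rank; return (ranks seen, files written)."""
    callback = RecordingCallback()
    written: list = []
    for rank in range(world_size):
        save_fn(rank, [callback], written)
    return callback.ranks_seen, len(written)


print(simulate(save_decorated))   # hook reached on rank 0 only
print(simulate(save_guarded_io))  # hook reached on all ranks, still one write
```

In both variants only one checkpoint is written, so guarding just the I/O keeps the benefit of avoiding redundant writes while letting every worker run its on_save_checkpoint hook.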

Labels: bug, design, discussion, help wanted
