Feature add ZeRO support to Checkpoint in a distributed configuration #2642
Conversation
Force-pushed from 40eadb4 to ecda83e
Force-pushed from ecda83e to 5c8be79
@sadra-barikbin thanks for the update, I left a few comments to address
Force-pushed from 5c8be79 to f239327
Thanks @sadra-barikbin
@sadra-barikbin GPU tests are broken: https://github.com/pytorch/ignite/runs/7914418582?check_suite_focus=true
…pytorch#2642)
* Implement feature
* Fix bug in docstring
* Fix bugs and tests
* Handle pytorch<1.9.0
* Fix mypy error
Fixes #2623
Description: Call the consolidate_state_dict method of the ZeRO optimizer on all ranks before calling state_dict on the target rank, the one responsible for saving the checkpoint; a sketch of this pattern is shown after the check list.

Check list:
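A minimal sketch of the consolidate-then-save pattern this PR adds, assuming PyTorch's torch.distributed.optim.ZeroRedundancyOptimizer; the function save_zero_checkpoint and the save_rank parameter are hypothetical names for illustration, not part of ignite's Checkpoint API:

```python
# Sketch only: consolidate_state_dict is a collective call, so every rank
# must invoke it before the target rank reads the full optimizer state.
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer


def save_zero_checkpoint(model, optimizer: ZeroRedundancyOptimizer, path, save_rank=0):
    # All ranks participate: each rank holds only its own shard of the
    # optimizer state, which is gathered onto `save_rank` here.
    optimizer.consolidate_state_dict(to=save_rank)
    # Only the target rank now holds the consolidated state and writes it out;
    # calling optimizer.state_dict() on other ranks would raise an error.
    if dist.get_rank() == save_rank:
        torch.save(
            {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
            path,
        )
```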