sadra-barikbin
Collaborator

Fixes #2623

Description: Call the consolidate_state_dict method of the ZeRO optimizer on all ranks before calling state_dict on the target rank, the one responsible for saving the checkpoint.
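
For illustration, the call order in the description can be sketched as below. This is a rough sketch, not the code merged in this PR: the helper name collect_optimizer_state and its rank / save_rank parameters are hypothetical, and the ZeRO optimizer is detected by duck typing (presence of consolidate_state_dict, as on torch.distributed.optim.ZeroRedundancyOptimizer) so the snippet stays free of a torch import.

```python
from typing import Any, Dict


def collect_optimizer_state(optimizer: Any, rank: int, save_rank: int = 0) -> Dict[str, Any]:
    """Return the optimizer state to be written by the saving rank.

    Hypothetical helper: `rank` is the current process rank and `save_rank`
    is the rank responsible for writing the checkpoint.
    """
    if hasattr(optimizer, "consolidate_state_dict"):
        # Consolidation is a collective operation, so every rank must reach
        # this call, not only the rank that writes the checkpoint.
        optimizer.consolidate_state_dict(to=save_rank)
        if rank != save_rank:
            # Non-target ranks do not hold the consolidated state,
            # so they contribute nothing to the checkpoint.
            return {}
    # Regular optimizers (and the target rank of a ZeRO optimizer)
    # expose their state through the usual state_dict().
    return optimizer.state_dict()
```

Every process would call the helper with its own rank, and only the process whose rank equals save_rank would pass the returned dictionary on to the code that writes the checkpoint.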

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)

@github-actions github-actions bot added the module: handlers Core Handlers module label Aug 13, 2022
@sadra-barikbin sadra-barikbin force-pushed the feature-add-ZeRO-support-to-checkpoint-issue-2623 branch from 40eadb4 to ecda83e Compare August 13, 2022 02:27
@sadra-barikbin sadra-barikbin force-pushed the feature-add-ZeRO-support-to-checkpoint-issue-2623 branch from ecda83e to 5c8be79 Compare August 16, 2022 02:55
Collaborator

@vfdev-5 vfdev-5 left a comment

@sadra-barikbin thanks for the update, I left a few comments to address

@sadra-barikbin sadra-barikbin force-pushed the feature-add-ZeRO-support-to-checkpoint-issue-2623 branch from 5c8be79 to f239327 Compare August 19, 2022 07:23
Collaborator

@vfdev-5 vfdev-5 left a comment

@vfdev-5 vfdev-5 merged commit 98b9286 into pytorch:master Aug 19, 2022
Collaborator

vfdev-5 commented Aug 19, 2022

vfdev-5 pushed a commit to vfdev-5/ignite that referenced this pull request Aug 22, 2022
…pytorch#2642)

* Implement feature

* Fix bug in docstring

* Fix bugs and tests

* Handle pytorch<1.9.0

* Fix mypy error
Labels
module: handlers Core Handlers module
Development

Successfully merging this pull request may close these issues.

Consolidate ZeRO state before checkpoint saving
2 participants