
Conversation

@jeffra (Collaborator) commented Jul 29, 2021

  • Correctness fix for PP+ZeRO gradient accumulation (the gas-boundary contract is sketched below)
  • Cherry-picked round-robin gradient partitioning fixes from master
  • Cherry-picked from master: ignore overlap/contiguous gradient settings for ZeRO-1
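
For context, here is a minimal sketch of the gradient-accumulation ("gas") boundary contract the first fix depends on. The loop and the `data_loader` name are illustrative, not code from this PR; `engine` stands in for a `deepspeed.DeepSpeedEngine`:

    # Each step() call is a no-op until the engine reaches a gas boundary,
    # i.e. until is_gradient_accumulation_boundary() returns True. ZeRO must
    # defer gradient reduction/partitioning on non-boundary micro-batches.
    for micro_batch in data_loader:
        loss = engine(micro_batch)
        engine.backward(loss)
        engine.step()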

jeffra and others added 4 commits July 28, 2021 18:37
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
* Make round robin gradient partitioning configurable (default False)

* Use the correct default

* Log config setting
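
The round-robin flag mentioned in the first commit would be toggled through the ZeRO section of the DeepSpeed config. A sketch, assuming the flag is exposed as `round_robin_gradients` (the name later DeepSpeed releases use) and that `model` is already constructed:

    import deepspeed

    ds_config = {
        "train_micro_batch_size_per_gpu": 4,
        "gradient_accumulation_steps": 8,
        "zero_optimization": {
            "stage": 1,
            # Off by default, per the commit above; enable explicitly to
            # round-robin gradient partitions across ranks.
            "round_robin_gradients": False,
        },
    }

    engine, _, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config)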

def allreduce_gradients(self, bucket_size=MEMORY_OPT_ALLREDUCE_SIZE):
    # Pass (PP) gas boundary flag to optimizer (required for zero)
    self.optimizer.is_gradient_accumulation_boundary = self.is_gradient_accumulation_boundary()
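
For readers outside the codebase, a hypothetical sketch of how a ZeRO-style optimizer could consume the flag set above; the class and method names below are illustrative, not DeepSpeed's actual internals:

    class ZeroOptimizerSketch:
        def __init__(self):
            # Overwritten by the engine on every allreduce_gradients() call.
            self.is_gradient_accumulation_boundary = True

        def reduce_gradients(self):
            if not self.is_gradient_accumulation_boundary:
                # Mid-accumulation: keep gradients local, skip communication.
                return
            self._reduce_and_partition()  # hypothetical helper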
Contributor commented:

Is self.optimizer guaranteed to have is_gradient_accumulation_boundary attribute?
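
One defensive option (a sketch of an alternative, not what the PR does) would be to guard the assignment so optimizers that never read the flag are left untouched:

    # Plain optimizers (e.g. torch.optim.Adam) accept the attribute
    # assignment but never read it; a guard makes the ZeRO dependency explicit.
    if hasattr(self.optimizer, 'is_gradient_accumulation_boundary'):
        self.optimizer.is_gradient_accumulation_boundary = \
            self.is_gradient_accumulation_boundary()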

Contributor commented:

We traced the break to this line; the crash went away after commenting it out.

@jeffra merged commit f93e22b into big-science on Jul 30, 2021
@jeffra deleted the jeffra/big-science-patches branch on July 30, 2021 at 22:58