test allreduce failures for diloco #226

Open: 1 commit to be merged into base: main

Conversation

@tushar00jain (Contributor) commented Jun 24, 2025

Summary:

  • Test the case where allreduce fails but no new nodes join
  • Add another event of type AllreduceFailure
  • The new event required modifying some manager code to inject the failure

@facebook-github-bot added the CLA Signed label Jun 24, 2025
# used to artificially fail the next allreduce by tests
self._TEST_should_fail_allreduce = False

def TEST_fail_allreduce(self) -> None:
Member

Can we create a wrapped/mocked PG that injects the failure instead?

Contributor Author

Yeah, let me see if we can create a wrapper so we can keep using the real PG. I'm thinking of using a mocked PG for deterministic simulation.

Member

Agree that we should try to keep test-related functionality contained in our test files!

Labels
CLA Signed

4 participants