
lighthouse: add commit_failures support #183


Merged
merged 1 commit into from
May 1, 2025

Conversation

d4l3k
Member

@d4l3k d4l3k commented May 1, 2025

This adds a new commit_failures field to the Lighthouse quorum. A commit failure means a network error has occurred, so we force a new quorum to ensure that all workers reconfigure.

This is required so #182 will gracefully recover after failure.
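
To illustrate the rule described above, here is a minimal sketch of the quorum bump in isolation. Only `commit_failures` and `quorum_id` appear in this PR; the `Participant` and `State` structs, the `replica_id` field, and the `bump_on_commit_failures` helper are hypothetical stand-ins for illustration and are not the actual Lighthouse types.

```rust
/// Hypothetical stand-in for a quorum participant. Only `commit_failures`
/// is taken from the PR; `replica_id` is an illustrative assumption.
#[derive(Debug, Clone)]
pub struct Participant {
    pub replica_id: String,
    pub commit_failures: i64,
}

/// Hypothetical stand-in for the Lighthouse state; only `quorum_id` is
/// taken from the PR.
#[derive(Debug, Default)]
pub struct State {
    pub quorum_id: i64,
}

/// Bumps `quorum_id` if any participant reports at least one commit failure,
/// which forces every worker to reconfigure on the next quorum.
/// Returns `true` when the id was bumped.
pub fn bump_on_commit_failures(state: &mut State, participants: &[Participant]) -> bool {
    let total: i64 = participants.iter().map(|p| p.commit_failures).sum();
    if total > 0 {
        state.quorum_id += 1;
        true
    } else {
        false
    }
}
```

In this sketch a single failing participant is enough to change the quorum id, since the check is on the sum across all participants rather than on a per-replica threshold.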

Test plan:

cargo test
pytest torchft/manager_integ_test.py

@d4l3k d4l3k requested a review from H-Huang May 1, 2025 17:13
@facebook-github-bot facebook-github-bot added the CLA Signed label May 1, 2025
@d4l3k d4l3k requested review from fduwjj and tushar00jain May 1, 2025 17:13
@d4l3k d4l3k force-pushed the d4l3k/recover_errors branch from 161af71 to cd3a73d Compare May 1, 2025 17:26
Member

@H-Huang H-Huang left a comment


Looks good!

} else if participants.iter().map(|p| p.commit_failures).sum::<i64>() > 0 {
    state.quorum_id += 1;
    info!(
        "Detected commit failures, bumping quorum_id to {}",
Member


nit: maybe not necessary since the manager already logs it, but for lighthouse we could also log the participants with commit failures here?
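
One way the suggestion above could look is sketched here: collect the participants that reported failures into a single string that can be passed to a log call. The `Participant` struct, its `replica_id` field, and the `describe_commit_failures` helper are hypothetical names for illustration; the change that was actually merged may log this differently.

```rust
/// Hypothetical stand-in for a quorum participant. Only `commit_failures`
/// is taken from the PR; `replica_id` is an illustrative assumption.
#[derive(Debug, Clone)]
pub struct Participant {
    pub replica_id: String,
    pub commit_failures: i64,
}

/// Builds a log-friendly summary of the participants that reported commit
/// failures, formatted as `replica_id=count` pairs in input order.
/// Returns `None` when no participant reported a failure.
pub fn describe_commit_failures(participants: &[Participant]) -> Option<String> {
    let failed: Vec<String> = participants
        .iter()
        .filter(|p| p.commit_failures > 0)
        .map(|p| format!("{}={}", p.replica_id, p.commit_failures))
        .collect();
    if failed.is_empty() {
        None
    } else {
        Some(failed.join(", "))
    }
}
```

The returned string could then be included as an extra argument in the existing log message, so the Lighthouse log names the failing replicas alongside the new quorum id.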

Member Author


done

@d4l3k d4l3k force-pushed the d4l3k/recover_errors branch from cd3a73d to e41fd5c Compare May 1, 2025 19:38
@d4l3k d4l3k force-pushed the d4l3k/recover_errors branch from e41fd5c to 7ffa3e8 Compare May 1, 2025 19:49
@d4l3k d4l3k merged commit 08ccd96 into main May 1, 2025
8 checks passed
@d4l3k d4l3k deleted the d4l3k/recover_errors branch May 1, 2025 20:06
3 participants