fix(eval): abort all ranks on per-rank exception instead of deadlocking #1304
Draft
Luodian wants to merge 1 commit into
When torchrun or accelerate launches multiple ranks and one rank's evaluate() raises, the current handler logs the error, appends None, and lets the rank return normally. The other ranks continue into the next collective (gather_object, barrier, etc.) and block on NCCL until the launcher's wall-clock timeout tears the job down. On a cluster this can waste hours of GPU time for a failure that is visible in a single rank's log at second zero.

Propagate the failure immediately: if torch.distributed is initialized when the exception reaches the outer handler, destroy the process group and sys.exit(1) so the launcher's elastic supervisor sees a non-zero exit and tears down the rest of the world.

No behavior change for single-process runs: the is_initialized check gates the new path entirely.
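The handler described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `abort_all_ranks` and `run_guarded` are hypothetical names, and the `dist` parameter is there only so the gate can be exercised without launching a real process group. The `torch.distributed` calls used (`is_available`, `is_initialized`, `destroy_process_group`) are the standard ones.

```python
import logging
import sys
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


def _load_dist() -> Optional[Any]:
    """Return torch.distributed if torch is importable, else None."""
    try:
        import torch.distributed as dist
    except ImportError:
        return None
    return dist


def abort_all_ranks(dist: Optional[Any] = None, exit_code: int = 1) -> bool:
    """Hypothetical helper: exit non-zero if a process group is initialized.

    Returns False (and does nothing) for single-process runs, so the caller
    can keep its existing behaviour.
    """
    dist = dist if dist is not None else _load_dist()
    if dist is None or not dist.is_available() or not dist.is_initialized():
        return False
    # Tear down this rank's process group, then exit non-zero so the
    # launcher sees the failure and stops the remaining ranks.
    dist.destroy_process_group()
    sys.exit(exit_code)


def run_guarded(evaluate: Callable[[], Any], dist: Optional[Any] = None) -> Optional[Any]:
    """Hypothetical stand-in for the outer handler around evaluate()."""
    try:
        return evaluate()
    except Exception:
        logger.exception("evaluate() raised")
        abort_all_ranks(dist)  # does not return when distributed is initialized
        return None  # single-process path: log and return None as before
```

Because `sys.exit` raises `SystemExit`, which is not a subclass of `Exception`, the exit propagates out of the handler rather than being swallowed by a surrounding `except Exception` block.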
Summary
When `torchrun`/`accelerate` launches multiple ranks and one rank's `evaluate()` raises, today's handler in `cli_evaluate` logs the error, appends `None`, and lets the rank return normally. The other ranks keep going and block on the next NCCL collective (`gather_object`, `barrier`, etc.) until the launcher's wall-clock timeout tears the job down. On a SLURM cluster this can waste hours of GPU time for a failure that is visible in a single rank's log at second zero.
Change
If `torch.distributed` is initialized when the outer exception handler fires, destroy the process group and call `sys.exit(1)` so the launcher's elastic supervisor sees a non-zero exit code and tears down the rest of the world immediately.
Non-goals / scope
No behavior change for single-process runs: the `is_initialized()` check gates the new path entirely.
Test plan