fix(eval): abort all ranks on per-rank exception instead of deadlocking #1304

Draft
Luodian wants to merge 1 commit into EvolvingLMMs-Lab:main from Luodian:fix/abort-all-ranks-on-eval-exception


Luodian commented Apr 23, 2026

Summary

When torchrun or accelerate launches multiple ranks and one rank's evaluate() raises, the current handler in cli_evaluate logs the error, appends None to results_list, and lets that rank return normally. The other ranks keep going and block in NCCL on the next collective (gather_object, barrier, etc.) until the launcher's wall-clock timeout tears the job down. On a SLURM cluster this can waste hours of GPU time on a failure that is visible in a single rank's log at second zero.

Change

If torch.distributed is initialized when the outer exception handler fires, destroy the process group and call sys.exit(1) so the launcher's elastic supervisor sees a non-zero exit code and tears down the rest of the world immediately.

results_list.append(None)
+ if torch.distributed.is_available() and torch.distributed.is_initialized():
+     try:
+         torch.distributed.destroy_process_group()
+     except Exception:
+         pass
+     sys.exit(1)
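
For readers who want to see the added lines in context, here is a minimal, self-contained sketch of the pattern. The names `abort_if_distributed` and `run_eval_guarded`, along with their arguments, are hypothetical and are not taken from cli_evaluate; only the body of the helper follows the diff above. The guarded import is an addition so the sketch also runs where PyTorch is not installed.

```python
import logging
import sys

try:
    import torch.distributed as dist
except ImportError:  # assumption: lets the sketch run without PyTorch
    dist = None

logger = logging.getLogger(__name__)


def abort_if_distributed(exit_code: int = 1) -> None:
    """Hypothetical helper: exit non-zero if a process group is active."""
    if dist is None or not (dist.is_available() and dist.is_initialized()):
        return  # single-process run: behave exactly as before
    try:
        dist.destroy_process_group()
    except Exception:
        # Teardown is best effort; the non-zero exit is what matters.
        pass
    sys.exit(exit_code)


def run_eval_guarded(evaluate_fn, results_list):
    """Hypothetical stand-in for the outer handler described in this PR."""
    try:
        results_list.append(evaluate_fn())
    except Exception:
        logger.exception("evaluation failed")
        results_list.append(None)
        abort_if_distributed()
```

In a single-process run the helper returns without side effects, so a failed task still yields None in results_list and the caller continues as it does today.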

Non-goals / scope

  • No behavior change for single-process runs — the is_initialized() check gates the new path entirely.
  • Does not try to distinguish retriable from fatal errors. The outer handler only fires after the inner eval loop already gave up.

Test plan

  • Multi-rank run where rank N raises manually: verify all ranks exit within seconds instead of hitting the wall-clock timeout.
  • Single-process run: verify the happy path is unchanged.
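
The two cases above can be approximated at unit level before the real multi-rank run. The sketch below is an assumption-laden stand-in, not part of this PR: it replaces torch.distributed with a stub object, and `abort_if_distributed`, `make_stub`, and `exit_code_of` are hypothetical names. It checks only the exit-code logic of the new branch and does not exercise NCCL or the launcher.

```python
import sys
from types import SimpleNamespace


def abort_if_distributed(dist, exit_code=1):
    """Hypothetical helper mirroring the new branch; `dist` stands in
    for the torch.distributed module."""
    if dist.is_available() and dist.is_initialized():
        try:
            dist.destroy_process_group()
        except Exception:
            pass  # best-effort teardown, as in the diff
        sys.exit(exit_code)


def make_stub(initialized, destroy_raises=False):
    """Build a fake torch.distributed and a list recording teardown calls."""
    calls = []

    def destroy():
        calls.append("destroy")
        if destroy_raises:
            raise RuntimeError("simulated teardown failure")

    stub = SimpleNamespace(
        is_available=lambda: True,
        is_initialized=lambda: initialized,
        destroy_process_group=destroy,
    )
    return stub, calls


def exit_code_of(dist):
    """Return the SystemExit code raised by the helper, or None."""
    try:
        abort_if_distributed(dist)
    except SystemExit as exc:
        return exc.code
    return None


if __name__ == "__main__":
    multi, multi_calls = make_stub(initialized=True)
    single, single_calls = make_stub(initialized=False)
    print(exit_code_of(multi), multi_calls)
    print(exit_code_of(single), single_calls)
```

The stub with an initialized group should yield an exit code of 1 after one teardown call, while the uninitialized stub should return without exiting. The end-to-end check still requires launching several ranks and forcing one to raise.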

When torchrun or accelerate launches multiple ranks and one rank's
evaluate() raises, the current handler logs the error, appends None,
and lets the rank return normally. The other ranks continue into the
next collective (gather_object, barrier, etc.) and block on NCCL until
the launcher's wall-clock timeout tears the job down. On a cluster
this can waste hours of GPU time for a failure that is visible in a
single rank's log at second zero.

Propagate the failure immediately: if torch.distributed is initialized
when the exception reaches the outer handler, destroy the process group
and sys.exit(1) so the launcher's elastic supervisor sees a non-zero
exit and tears down the rest of the world.

No behavior change for single-process runs — the is_initialized check
gates the new path entirely.