fix(eval): abort all ranks on per-rank exception instead of deadlocking by Luodian · Pull Request #1304 · EvolvingLMMs-Lab/lmms-eval

Luodian · 2026-04-23T07:09:21Z

Summary

When torchrun / accelerate launches multiple ranks and one rank's evaluate() raises, the current handler in cli_evaluate logs the error, appends None, and lets the rank return normally. The other ranks keep going and block on the next collective (gather_object, barrier, etc.) on NCCL until the launcher's wall-clock timeout tears the job down. On a SLURM cluster this can waste hours of GPU time for a failure that is visible in a single rank's log at second zero.
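A minimal sketch of why the launcher gets no signal in this situation. It is not lmms-eval code: `SWALLOWING_RANK` and `exit_code_of` are illustrative names, and the child process is only a stand-in for one rank whose handler logs the error and records None.

```python
import subprocess
import sys

# Hypothetical stand-in for a rank whose evaluate() raises, but whose handler
# only logs the error and appends None before returning normally.
SWALLOWING_RANK = """
results_list = []
try:
    raise RuntimeError("evaluate() failed on this rank")
except Exception as exc:
    print(f"error: {exc}")
    results_list.append(None)
"""


def exit_code_of(source: str) -> int:
    """Run `source` in a fresh interpreter and return the process exit code."""
    completed = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
    )
    return completed.returncode


# The exception was caught, so the process exits cleanly.
swallowed_exit_code = exit_code_of(SWALLOWING_RANK)
print(swallowed_exit_code)  # prints 0
```

An exit code of 0 looks like success to whatever supervises the process, so nothing prompts it to stop the remaining ranks.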

Change

If torch.distributed is initialized when the outer exception handler fires, destroy the process group and sys.exit(1) so the launcher's elastic supervisor sees a non-zero exit code and tears down the rest of the world immediately.

results_list.append(None)
+ if torch.distributed.is_available() and torch.distributed.is_initialized():
+     try:
+         torch.distributed.destroy_process_group()
+     except Exception:
+         pass
+     sys.exit(1)
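The same logic can be sketched as a standalone helper, which makes it easy to exercise without GPUs or a launcher. This is an illustration under assumptions, not the code in this PR: `abort_if_distributed` and `FakeDist` are hypothetical names, and `dist` is assumed to be any object exposing the three `torch.distributed` functions used in the diff.

```python
import sys


def abort_if_distributed(dist, exit_code=1):
    """Exit non-zero if a process group is active; otherwise do nothing.

    `dist` is assumed to behave like `torch.distributed`.
    """
    if not (dist.is_available() and dist.is_initialized()):
        return  # single-process run: keep the existing behavior
    try:
        dist.destroy_process_group()
    except Exception:
        # Best effort only: the process is about to exit anyway.
        pass
    sys.exit(exit_code)


class FakeDist:
    """Hypothetical test double standing in for torch.distributed."""

    def __init__(self, initialized):
        self.initialized = initialized
        self.destroyed = False

    def is_available(self):
        return True

    def is_initialized(self):
        return self.initialized

    def destroy_process_group(self):
        self.destroyed = True


# Single-process case: the helper returns and nothing is torn down.
single = FakeDist(initialized=False)
abort_if_distributed(single)

# Multi-rank case: the group is destroyed and SystemExit carries the code.
multi = FakeDist(initialized=True)
try:
    abort_if_distributed(multi)
except SystemExit as exc:
    observed_exit_code = exc.code
```

In a real handler the argument would be `torch.distributed` itself. The PR inlines the check in the exception handler rather than using a helper.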

Non-goals / scope

No behavior change for single-process runs — the is_initialized() check gates the new path entirely.
Does not try to distinguish retriable from fatal errors. The outer handler only fires after the inner eval loop already gave up.

Test plan

Multi-rank run where rank N is made to raise manually — verify all ranks exit within seconds instead of waiting for the wall-clock timeout.
Single-process run — verify no change in happy path.

When torchrun or accelerate launches multiple ranks and one rank's evaluate() raises, the current handler logs the error, appends None, and lets the rank return normally. The other ranks continue into the next collective (gather_object, barrier, etc.) and block on NCCL until the launcher's wall-clock timeout tears the job down. On a cluster this can waste hours of GPU time for a failure that is visible in a single rank's log at second zero. Propagate the failure immediately: if torch.distributed is initialized when the exception reaches the outer handler, destroy the process group and sys.exit(1) so the launcher's elastic supervisor sees a non-zero exit and tears down the rest of the world. No behavior change for single-process runs — the is_initialized check gates the new path entirely.

Luodian wants to merge 1 commit into EvolvingLMMs-Lab:main from Luodian:fix/abort-all-ranks-on-eval-exception
