
OnExceptionCheckpoint callback suppresses exceptions and results in NCCL timeout #20187

Open
@jackdent

Description


Bug description

When running Lightning with a multi-device training strategy (e.g. with DDP), using the OnExceptionCheckpoint callback:

  • silently swallows exceptions, which makes it hard to identify the cause of errors
  • results in an NCCL timeout

This is due to the following: OnExceptionCheckpoint.on_exception calls Trainer.save_checkpoint from inside the exception handler, so the checkpoint is only attempted on the rank where the exception was raised.

As described in the docstring for Trainer.save_checkpoint:

This method needs to be called on all processes in case the selected strategy is handling distributed checkpointing.

In practice, this means that our jobs eventually time out with an NCCL error, and don't print the original exception.
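A minimal, Lightning-free sketch of the failure mode, using threads as stand-in "ranks" and a barrier as a stand-in for the collective inside distributed checkpointing (the rank logic, messages, and timeout are hypothetical, chosen only to illustrate the hang):

```python
import threading

WORLD_SIZE = 2
# Barrier standing in for the collective inside Trainer.save_checkpoint:
# distributed checkpointing only completes if every rank participates.
checkpoint_collective = threading.Barrier(WORLD_SIZE)

results = {}

def run_rank(rank):
    try:
        # Hypothetical training step: only rank 0 hits an error.
        if rank == 0:
            raise RuntimeError("loss is NaN")
        # Rank 1 never raises, so it never enters the exception handler
        # and never joins the checkpoint collective.
        results[rank] = "finished epoch"
    except RuntimeError:
        # What an on_exception checkpoint hook does on the failing rank:
        # call save_checkpoint, which blocks on a collective. The other
        # rank never arrives, so this times out (the NCCL-timeout analogue).
        try:
            checkpoint_collective.wait(timeout=0.5)
            results[rank] = "checkpoint saved"
        except threading.BrokenBarrierError:
            # The original RuntimeError has already been caught here, so
            # what surfaces is the timeout, not the real cause -- matching
            # the reported behaviour.
            results[rank] = "collective timed out"

threads = [threading.Thread(target=run_rank, args=(r,)) for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # rank 0 times out at the collective; rank 1 finishes normally
```

The sketch shows why both symptoms appear together: the failing rank's exception is consumed by the handler, and the handler then blocks on a collective that the healthy ranks never join.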

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):

More info

No response

Metadata

Labels

bug (Something isn't working), needs triage (Waiting to be triaged by maintainers), ver: 2.4.x
