Mid-epoch resume causes a single unwanted validation step (which is not a sanity check) #20288

Open
@Youyoun

Bug description

This issue follows up on the discussion in #18110 (comment). The problem seems to affect a few people.

When resuming from a checkpoint saved by mid-epoch checkpointing (in my case, the ModelCheckpoint callback with train_time_interval set to 1 hour), one of two cases arises:

  • The checkpoint was saved mid-epoch, and training resumes without issues.
  • The checkpoint was saved at the end of an epoch, just before validation. In this case the model loads and performs a single validation step, so the evaluation covers only one batch and the metrics at that point are altered.
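To tell which of the two cases a given checkpoint falls into, it may help to look at the loop progress stored inside it. The sketch below is a diagnostic aid, not a fix. It assumes the checkpoint has a top-level "loops" entry whose nested keys end in "batch_progress" (this layout is an assumption about Lightning 2.x checkpoints and may differ between versions). `find_progress` and `load_loop_state` are hypothetical helper names, and the `example` dictionary is made-up data for illustration only.

```python
def find_progress(state, suffix="batch_progress", prefix=""):
    """Recursively collect entries whose key ends with `suffix` from a nested dict."""
    found = {}
    if not isinstance(state, dict):
        return found
    for key, value in state.items():
        path = f"{prefix}/{key}" if prefix else str(key)
        if str(key).endswith(suffix):
            found[path] = value
        elif isinstance(value, dict):
            found.update(find_progress(value, suffix, path))
    return found


def load_loop_state(ckpt_path):
    """Load a checkpoint file and return its loop state (empty dict if absent)."""
    # Imported lazily so find_progress can be used without PyTorch installed.
    import torch

    # weights_only=False because the checkpoint holds more than tensors;
    # only do this for checkpoints you trust.
    checkpoint = torch.load(ckpt_path, map_location="cpu", weights_only=False)
    # Assumption: loop progress is stored under a top-level "loops" key.
    return checkpoint.get("loops", {})


# Made-up loop state that mimics the assumed layout.
example = {
    "fit_loop": {
        "epoch_loop.batch_progress": {
            "current": {"ready": 100, "completed": 100},
            "is_last_batch": True,
        },
        "epoch_loop.val_loop.batch_progress": {
            "current": {"ready": 0, "completed": 0},
            "is_last_batch": False,
        },
    }
}

for path, progress in find_progress(example).items():
    print(path, progress)
```

On a real file, `find_progress(load_loop_state("path/to/checkpoint.ckpt"))` would list the saved counters (the path is a placeholder). If the training counters show the last batch of the epoch was reached while the validation counters are still at zero, the checkpoint was probably written just before validation.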

Relevant details:

  • Validation is performed only once per epoch (at the end).
  • I use two checkpoint callbacks: one with a time interval and one at the end of each epoch during training.
  • This seems to happen when an error occurs during validation.
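The setup described above could look roughly like the following sketch. It is not the reporter's actual code: the `lightning.pytorch` import path, the helper names `checkpoint_settings` and `build_trainer`, and `max_epochs=10` are assumptions made for illustration.

```python
from datetime import timedelta


def checkpoint_settings():
    """Keyword arguments for the two ModelCheckpoint callbacks described above."""
    # Time-based checkpoint: saves once per hour of training, possibly mid-epoch.
    time_based = {"train_time_interval": timedelta(hours=1)}
    # Epoch-based checkpoint: saves at the end of every training epoch.
    epoch_based = {"every_n_epochs": 1, "save_on_train_epoch_end": True}
    return time_based, epoch_based


def build_trainer(max_epochs=10):
    """Build a Trainer with both callbacks and validation once per epoch."""
    # Imported lazily so the settings above can be inspected without Lightning.
    # Assumption: the package is importable as `lightning.pytorch`.
    from lightning.pytorch import Trainer
    from lightning.pytorch.callbacks import ModelCheckpoint

    time_based, epoch_based = checkpoint_settings()
    return Trainer(
        max_epochs=max_epochs,  # hypothetical value
        check_val_every_n_epoch=1,  # validate once, at the end of each epoch
        callbacks=[ModelCheckpoint(**time_based), ModelCheckpoint(**epoch_based)],
    )


# Resuming would then pass the saved file to fit (placeholder path):
# build_trainer().fit(model, ckpt_path="path/to/checkpoint.ckpt")
```

The two callbacks are kept separate because ModelCheckpoint treats train_time_interval and every_n_epochs as mutually exclusive triggers.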

Do you know if this issue can be solved or bypassed? What are the reasons this happens?

What version are you seeing the problem on?

v2.2

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
- PyTorch Lightning Version: 2.2.0
- PyTorch Version: 2.0.0
- Python version: 3.11.8

More info

I don't have a minimal working example, but I can try to make one if needed.

cc @awaelchli

Labels

bug (Something isn't working), repro needed (The issue is missing a reproducible example), reproducibility