Open
Description
Bug description
This is an issue created from the discussion in this thread: #18110 (comment) which seems to affect a few people.
When resuming from a checkpoint produced by mid-epoch checkpointing (in my case, the ModelCheckpoint callback with train_time_interval set to 1 hour), two primary cases arise:
- The checkpoint was in mid-epoch and the training resumes without issues.
- The checkpoint was taken at the end of an epoch, just before validation. In that case the model loads and then runs a single validation step, i.e. an evaluation on a single batch, which alters the metrics at that point.
Certain relevant elements:
- Validation is only performed once every epoch (at the end)
- I use two checkpoint callbacks: one on a time interval and one at the end of each training epoch.
- This seems to happen when an error occurs during validation.
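For reference, the setup described above can be sketched roughly as follows (the `dirpath` values and other arguments are illustrative, not my exact configuration):

```python
from datetime import timedelta
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Time-based checkpoint: saves every hour, possibly mid-epoch.
time_ckpt = ModelCheckpoint(
    dirpath="checkpoints/time",  # illustrative path
    train_time_interval=timedelta(hours=1),
)

# Epoch-based checkpoint: saves at the end of every training epoch.
epoch_ckpt = ModelCheckpoint(
    dirpath="checkpoints/epoch",  # illustrative path
    save_on_train_epoch_end=True,
)

trainer = Trainer(callbacks=[time_ckpt, epoch_ckpt])
# Resuming later, e.g.:
# trainer.fit(model, datamodule=dm, ckpt_path="checkpoints/time/last.ckpt")
```

The bug appears when the resumed checkpoint comes from the time-based callback but happens to coincide with an epoch boundary.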
Do you know whether this issue can be solved or bypassed? What are the reasons this happens?
What version are you seeing the problem on?
v2.2
How to reproduce the bug
No response
Error messages and logs
No response
Environment
Current environment
- PyTorch Lightning Version: 2.2.0
- PyTorch Version: 2.0.0
- Python version: 3.11.8
More info
I don't have a minimal working example, but if need be I can try to make one.
cc @awaelchli