
Conversation

@littlebullGit
Contributor

@littlebullGit littlebullGit commented Jan 28, 2026

When resuming from a checkpoint with `reload_dataloaders_every_n_epochs`, the dataloader was not reloaded at the correct epoch. This was because `setup_data()` overwrote `_last_train_dl_reload_epoch` with the current epoch during checkpoint restoration, losing the record of when the dataloader was actually last reloaded.

The fix:

  1. Save `_last_train_dl_reload_epoch` in the checkpoint state.
  2. Restore `_last_train_dl_reload_epoch` from the checkpoint on load.
  3. Only update `_last_train_dl_reload_epoch` when actually reloading the dataloader or during initial setup (not when resuming).

This ensures `_should_reload_train_dl` returns the correct value after resuming from a checkpoint.

Backward compatible: old checkpoints without this key default to `float('-inf')`, which triggers a reload (the safest behavior).
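The fix above can be sketched in a few lines. This is a toy model of the behavior described in this PR, not the actual Lightning source; names such as `FitLoopSketch`, `reload_every_n`, and `reload_train_dl` are hypothetical stand-ins:

```python
class FitLoopSketch:
    """Toy model of the loop state this PR persists across checkpoints."""

    def __init__(self, reload_every_n: int):
        self.reload_every_n = reload_every_n
        self.epoch = 0
        # float("-inf") forces a reload on first use; it is also the
        # fallback for old checkpoints that lack the new key.
        self._last_train_dl_reload_epoch = float("-inf")

    @property
    def _should_reload_train_dl(self) -> bool:
        n = self.reload_every_n
        return bool(n) and self.epoch - self._last_train_dl_reload_epoch >= n

    def reload_train_dl(self) -> None:
        # (rebuild the dataloader here) -- record the reload epoch only
        # when actually reloading, so checkpoint restoration cannot
        # clobber it (fix step 3)
        self._last_train_dl_reload_epoch = self.epoch

    def state_dict(self) -> dict:
        # fix step 1: persist the last-reload epoch in the checkpoint
        return {"_last_train_dl_reload_epoch": self._last_train_dl_reload_epoch}

    def load_state_dict(self, state: dict) -> None:
        # fix step 2: restore it on resume; a missing key falls back to
        # -inf, which triggers a reload (safest behavior)
        self._last_train_dl_reload_epoch = state.get(
            "_last_train_dl_reload_epoch", float("-inf")
        )
```

For example, resuming at epoch 5 with `reload_every_n=3` and a last reload at epoch 3 correctly waits until epoch 6 to reload, instead of treating the resume epoch as the last reload epoch as before the fix.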

Fixes #21492


📚 Documentation preview 📚: https://pytorch-lightning--21514.org.readthedocs.build/en/21514/

@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Jan 28, 2026
@littlebullGit littlebullGit force-pushed the fix/21492-dataloader-reload-checkpoint branch from 5c24d70 to 6afeb53 on January 28, 2026 at 02:04
@codecov

codecov bot commented Jan 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79%. Comparing base (0c2025d) to head (267371d).
✅ All tests successful. No failed tests found.

❗ There is a different number of reports uploaded between BASE (0c2025d) and HEAD (267371d). Click for more details.

HEAD has 726 fewer uploads than BASE
Flag               BASE (0c2025d)   HEAD (267371d)
cpu                198              33
python             18               3
lightning_fabric   54               0
pytest             99               0
python3.12         54               9
python3.13         18               3
lightning          90               15
python3.11         36               6
python3.12.7       54               9
python3.10         18               3
pytorch2.8         18               6
pytorch_lightning  54               18
pytest-full        99               33
pytorch2.5.1       9                3
pytorch2.7         9                3
pytorch2.1         18               6
pytorch2.3         9                3
pytorch2.2.2       9                3
pytorch2.6         9                3
pytorch2.4.1       9                3
pytorch2.9         9                3
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #21514     +/-   ##
=========================================
- Coverage      87%      79%     -8%     
=========================================
  Files         270      267      -3     
  Lines       24071    24021     -50     
=========================================
- Hits        20867    18965   -1902     
- Misses       3204     5056   +1852     


Labels

pl Generic label for PyTorch Lightning package

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Dataloader reload bug when loading from checkpoint

1 participant