Resuming training from a mid-epoch checkpoint using a StatefulDataLoader #21276
Unanswered · SimonWagnerHD asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 0
I am trying to resume training from a checkpoint that was saved in the middle of an epoch. To ensure that the dataloader state is also restored, I use the StatefulDataLoader from https://github.com/pytorch/data/tree/main/torchdata/stateful_dataloader. For testing purposes I set the batch size to 1 and reduced my training set to 10 samples.
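For reference, here is a minimal sketch of the setup (simplified; `ToyDataModule`, the toy dataset, and the explicit forwarding of the loader state through the LightningDataModule `state_dict`/`load_state_dict` hooks are illustrative, not my exact code):

```python
# Minimal sketch: a DataModule that returns a StatefulDataLoader and
# forwards its state through the LightningDataModule state_dict hooks.
# ToyDataModule, the toy dataset, and the explicit forwarding are
# illustrative placeholders, not the exact code from this post.
import torch
from torch.utils.data import TensorDataset
from torchdata.stateful_dataloader import StatefulDataLoader
import lightning as L


class ToyDataModule(L.LightningDataModule):
    def setup(self, stage=None):
        # 10 training samples, as in the reduced setup described above
        data = torch.arange(10, dtype=torch.float32).unsqueeze(1)
        self.train_set = TensorDataset(data)

    def train_dataloader(self):
        # create the loader once and cache it, so a state restored in
        # load_state_dict is not thrown away when the Trainer requests
        # the dataloader afterwards
        if getattr(self, "_train_loader", None) is None:
            self._train_loader = StatefulDataLoader(self.train_set, batch_size=1)
        return self._train_loader

    def val_dataloader(self):
        return StatefulDataLoader(self.train_set, batch_size=1)

    def state_dict(self):
        # saved into the Lightning checkpoint alongside the loop state
        return {"train_loader": self._train_loader.state_dict()}

    def load_state_dict(self, state_dict):
        # restore the mid-epoch position before training resumes
        self.train_dataloader().load_state_dict(state_dict["train_loader"])
```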
When resuming training from a checkpoint that was saved after 4 steps, the dataloader correctly loads the remaining 6 samples; however, the validation epoch that should follow is skipped. Investigating this further, I found that with the normal DataLoader, Lightning performs a training epoch with 7 samples instead. So I suspect there is a mismatch between the Trainer's internal loop state and the dataloader state, which causes the validation epoch to be skipped.
Is this a known issue, and is there anything I can do to fix this?
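For concreteness, the experiment looks roughly like this (building on the `ToyDataModule` sketch above; `ToyModel`, the Trainer settings, and the checkpoint path are placeholders):

```python
# Sketch of the repro, reusing ToyDataModule from the sketch above.
# ToyModel and the checkpoint path are placeholders for illustration.
import torch
import lightning as L
from lightning.pytorch.callbacks import ModelCheckpoint


class ToyModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(1, 1)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return torch.nn.functional.mse_loss(self.layer(x), x)

    def validation_step(self, batch, batch_idx):
        pass

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


# first run: save a checkpoint every 4 training steps, i.e. mid-epoch
ckpt_cb = ModelCheckpoint(every_n_train_steps=4, save_top_k=-1)
trainer = L.Trainer(max_epochs=2, callbacks=[ckpt_cb], log_every_n_steps=1)
trainer.fit(ToyModel(), datamodule=ToyDataModule())

# second run: resume from the step-4 checkpoint; the dataloader yields
# the remaining 6 samples, but the validation epoch at the end of the
# resumed epoch is skipped
trainer = L.Trainer(max_epochs=2, callbacks=[ckpt_cb])
trainer.fit(ToyModel(), datamodule=ToyDataModule(),
            ckpt_path="path/to/epoch=0-step=4.ckpt")  # placeholder path
```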
As a side note, the RichProgressBar also does not properly reflect the advanced epoch progress when loading a mid-epoch checkpoint.