Resuming training from a mid-epoch checkpoint using a StatefulDataLoader #21276
Unanswered · SimonWagnerHD asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 0
I am trying to resume training from a checkpoint that was saved in the middle of an epoch. To ensure that the dataloader state is also restored, I use the StatefulDataLoader from https://github.com/pytorch/data/tree/main/torchdata/stateful_dataloader. For testing purposes I set the batch size to 1 and reduced my training set to 10 samples.
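For reference, here is a minimal sketch of the setup (simplified; `ToyDataModule`, the toy dataset, and the explicit forwarding of the loader state through the LightningDataModule `state_dict`/`load_state_dict` hooks are illustrative, not my exact code):

```python
# Minimal sketch: a DataModule that returns a StatefulDataLoader and
# forwards its state through the LightningDataModule state_dict hooks.
# ToyDataModule, the toy dataset, and the explicit forwarding are
# illustrative placeholders, not the exact code from this post.
import torch
from torch.utils.data import TensorDataset
from torchdata.stateful_dataloader import StatefulDataLoader
import lightning as L


class ToyDataModule(L.LightningDataModule):
    def setup(self, stage=None):
        # 10 training samples, as in the reduced setup described above
        data = torch.arange(10, dtype=torch.float32).unsqueeze(1)
        self.train_set = TensorDataset(data)

    def train_dataloader(self):
        # create the loader once and cache it, so a state restored in
        # load_state_dict is not thrown away when the Trainer requests
        # the dataloader afterwards
        if getattr(self, "_train_loader", None) is None:
            self._train_loader = StatefulDataLoader(self.train_set, batch_size=1)
        return self._train_loader

    def val_dataloader(self):
        return StatefulDataLoader(self.train_set, batch_size=1)

    def state_dict(self):
        # saved into the Lightning checkpoint alongside the loop state
        return {"train_loader": self._train_loader.state_dict()}

    def load_state_dict(self, state_dict):
        # restore the mid-epoch position before training resumes
        self.train_dataloader().load_state_dict(state_dict["train_loader"])
```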
When resuming training from a checkpoint that was saved after 4 steps, the dataloader correctly loads the remaining 6 samples; however, the validation epoch that should follow is skipped. Investigating this further, I found that with the normal DataLoader, Lightning performs a training epoch with 7 samples instead. So I suspect there is a mismatch between the Trainer's internal loop state and the dataloader state, which causes the validation epoch to be skipped.
Is this a known issue, and is there anything I can do to fix this?
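For concreteness, the experiment looks roughly like this (building on the `ToyDataModule` sketch above; `ToyModel`, the Trainer settings, and the checkpoint path are placeholders):

```python
# Sketch of the repro, reusing ToyDataModule from the sketch above.
# ToyModel and the checkpoint path are placeholders for illustration.
import torch
import lightning as L
from lightning.pytorch.callbacks import ModelCheckpoint


class ToyModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(1, 1)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return torch.nn.functional.mse_loss(self.layer(x), x)

    def validation_step(self, batch, batch_idx):
        pass

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


# first run: save a checkpoint every 4 training steps, i.e. mid-epoch
ckpt_cb = ModelCheckpoint(every_n_train_steps=4, save_top_k=-1)
trainer = L.Trainer(max_epochs=2, callbacks=[ckpt_cb], log_every_n_steps=1)
trainer.fit(ToyModel(), datamodule=ToyDataModule())

# second run: resume from the step-4 checkpoint; the dataloader yields
# the remaining 6 samples, but the validation epoch at the end of the
# resumed epoch is skipped
trainer = L.Trainer(max_epochs=2, callbacks=[ckpt_cb])
trainer.fit(ToyModel(), datamodule=ToyDataModule(),
            ckpt_path="path/to/epoch=0-step=4.ckpt")  # placeholder path
```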
As a side note, the RichProgressBar also does not properly reflect the advanced epoch progress when loading a mid-epoch checkpoint.