Failed to Resume Training w/ CombinedStreamingDataset #363
Labels: bug, duplicate, help wanted
🐛 Bug
My training run crashed so I tried to resume it from the previous PyTorch Lightning checkpoint.
When I do so, I get the following error --
To Reproduce
I'm not sure what a minimal reproduction of this bug would look like.
Code sample
My dataset is initialized as --
combined_dataset = CombinedStreamingDataset(datasets=train_datasets, iterate_over_all=True)
I do not weight the individual datasets.
Expected behavior
Able to resume training.
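To make the expectation concrete, here is a toy sketch of the resume semantics I have in mind. It is not LitData's implementation: `ToyCombinedDataset`, its `position` counter, and the round-robin ordering are hypothetical stand-ins, and only the `state_dict` / `load_state_dict` naming follows the usual PyTorch convention. The point is that restoring the saved state should continue the epoch exactly where it stopped, without repeating or skipping samples.

```python
from typing import Any, Dict, Iterator, List, Sequence, Tuple


class ToyCombinedDataset:
    """Hypothetical stand-in, NOT LitData code.

    Visits every sample of every dataset once per epoch (round-robin)
    and can resume mid-epoch from a saved position.
    """

    def __init__(self, datasets: Sequence[Sequence[Any]]) -> None:
        self.datasets: List[List[Any]] = [list(d) for d in datasets]
        # Number of samples already yielded in the current epoch.
        self.position: int = 0

    def _epoch_order(self) -> List[Tuple[int, int]]:
        # Deterministic (dataset index, sample index) order for one epoch.
        longest = max((len(d) for d in self.datasets), default=0)
        return [
            (i, j)
            for j in range(longest)
            for i, d in enumerate(self.datasets)
            if j < len(d)
        ]

    def __iter__(self) -> Iterator[Any]:
        order = self._epoch_order()
        # Start from the saved position instead of from zero.
        while self.position < len(order):
            i, j = order[self.position]
            self.position += 1
            yield self.datasets[i][j]
        self.position = 0  # epoch finished, reset for the next one

    def state_dict(self) -> Dict[str, int]:
        return {"position": self.position}

    def load_state_dict(self, state: Dict[str, int]) -> None:
        self.position = state["position"]


# Usage: consume part of an epoch, checkpoint, then resume in a new object.
data = [["a0", "a1", "a2"], ["b0", "b1"]]
original = ToyCombinedDataset(data)
iterator = iter(original)
seen_before_crash = [next(iterator) for _ in range(3)]
checkpoint = original.state_dict()

restored = ToyCombinedDataset(data)
restored.load_state_dict(checkpoint)
seen_after_resume = list(restored)
```

In this sketch, `seen_before_crash + seen_after_resume` covers each sample exactly once, in the same order as an uninterrupted epoch.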
Environment
Installed via (conda, pip, source): pip
Additional context
The error shown above occurs when I run on the latest version of LitData. Previously, we were training with an older version of LitData, and the following error appeared instead --
Also note that this error is happening several epochs into training with data that is stored locally (not being streamed from an S3 blob store).