
Fix: resume issues with resuming in combined streaming dataset in dataloader #507


Merged


bhimrazy
Collaborator

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not needed for typo fixes and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

How does this PR impact the user?

Currently, resuming a combined streaming dataset with the streaming dataloader fails: saving and then restoring the dataloader state from a checkpoint does not work as expected. This PR fixes the root cause of the error so that the dataloader can be resumed from a checkpoint, making training workflows that restart from checkpoints more reliable.

What does this PR do?

Fixes #331.

  • Fixes an IndexError raised when loading the dataloader state before any iteration has happened.
  • Enables resuming the dataloader state for combined datasets (non-weighted).
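To make the two fixes above concrete, here is a minimal, hypothetical sketch of the resume behavior they target. ToyCombinedDataset, its per-dataset "consumed" counters and its state_dict/load_state_dict methods are illustrative assumptions, not litdata's CombinedStreamingDataset or StreamingDataLoader code. For simplicity the toy walks the sub-datasets one after another, whereas the real non-weighted combined dataset may interleave them. It shows two properties: a state captured before any iteration can be loaded without raising, and a state captured mid-epoch resumes with exactly the samples not yet yielded.

```python
from typing import Any, Dict, Iterator, List, Optional, Sequence


class ToyCombinedDataset:
    """Hypothetical, simplified stand-in for a non-weighted combined dataset.

    Yields every sample of each sub-dataset, one sub-dataset after another,
    and counts how many samples were taken from each so that iteration can
    be resumed from a saved state. Not litdata's implementation.
    """

    def __init__(self, datasets: Sequence[Sequence[Any]]) -> None:
        self.datasets: List[List[Any]] = [list(d) for d in datasets]
        # Per-dataset counters; None means no iteration has started yet.
        self.consumed: Optional[List[int]] = None

    def state_dict(self) -> Dict[str, List[int]]:
        # Before the first iteration there is nothing to record.
        if self.consumed is None:
            return {}
        return {"consumed": list(self.consumed)}

    def load_state_dict(self, state: Dict[str, List[int]]) -> None:
        consumed = state.get("consumed")
        if consumed is None:
            # State captured before any iteration: start fresh instead of
            # indexing into per-dataset entries that do not exist.
            self.consumed = None
            return
        if len(consumed) != len(self.datasets):
            raise ValueError("saved state does not match the number of datasets")
        self.consumed = list(consumed)

    def __iter__(self) -> Iterator[Any]:
        if self.consumed is None:
            self.consumed = [0] * len(self.datasets)
        counts = self.consumed  # same list object, updated in place
        for i, data in enumerate(self.datasets):
            # Skip samples that were already consumed before the checkpoint.
            while counts[i] < len(data):
                item = data[counts[i]]
                counts[i] += 1
                yield item
        # Epoch finished: the next iteration starts from the beginning.
        self.consumed = None


# Usage: save the state mid-epoch, then resume in a fresh instance.
ds = ToyCombinedDataset([[1, 2, 3], [10, 20]])
it = iter(ds)
seen = [next(it) for _ in range(4)]
resumed = ToyCombinedDataset([[1, 2, 3], [10, 20]])
resumed.load_state_dict(ds.state_dict())
print(seen, list(resumed))  # [1, 2, 3, 10] [20]
```

Tracking consumed samples per sub-dataset, rather than a single global index, is what keeps the saved state meaningful when samples come from several datasets.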

This PR extends #362.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

A lot, actually! 🙃
PR #362 had been pending since last September; the underlying issue has now finally been resolved by #449.

@bhimrazy bhimrazy self-assigned this Mar 11, 2025
@bhimrazy
Collaborator Author

bhimrazy commented Mar 11, 2025

🤞🫣

@bhimrazy bhimrazy added the bugfix and enhancement labels Mar 11, 2025

codecov bot commented Mar 11, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79%. Comparing base (a8fc6a8) to head (bb16117).
Report is 1 commit behind head on main.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #507   +/-   ##
===================================
  Coverage    79%    79%           
===================================
  Files        39     39           
  Lines      5844   5848    +4     
===================================
+ Hits       4591   4602   +11     
+ Misses     1253   1246    -7     

@bhimrazy bhimrazy changed the title [wip] : Fix resume issues with resuming in combined streaming dataset in dataloader Fix: resume issues with resuming in combined streaming dataset in dataloader Mar 11, 2025
@bhimrazy bhimrazy marked this pull request as ready for review March 11, 2025 15:41
@tchaton tchaton merged commit 6848366 into Lightning-AI:main Mar 11, 2025
29 checks passed
@bhimrazy bhimrazy deleted the fix/combined-dataset-loading-states branch March 12, 2025 04:42
Labels
bugfix, enhancement
Development

Successfully merging this pull request may close these issues.

Bug: Inconsistent Behavior with StreamingDataloader loading states (specific to CombinedStreamingDataset)
2 participants