-
Notifications
You must be signed in to change notification settings - Fork 65
Fix: redundant chunk index download request in BinaryReader , when dataset in iter mode #535
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix: redundant chunk index download request in BinaryReader , when dataset in iter mode #535
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR removes a redundant download request for a single chunk index in BinaryReader to streamline the download process.
- Removed a separate download call for an individual chunk index when it differs from the last processed index
- Ensured that the download request is solely handled by the download call for multiple chunk indexes
Comments suppressed due to low confidence (1)
src/litdata/streaming/reader.py:370
- Please ensure that there are tests validating the behavior of the download call when only a single chunk index exists, to confirm that removing the individual download request does not introduce regressions.
self._prepare_thread.download(index.chunk_indexes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR removes a redundant download request by distinguishing between bulk and individual chunk downloads, addressing issue #534.
- Introduces a flag to track bulk download requests.
- Adjusts conditional logic in the read method to prevent duplicate download requests.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #535 +/- ##
===================================
Coverage 79% 79%
===================================
Files 39 39
Lines 5892 5896 +4
===================================
+ Hits 4631 4651 +20
+ Misses 1261 1245 -16 🚀 New features to boost your workflow:
|
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes issue #534 by preventing a redundant individual chunk download request when a bulk download for multiple chunk indexes has already been queued.
- Introduces a new flag (_chunks_queued_for_download) in the reader to track when a bulk download is initiated.
- Updates the download logic to conditionally request individual chunk downloads only when a bulk download isn’t queued.
- Adds tests to verify that the internal flag is correctly set and reset for both iterator and index access modes.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
File | Description |
---|---|
tests/streaming/test_dataset.py | Adds test cases to validate the _chunks_queued_for_download flag behavior. |
src/litdata/streaming/reader.py | Introduces and uses a flag to prevent redundant chunk download requests. |
Comments suppressed due to low confidence (1)
src/litdata/streaming/reader.py:372
- [nitpick] Consider renaming '_chunks_queued_for_download' to '_bulk_download_queued' for clarity, as the flag specifically indicates that a bulk download request has been queued.
self._chunks_queued_for_download = True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice catch
interesting, IMO, this whole thing is wrong. |
Before submitting
What does this PR do?
Fixes #534.
This PR fixes the redundant chunk download issue by preventing individual chunk download requests when a download for multiple chunk indexes has already been queued while the dataset is in iterator mode.
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃