Skip to content

Fix: redundant chunk index download request in BinaryReader , when dataset in iter mode #535

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

bhimrazy
Copy link
Collaborator

@bhimrazy bhimrazy commented Mar 29, 2025

Before submitting
  • Was this discussed/agreed via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #534.

This PR fixes the redundant chunk download issue by preventing individual chunk download requests when a download for multiple chunk indexes has already been queued while the dataset is in iterator mode.

Further details by copilot here

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@bhimrazy bhimrazy requested a review from Copilot March 29, 2025 08:02
@bhimrazy bhimrazy self-assigned this Mar 29, 2025
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR removes a redundant download request for a single chunk index in BinaryReader to streamline the download process.

  • Removed a separate download call for an individual chunk index when it differs from the last processed index
  • Ensured that the download request is solely handled by the download call for multiple chunk indexes
Comments suppressed due to low confidence (1)

src/litdata/streaming/reader.py:370

  • Please ensure that there are tests validating the behavior of the download call when only a single chunk index exists, to confirm that removing the individual download request does not introduce regressions.
self._prepare_thread.download(index.chunk_indexes)

@bhimrazy bhimrazy requested a review from Copilot March 29, 2025 09:40
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR removes a redundant download request by distinguishing between bulk and individual chunk downloads, addressing issue #534.

  • Introduces a flag to track bulk download requests.
  • Adjusts conditional logic in the read method to prevent duplicate download requests.

Copy link

codecov bot commented Mar 29, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79%. Comparing base (4ff18da) to head (fe45539).
Report is 1 commits behind head on main.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #535   +/-   ##
===================================
  Coverage    79%    79%           
===================================
  Files        39     39           
  Lines      5892   5896    +4     
===================================
+ Hits       4631   4651   +20     
+ Misses     1261   1245   -16     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@bhimrazy bhimrazy marked this pull request as ready for review March 29, 2025 10:32
@bhimrazy bhimrazy requested a review from Copilot March 29, 2025 10:34
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes issue #534 by preventing a redundant individual chunk download request when a bulk download for multiple chunk indexes has already been queued.

  • Introduces a new flag (_chunks_queued_for_download) in the reader to track when a bulk download is initiated.
  • Updates the download logic to conditionally request individual chunk downloads only when a bulk download isn’t queued.
  • Adds tests to verify that the internal flag is correctly set and reset for both iterator and index access modes.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
tests/streaming/test_dataset.py Adds test cases to validate the _chunks_queued_for_download flag behavior.
src/litdata/streaming/reader.py Introduces and uses a flag to prevent redundant chunk download requests.
Comments suppressed due to low confidence (1)

src/litdata/streaming/reader.py:372

  • [nitpick] Consider renaming '_chunks_queued_for_download' to '_bulk_download_queued' for clarity, as the flag specifically indicates that a bulk download request has been queued.
self._chunks_queued_for_download = True

@bhimrazy bhimrazy changed the title Remove redundant chunk index download request in BinaryReader Fix: redundant chunk index download request in BinaryReader , when dataset in iter mode Mar 29, 2025
Copy link
Collaborator

@tchaton tchaton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice catch

@bhimrazy bhimrazy merged commit ee03383 into Lightning-AI:main Mar 29, 2025
29 checks passed
@bhimrazy bhimrazy deleted the fix/534-repeated-chunk-indexes-added-to-download-queue branch March 29, 2025 20:39
@deependujha
Copy link
Collaborator

interesting, IMO, this whole thing is wrong.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Repeated Chunk Indexes Added to Download Queue, Causing Redundant Downloads
3 participants