Fix: redundant chunk index download request in BinaryReader, when dataset in iter mode #535

bhimrazy · 2025-03-29T08:02:12Z

Before submitting

Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
Did you read the contributor guideline, Pull Request section?
Did you make sure to update the docs?
Did you write any new necessary tests?

What does this PR do?

Fixes #534.

This PR fixes the redundant chunk download issue by preventing individual chunk download requests when a download for multiple chunk indexes has already been queued while the dataset is in iterator mode.
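To make the redundancy concrete, here is a minimal sketch, not the litdata implementation. `FakePrepareThread` and `read_with_redundancy` are hypothetical stand-ins invented for illustration; only the `download(...)` call shape and the idea of a `chunk_indexes` list come from this PR. It assumes the reader queues the full list once in iterator mode and then also queues each chunk individually whenever the chunk index changes.

```python
from typing import List, Optional


class FakePrepareThread:
    """Hypothetical stand-in that records every requested chunk index."""

    def __init__(self) -> None:
        self.requested: List[int] = []

    def download(self, chunk_indexes: List[int]) -> None:
        self.requested.extend(chunk_indexes)


def read_with_redundancy(chunk_indexes: List[int]) -> List[int]:
    """Queue a bulk download, then also queue each chunk as it is read."""
    thread = FakePrepareThread()
    thread.download(chunk_indexes)  # bulk request, as in iterator mode
    last_chunk_index: Optional[int] = None
    for chunk_index in chunk_indexes:
        if chunk_index != last_chunk_index:
            # Redundant: this chunk was already part of the bulk request.
            thread.download([chunk_index])
            last_chunk_index = chunk_index
    return thread.requested


requested = read_with_redundancy([0, 1, 2])
duplicates = len(requested) - len(set(requested))
print(requested, duplicates)
```

Under these assumptions every chunk is requested twice, which is the kind of duplication described in #534.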

Further details by copilot here

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

Copilot

Pull Request Overview

This PR removes a redundant download request for a single chunk index in BinaryReader to streamline the download process.

Removed a separate download call for an individual chunk index when it differs from the last processed index
Ensured that the download request is solely handled by the download call for multiple chunk indexes

Comments suppressed due to low confidence (1)

src/litdata/streaming/reader.py:370

Please ensure that there are tests validating the behavior of the download call when only a single chunk index exists, to confirm that removing the individual download request does not introduce regressions.

self._prepare_thread.download(index.chunk_indexes)

Copilot

Pull Request Overview

This PR removes a redundant download request by distinguishing between bulk and individual chunk downloads, addressing issue #534.

Introduces a flag to track bulk download requests.
Adjusts conditional logic in the read method to prevent duplicate download requests.

src/litdata/streaming/reader.py

codecov · 2025-03-29T09:57:49Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79%. Comparing base (4ff18da) to head (fe45539).
Report is 1 commit behind head on main.

Additional details and impacted files

@@         Coverage Diff         @@
##           main   #535   +/-   ##
===================================
  Coverage    79%    79%           
===================================
  Files        39     39           
  Lines      5892   5896    +4     
===================================
+ Hits       4631   4651   +20     
+ Misses     1261   1245   -16

Copilot

Pull Request Overview

This PR fixes issue #534 by preventing a redundant individual chunk download request when a bulk download for multiple chunk indexes has already been queued.

Introduces a new flag (_chunks_queued_for_download) in the reader to track when a bulk download is initiated.
Updates the download logic to conditionally request individual chunk downloads only when a bulk download isn’t queued.
Adds tests to verify that the internal flag is correctly set and reset for both iterator and index access modes.
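The flag-based logic summarized above can be sketched roughly as follows. This is an illustrative toy, not the code in `src/litdata/streaming/reader.py`: `ToyReader`, `RecordingThread`, and the `read(chunk_index, chunk_indexes)` signature are assumptions, and only the names `_chunks_queued_for_download`, `_prepare_thread`, and `download` appear in this thread. The `close` method mirrors the commit message about resetting the last chunk index and queued state.

```python
from typing import List, Optional


class RecordingThread:
    """Hypothetical stand-in that records each download call."""

    def __init__(self) -> None:
        self.calls: List[List[int]] = []

    def download(self, chunk_indexes: List[int]) -> None:
        self.calls.append(list(chunk_indexes))


class ToyReader:
    """Toy reader; the real BinaryReader logic may differ."""

    def __init__(self) -> None:
        self._prepare_thread = RecordingThread()
        self._last_chunk_index: Optional[int] = None
        self._chunks_queued_for_download = False

    def read(self, chunk_index: int, chunk_indexes: Optional[List[int]] = None) -> None:
        if chunk_indexes is not None:
            # Iterator mode: queue all chunks once and remember that we did.
            self._prepare_thread.download(chunk_indexes)
            self._chunks_queued_for_download = True
        if chunk_index != self._last_chunk_index:
            # Index access: request a single chunk only if no bulk request is queued.
            if not self._chunks_queued_for_download:
                self._prepare_thread.download([chunk_index])
            self._last_chunk_index = chunk_index

    def close(self) -> None:
        # Reset state so a later session starts clean.
        self._last_chunk_index = None
        self._chunks_queued_for_download = False


# Iterator-style access: one bulk request, no per-chunk requests.
iter_reader = ToyReader()
iter_reader.read(0, chunk_indexes=[0, 1, 2])
iter_reader.read(1)
iter_reader.read(2)
iter_calls = iter_reader._prepare_thread.calls
iter_reader.close()

# Index-style access: one request per new chunk index.
index_reader = ToyReader()
index_reader.read(5)
index_reader.read(5)
index_reader.read(6)
index_calls = index_reader._prepare_thread.calls
```

In this sketch the flag stays `False` for index access, so single-chunk requests still go out, which is the case the first Copilot review asked to keep covered by tests.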

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
tests/streaming/test_dataset.py	Adds test cases to validate the _chunks_queued_for_download flag behavior.
src/litdata/streaming/reader.py	Introduces and uses a flag to prevent redundant chunk download requests.

Comments suppressed due to low confidence (1)

src/litdata/streaming/reader.py:372

[nitpick] Consider renaming '_chunks_queued_for_download' to '_bulk_download_queued' for clarity, as the flag specifically indicates that a bulk download request has been queued.

self._chunks_queued_for_download = True

tchaton

Really nice catch

deependujha · 2025-04-06T05:57:17Z

interesting, IMO, this whole thing is wrong.

Remove redundant chunk index download request in BinaryReader

3015686

bhimrazy requested a review from Copilot March 29, 2025 08:02

bhimrazy self-assigned this Mar 29, 2025

Copilot AI reviewed Mar 29, 2025

bhimrazy added the bugfix label Mar 29, 2025

update the condition

4655138

bhimrazy requested a review from Copilot March 29, 2025 09:40

Copilot AI reviewed Mar 29, 2025

Reset last chunk index and queued download state on close

f90e19a

add test case for dataset as iterator and non iterator

03745c7

bhimrazy marked this pull request as ready for review March 29, 2025 10:32

bhimrazy requested review from tchaton, lantiga and justusschock as code owners March 29, 2025 10:32

pre-commit-ci bot and others added 2 commits March 29, 2025 10:32

[pre-commit.ci] auto fixes from pre-commit.com hooks

9c43479

for more information, see https://pre-commit.ci

fix typo

4b65a1b

bhimrazy requested a review from Copilot March 29, 2025 10:34

Copilot AI reviewed Mar 29, 2025

update comment for clarity on chunk download conditions

fe45539

bhimrazy changed the title ~~Remove redundant chunk index download request in BinaryReader~~ Fix: redundant chunk index download request in BinaryReader, when dataset in iter mode Mar 29, 2025

tchaton approved these changes Mar 29, 2025

bhimrazy merged commit ee03383 into Lightning-AI:main Mar 29, 2025
29 checks passed

bhimrazy deleted the fix/534-repeated-chunk-indexes-added-to-download-queue branch March 29, 2025 20:39
