feat: fast random access for streamingDataset without chunk downloading #631
Conversation
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files — Coverage Diff:

| | main | #631 | +/- |
|---|---|---|---|
| Coverage | 83% | 83% | -0% |
| Files | 43 | 43 | |
| Lines | 6611 | 6696 | +85 |
| Hits | 5507 | 5553 | +46 |
| Misses | 1104 | 1143 | +39 |
Pull Request Overview
Adds fast random access for `StreamingDataset` by introducing a `no_store` flag and byte-range fetching to avoid downloading entire chunks.
- Introduces `no_store` in `BinaryReader`, `Cache`, and `StreamingDataset` to conditionally fetch only the requested item bytes (see the usage sketch below).
- Implements `download_bytes` support in `S3Downloader` and `GCPDownloader`, and adds `read_item_bytes` in `BinaryReader`.
- Adds tests for `reader.read_item_bytes` and `config.download_chunk_bytes_from_index`.
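A minimal usage sketch of the flow these changes enable, assuming `no_store` is exposed on `StreamingDataset` as described above; the bucket path is a placeholder, and the review thread suggests the flag may later be renamed to `on_demand_bytes`:

```python
# Minimal sketch: fast random access without caching whole chunks.
# `no_store=True` follows this PR's description; the final flag name
# may differ (`on_demand_bytes` is mentioned in review).
from litdata import StreamingDataset

dataset = StreamingDataset(input_dir="s3://my-bucket/my-dataset", no_store=True)

sample = dataset[42]     # downloads only the byte range of item 42
window = dataset[10:13]  # slice support added to __getitem__ in this PR
```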
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.
Summary per file (a byte-range fetch sketch follows the table):
| File | Description |
|---|---|
| tests/streaming/test_reader.py | Added test_reader_read_bytes to validate per-item byte reads. |
| tests/streaming/test_config.py | Added test_config_download_chunk_bytes for config byte reads. |
| src/litdata/streaming/reader.py | Added no_store, read_item_bytes, and refactored read. |
| src/litdata/streaming/item_loader.py | Added default and concrete load_item_from_bytes methods. |
| src/litdata/streaming/downloader.py | Added base download_bytes, S3 and GCP implementations. |
| src/litdata/streaming/dataset.py | Propagated no_store and added slice support in __getitem__. |
| src/litdata/streaming/config.py | Added download_chunk_bytes_from_index for local/remote files. |
| src/litdata/streaming/cache.py | Propagated no_store through Cache. |
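To illustrate the technique behind `download_bytes`, here is a hedged sketch of ranged reads against S3 and GCS using the public boto3 and google-cloud-storage APIs. The function names and the half-open `[begin, end)` convention are illustrative assumptions, not the PR's exact signatures:

```python
# Sketch of byte-range fetching: download only the requested span of a
# remote object instead of the whole chunk file.
import boto3
from google.cloud import storage


def s3_download_bytes(bucket: str, key: str, begin: int, end: int) -> bytes:
    """Fetch bytes [begin, end) of an S3 object via an HTTP Range request."""
    s3 = boto3.client("s3")
    # S3 Range headers are inclusive on both ends, hence `end - 1`.
    resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={begin}-{end - 1}")
    return resp["Body"].read()


def gcs_download_bytes(bucket: str, blob_name: str, begin: int, end: int) -> bytes:
    """Fetch bytes [begin, end) of a GCS blob."""
    client = storage.Client()
    blob = client.bucket(bucket).blob(blob_name)
    # download_as_bytes takes an inclusive `end` offset, hence `end - 1`.
    return blob.download_as_bytes(start=begin, end=end - 1)
```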
Comments suppressed due to low confidence (2)
tests/streaming/test_reader.py:7
- The test uses `os.path.join` but `os` is not imported. Please add `import os` at the top of this file.

import pytest
src/litdata/streaming/item_loader.py:110
- There are two `load_item_from_bytes` definitions in this class (one stub and one concrete). Consolidate into a single method to avoid unexpected overrides.

def load_item_from_bytes(
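For the suppressed comment above, a sketch of the suggested consolidation: keep a single `load_item_from_bytes` stub on the base loader and override it exactly once per concrete loader. The class names mirror litdata's loaders, but the method bodies are illustrative assumptions:

```python
import numpy as np


class BaseItemLoader:
    def load_item_from_bytes(self, raw: bytes):
        # Single stub; concrete loaders override this exactly once.
        raise NotImplementedError(
            f"{type(self).__name__} does not support byte-level item loads."
        )


class TokensLoader(BaseItemLoader):
    def load_item_from_bytes(self, raw: bytes):
        # Illustrative: deserialize the raw item bytes. The real loader's
        # dtype comes from the dataset config; int64 is an assumption here.
        return np.frombuffer(raw, dtype=np.int64)
```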
bhimrazy
left a comment
Looking good so far!
I’ve added a few comments and questions.
It would also be helpful to include benchmarks in the description comparing a few individual item loads with and without `on_demand_bytes`.
tchaton
left a comment
Really neat!
Before submitting
What does this PR do?
Fixes #14
`PyTreeLoader` and encrypted/compressed datasets are not supported.
Benchmark
For the fast-random-access PR, chunk size 64 MB (run on a Lightning Studio):
- `main` branch: 20-22 seconds for 10 random accesses
- `fast-random-access`: 5-6 seconds for 10 random accesses

Note: the random indices were chosen far enough apart that the accesses are not all served from the same chunk.
0.83 seconds vs. 2 seconds
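Reusing the placeholder dataset path from the earlier sketch, a hedged reproduction of this kind of measurement (the `no_store` flag name again follows the PR description):

```python
import random
import time

from litdata import StreamingDataset

# Assumed flag name per this PR's description; the review thread
# mentions a possible rename to `on_demand_bytes`.
dataset = StreamingDataset(input_dir="s3://my-bucket/my-dataset", no_store=True)

# Spread indices out so the accesses don't all hit the same chunk.
indices = random.sample(range(len(dataset)), 10)

start = time.perf_counter()
for i in indices:
    _ = dataset[i]
print(f"10 random accesses: {time.perf_counter() - start:.2f}s")
```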
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃