feat(litdata): Add Support for StreamingRawDataset to Stream Raw Datasets from Cloud Storage
#652
Codecov Report
Coverage diff, main vs. #652:
| | main | #652 | +/- |
|---|---|---|---|
| Coverage | 83% | 83% | -0% |
| Files | 49 | 50 | +1 |
| Lines | 6812 | 7097 | +285 |
| Hits | 5686 | 5910 | +224 |
| Misses | 1126 | 1187 | +61 |
for more information, see https://pre-commit.ci
… remove from extras.txt
Pull Request Overview
This PR introduces a new StreamingRawDataset class for efficient streaming of raw files directly from cloud storage (S3, GCS, Azure) without requiring preprocessing. The implementation supports scalable, on-the-fly access to raw datasets with features like fast indexing, local caching, and both synchronous and asynchronous file downloads.
Key changes include:
- New StreamingRawDataset class with support for __getitem__ and __getitems__ methods
- Enhanced downloader classes with download_fileobj and adownload_fileobj methods for direct file streaming
- File indexing system with ZSTD compression for efficient metadata caching
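To illustrate how the per-item and batched access paths listed above might fit together, here is a minimal, self-contained sketch. It is not the PR's implementation: `InMemoryDownloader` and `ToyRawDataset` are hypothetical stand-ins, the `download_fileobj` / `adownload_fileobj` signatures are assumed for illustration, and a plain dict replaces cloud storage.

```python
import asyncio
import io
from typing import Dict, Iterable, List


class InMemoryDownloader:
    """Hypothetical stand-in for a cloud downloader; serves bytes from a dict."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self._objects = objects

    def download_fileobj(self, path: str, fileobj: io.BytesIO) -> None:
        # Synchronous path: write the object's bytes into the caller's buffer.
        fileobj.write(self._objects[path])

    async def adownload_fileobj(self, path: str, fileobj: io.BytesIO) -> None:
        # Asynchronous path: yield to the event loop, as a network call would.
        await asyncio.sleep(0)
        fileobj.write(self._objects[path])


class ToyRawDataset:
    """Toy dataset (assumed design): one raw file per index, returned as bytes."""

    def __init__(self, downloader: InMemoryDownloader, paths: Iterable[str]) -> None:
        self.downloader = downloader
        self.paths = sorted(paths)  # stable ordering so indices are reproducible

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, index: int) -> bytes:
        # Single-item access uses the synchronous download.
        buffer = io.BytesIO()
        self.downloader.download_fileobj(self.paths[index], buffer)
        return buffer.getvalue()

    def __getitems__(self, indices: List[int]) -> List[bytes]:
        # Batched access fetches all requested files concurrently.
        return asyncio.run(self._fetch_many(indices))

    async def _fetch_many(self, indices: List[int]) -> List[bytes]:
        buffers = [io.BytesIO() for _ in indices]
        await asyncio.gather(
            *(
                self.downloader.adownload_fileobj(self.paths[i], buf)
                for i, buf in zip(indices, buffers)
            )
        )
        return [buf.getvalue() for buf in buffers]


objects = {"a.txt": b"alpha", "b.txt": b"beta"}
dataset = ToyRawDataset(InMemoryDownloader(objects), objects.keys())
first = dataset[0]
batch = dataset.__getitems__([1, 0])
```

The point of a batched `__getitems__` is that a whole batch of downloads can overlap instead of running one after another. Note that `asyncio.run` raises if an event loop is already running in the calling thread, so a real implementation would need a different strategy in that situation.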
Reviewed Changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/litdata/streaming/raw_dataset.py | Core implementation of StreamingRawDataset, FileIndexer, and CacheManager classes |
| src/litdata/streaming/downloader.py | Added download_fileobj and adownload_fileobj methods to all downloader classes |
| src/litdata/streaming/client.py | Enhanced S3Client to store session reference for obstore integration |
| src/litdata/constants.py | Added new requirement checks for obstore and asyncio |
| tests/streaming/test_raw_dataset.py | Comprehensive test suite for the new streaming raw dataset functionality |
| tests/streaming/test_downloader.py | Tests for new downloader methods |
| tests/conftest.py | Added obstore mock fixture for testing |
| requirements/test.txt | Added pytest-asyncio dependency for async test support |
| requirements/extras.txt | Added asyncio dependency |
| benchmarks/stream_raw_imagenet.py | Benchmark script for ImageNet streaming performance testing |
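The idea of a file index with compressed metadata caching, mentioned in the overview, can be sketched as follows. This is an illustration under stated assumptions rather than the PR's FileIndexer: the function name `load_or_build_index` and the cache layout are hypothetical, and `zlib` from the standard library stands in for ZSTD so the example runs without extra dependencies.

```python
import json
import zlib
from pathlib import Path
from typing import Callable, List


def load_or_build_index(
    cache_file: Path, list_remote_files: Callable[[], List[str]]
) -> List[str]:
    """Return the file listing, reusing a compressed on-disk cache when present.

    `list_remote_files` is a hypothetical callable that performs the slow
    remote listing (for example, walking a bucket prefix).
    """
    if cache_file.exists():
        # Cache hit: decompress and decode the stored listing.
        raw = zlib.decompress(cache_file.read_bytes())
        return json.loads(raw.decode("utf-8"))

    # Cache miss: list once, sort for a stable index order, then persist.
    files = sorted(list_remote_files())
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    payload = json.dumps(files).encode("utf-8")
    cache_file.write_bytes(zlib.compress(payload))  # zlib here, ZSTD in the PR
    return files
```

With this shape, only the first run pays for the remote listing; later runs read a small compressed file from local disk. A real indexer would also need some way to invalidate the cache when the remote contents change, which this sketch leaves out.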
Comments suppressed due to low confidence (1)
requirements/extras.txt:9
- The 'asyncio' module is part of Python's standard library since Python 3.4 and does not need to be installed as an external dependency. Remove this line as it will cause installation errors.
asyncio
…ingRawDataset and StreamingDataset
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
tchaton left a comment
Looks really neat!
What does this PR do?
Overview
This PR introduces a new StreamingRawDataset class designed for efficient streaming of raw files directly from cloud storage (e.g., S3, GCS).
Unlike optimized formats (e.g., litdata chunks), this class lets users stream raw datasets (images, text files, and so on) without prior preprocessing or data conversion. It supports scalable, on-the-fly access to data, which suits training or inference workflows that need immediate access to raw files.
Usage Example
ImageNet Benchmarks
Follow-ups
- StreamingRawDataset #666
- fsspec to obstore
- litdata/raw, expose StreamingRawDataset at top-level #671
- StreamingRawDataset Async Handling #661
- gcs_folders in resolve_dir: resolved by Add GCP support for directory resolution in resolve_dir #659
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃