
Conversation

@bhimrazy
Collaborator

@bhimrazy bhimrazy commented Jul 6, 2025

What does this PR do?

Overview

This PR introduces a new StreamingRawDataset class designed for efficient streaming of raw files directly from cloud storage (e.g., S3, GCS).

Unlike optimized formats (e.g., litdata chunks), this class lets users stream raw datasets (e.g., images, text files) without prior preprocessing or data conversion. It supports scalable, on-the-fly access to data, making it well suited for training or inference workflows that need immediate access to raw files.

Note: This is a beta feature and may be subject to minor changes in future updates based on feedback and pending follow-ups.

Usage Example

from litdata.streaming.raw_dataset import StreamingRawDataset
from torch.utils.data import DataLoader

# Initialize the streaming raw dataset from S3 path
dataset = StreamingRawDataset("s3://bucket/files/")  # also accepts a transform (see the example after this snippet)

# Access a single file (raw bytes)
first_sample = dataset[0]
print(f"Type of first sample: {type(first_sample)}")  # return bytes

# Access multiple files by indices using __getitems__
indices = [0, 1, 2]
samples = dataset.__getitems__(indices)  # returns a list of bytes objects
print(f"Retrieved {len(samples)} samples. Type of first sample: {type(samples[0])}")

# Use PyTorch DataLoader to batch iterate over the dataset
dataloader = DataLoader(dataset, batch_size=32)

for batch_idx, batch in enumerate(dataloader):
    print(f"Batch {batch_idx} size: {len(batch)}")
    # Each item in batch is raw bytes of a file
    # You can process batch here...

    # For demo, only process first batch
    break
    
 
# output
# Type of first sample: <class 'bytes'>
# Retrieved 3 samples. Type of first sample: <class 'bytes'>
# Batch 0 size: 32
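
The transform hook noted in the snippet above can be used to decode raw bytes on the fly. Below is a minimal sketch that turns image bytes into tensors, assuming the constructor accepts a per-sample transform callable that receives each file's raw bytes (the exact parameter name and signature may differ).

import io

from PIL import Image
from torch.utils.data import DataLoader
from torchvision import transforms

from litdata.streaming.raw_dataset import StreamingRawDataset

# Decode raw JPEG/PNG bytes into fixed-size tensors.
to_tensor = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def decode_image(data: bytes):
    # Each sample arrives as the file's raw bytes.
    return to_tensor(Image.open(io.BytesIO(data)).convert("RGB"))

dataset = StreamingRawDataset("s3://bucket/files/", transform=decode_image)
dataloader = DataLoader(dataset, batch_size=32)  # batches are now stacked image tensors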

ImageNet Benchmarks

# without transform
python benchmarks/stream_raw_imagenet.py --bytes --epoch 1 

# with transform
python benchmarks/stream_raw_imagenet.py --epoch 1 
[Benchmark results chart attached as an image]

Follow-ups

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@bhimrazy bhimrazy self-assigned this Jul 6, 2025
@bhimrazy bhimrazy marked this pull request as draft July 6, 2025 18:44
@bhimrazy bhimrazy requested a review from Copilot July 6, 2025 18:45


@codecov

codecov bot commented Jul 6, 2025

Codecov Report

❌ Patch coverage is 78.59649% with 61 lines in your changes missing coverage. Please review.
✅ Project coverage is 83%. Comparing base (aa467f9) to head (9c34bac).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@         Coverage Diff          @@
##           main   #652    +/-   ##
====================================
- Coverage    83%    83%    -0%     
====================================
  Files        49     50     +1     
  Lines      6812   7097   +285     
====================================
+ Hits       5686   5910   +224     
- Misses     1126   1187    +61     

@Borda Borda requested review from Borda, Copilot and tchaton July 25, 2025 19:12
Contributor

Copilot AI left a comment


Pull Request Overview

This PR introduces a new StreamingRawDataset class for efficient streaming of raw files directly from cloud storage (S3, GCS, Azure) without requiring preprocessing. The implementation supports scalable, on-the-fly access to raw datasets with features like fast indexing, local caching, and both synchronous and asynchronous file downloads.

Key changes include:

  • New StreamingRawDataset class with support for __getitem__ and __getitems__ methods
  • Enhanced downloader classes with download_fileobj and adownload_fileobj methods for direct file streaming
  • File indexing system with ZSTD compression for efficient metadata caching (see the sketch after this list)
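
For context, here is a minimal sketch of the kind of ZSTD-compressed index caching the last bullet refers to; the helper names, JSON layout, and cache file name are illustrative and not the actual FileIndexer/CacheManager API.

import json
import os

import zstandard as zstd

def save_index(files: list, cache_dir: str) -> str:
    """Serialize a list of file metadata entries to JSON and cache it ZSTD-compressed."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, "index.json.zstd")
    payload = json.dumps({"files": files}).encode("utf-8")
    with open(path, "wb") as f:
        f.write(zstd.ZstdCompressor().compress(payload))
    return path

def load_index(path: str) -> list:
    """Decompress and load a previously cached file index."""
    with open(path, "rb") as f:
        payload = zstd.ZstdDecompressor().decompress(f.read())
    return json.loads(payload)["files"]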

Reviewed Changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 4 comments.

Summary per file:
  • src/litdata/streaming/raw_dataset.py: Core implementation of the StreamingRawDataset, FileIndexer, and CacheManager classes
  • src/litdata/streaming/downloader.py: Added download_fileobj and adownload_fileobj methods to all downloader classes (see the sketch after this list)
  • src/litdata/streaming/client.py: Enhanced S3Client to store a session reference for obstore integration
  • src/litdata/constants.py: Added new requirement checks for obstore and asyncio
  • tests/streaming/test_raw_dataset.py: Comprehensive test suite for the new streaming raw dataset functionality
  • tests/streaming/test_downloader.py: Tests for the new downloader methods
  • tests/conftest.py: Added an obstore mock fixture for testing
  • requirements/test.txt: Added pytest-asyncio dependency for async test support
  • requirements/extras.txt: Added asyncio dependency
  • benchmarks/stream_raw_imagenet.py: Benchmark script for ImageNet streaming performance testing
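
As a rough illustration of the download_fileobj idea referenced above, the sketch below streams an S3 object straight into an in-memory buffer using plain boto3; the PR's actual downloader methods (including the async adownload_fileobj variant) live inside litdata and may be implemented differently, e.g. via obstore.

import io

import boto3  # assumed available; litdata's downloaders may use other clients (e.g. obstore)

def download_fileobj(remote_path: str) -> bytes:
    """Stream an S3 object directly into memory, skipping any on-disk copy."""
    bucket, _, key = remote_path.replace("s3://", "", 1).partition("/")
    buffer = io.BytesIO()
    boto3.client("s3").download_fileobj(bucket, key, buffer)
    return buffer.getvalue()

# Example (hypothetical object key):
# sample_bytes = download_fileobj("s3://bucket/files/sample.jpg")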
Comments suppressed due to low confidence (1)

requirements/extras.txt:9

  • The 'asyncio' module is part of Python's standard library since Python 3.4 and does not need to be installed as an external dependency. Remove this line as it will cause installation errors.
asyncio

Collaborator

@tchaton tchaton left a comment


Looks really neat!


Labels

enhancement (New feature or request)


4 participants