
Conversation

@bhimrazy
Collaborator

@bhimrazy bhimrazy commented Jul 6, 2025

What does this PR do?

Overview

This PR introduces a new StreamingRawDataset class designed for efficient streaming of raw files directly from cloud storage (e.g., S3, GCS).

Unlike optimized formats (e.g., litdata chunks), this class lets users stream raw datasets (e.g., images, text files) without prior preprocessing or data conversion. It supports scalable, on-the-fly access to data, making it well suited for training or inference workflows that need immediate access to raw files.

Note: This is a beta feature and may be subject to minor changes in future updates based on feedback and pending follow-ups.

Usage Example

from litdata.streaming.raw_dataset import StreamingRawDataset
from torch.utils.data import DataLoader

# Initialize the streaming raw dataset from S3 path
dataset = StreamingRawDataset("s3://bucket/files/")  # also accepts a transform (see the example after this snippet)

# Access a single file (raw bytes)
first_sample = dataset[0]
print(f"Type of first sample: {type(first_sample)}")  # return bytes

# Access multiple files by indices using __getitems__
indices = [0, 1, 2]
samples = dataset.__getitems__(indices)  # returns a list of bytes objects
print(f"Retrieved {len(samples)} samples. Type of first sample: {type(samples[0])}")

# Use PyTorch DataLoader to batch iterate over the dataset
dataloader = DataLoader(dataset, batch_size=32)

for batch_idx, batch in enumerate(dataloader):
    print(f"Batch {batch_idx} size: {len(batch)}")
    # Each item in batch is raw bytes of a file
    # You can process batch here...

    # For demo, only process first batch
    break
    
 
# output
# Type of first sample: <class 'bytes'>
# Retrieved 3 samples. Type of first sample: <class 'bytes'>
# Batch 0 size: 32
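
The transform hook noted in the snippet above can be used to decode raw bytes on the fly. Below is a minimal sketch that turns image bytes into tensors, assuming the constructor accepts a per-sample transform callable that receives each file's raw bytes (the exact parameter name and signature may differ).

import io

from PIL import Image
from torch.utils.data import DataLoader
from torchvision import transforms

from litdata.streaming.raw_dataset import StreamingRawDataset

# Decode raw JPEG/PNG bytes into fixed-size tensors.
to_tensor = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def decode_image(data: bytes):
    # Each sample arrives as the file's raw bytes.
    return to_tensor(Image.open(io.BytesIO(data)).convert("RGB"))

dataset = StreamingRawDataset("s3://bucket/files/", transform=decode_image)
dataloader = DataLoader(dataset, batch_size=32)  # batches are now stacked image tensors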

ImageNet Benchmarks

# without transform
python benchmarks/stream_raw_imagenet.py --bytes --epoch 1 

# with transform
python benchmarks/stream_raw_imagenet.py --epoch 1 
[Benchmark results chart attached as an image]

Follow-ups

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@bhimrazy bhimrazy self-assigned this Jul 6, 2025
@bhimrazy bhimrazy marked this pull request as draft July 6, 2025 18:44
@bhimrazy bhimrazy requested a review from Copilot July 6, 2025 18:45


@codecov

codecov bot commented Jul 6, 2025

Codecov Report

❌ Patch coverage is 78.59649% with 61 lines in your changes missing coverage. Please review.
✅ Project coverage is 83%. Comparing base (aa467f9) to head (9c34bac).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@         Coverage Diff          @@
##           main   #652    +/-   ##
====================================
- Coverage    83%    83%    -0%     
====================================
  Files        49     50     +1     
  Lines      6812   7097   +285     
====================================
+ Hits       5686   5910   +224     
- Misses     1126   1187    +61     

@Borda Borda requested review from Borda, Copilot and tchaton July 25, 2025 19:12
Contributor

Copilot AI left a comment


Pull Request Overview

This PR introduces a new StreamingRawDataset class for efficient streaming of raw files directly from cloud storage (S3, GCS, Azure) without requiring preprocessing. The implementation supports scalable, on-the-fly access to raw datasets with features like fast indexing, local caching, and both synchronous and asynchronous file downloads.

Key changes include:

  • New StreamingRawDataset class with support for __getitem__ and __getitems__ methods
  • Enhanced downloader classes with download_fileobj and adownload_fileobj methods for direct file streaming
  • File indexing system with ZSTD compression for efficient metadata caching (see the sketch after this list)
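
For context, here is a minimal sketch of the kind of ZSTD-compressed index caching the last bullet refers to; the helper names, JSON layout, and cache file name are illustrative and not the actual FileIndexer/CacheManager API.

import json
import os

import zstandard as zstd

def save_index(files: list, cache_dir: str) -> str:
    """Serialize a list of file metadata entries to JSON and cache it ZSTD-compressed."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, "index.json.zstd")
    payload = json.dumps({"files": files}).encode("utf-8")
    with open(path, "wb") as f:
        f.write(zstd.ZstdCompressor().compress(payload))
    return path

def load_index(path: str) -> list:
    """Decompress and load a previously cached file index."""
    with open(path, "rb") as f:
        payload = zstd.ZstdDecompressor().decompress(f.read())
    return json.loads(payload)["files"]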

Reviewed Changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 4 comments.

Summary per file:
  • src/litdata/streaming/raw_dataset.py: Core implementation of the StreamingRawDataset, FileIndexer, and CacheManager classes
  • src/litdata/streaming/downloader.py: Added download_fileobj and adownload_fileobj methods to all downloader classes (see the sketch after this list)
  • src/litdata/streaming/client.py: Enhanced S3Client to store a session reference for obstore integration
  • src/litdata/constants.py: Added new requirement checks for obstore and asyncio
  • tests/streaming/test_raw_dataset.py: Comprehensive test suite for the new streaming raw dataset functionality
  • tests/streaming/test_downloader.py: Tests for the new downloader methods
  • tests/conftest.py: Added an obstore mock fixture for testing
  • requirements/test.txt: Added pytest-asyncio dependency for async test support
  • requirements/extras.txt: Added asyncio dependency
  • benchmarks/stream_raw_imagenet.py: Benchmark script for ImageNet streaming performance testing
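
As a rough illustration of the download_fileobj idea referenced above, the sketch below streams an S3 object straight into an in-memory buffer using plain boto3; the PR's actual downloader methods (including the async adownload_fileobj variant) live inside litdata and may be implemented differently, e.g. via obstore.

import io

import boto3  # assumed available; litdata's downloaders may use other clients (e.g. obstore)

def download_fileobj(remote_path: str) -> bytes:
    """Stream an S3 object directly into memory, skipping any on-disk copy."""
    bucket, _, key = remote_path.replace("s3://", "", 1).partition("/")
    buffer = io.BytesIO()
    boto3.client("s3").download_fileobj(bucket, key, buffer)
    return buffer.getvalue()

# Example (hypothetical object key):
# sample_bytes = download_fileobj("s3://bucket/files/sample.jpg")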
Comments suppressed due to low confidence (1)

requirements/extras.txt:9

  • The 'asyncio' module is part of Python's standard library since Python 3.4 and does not need to be installed as an external dependency. Remove this line as it will cause installation errors.
asyncio

Collaborator

@tchaton tchaton left a comment


Looks really neat!


Labels

enhancement (New feature or request)


4 participants