Release litData v0.2.50: Fast Random Access & S3 Improvements 🧪⚡️ · Lightning-AI/litData

Lightning AI is excited to announce the release of litData v0.2.50, a lightweight and powerful streaming data library designed for fast AI model training.

This release improves the developer experience and performance of streamed datasets, with a particular focus on:

Faster random access support
Transform hooks for datasets
Better S3 interoperability
CI stability and performance improvements

👉 Check out the full changelog here: Compare v0.2.49...v0.2.50

🚀 Highlights

🔄 Fast Random Access (No Chunk Download Needed)

You can now access samples randomly from remote datasets without downloading entire chunks, dramatically reducing IO overhead during sparse reads.
This is especially useful for visualization tools or quickly inspecting your dataset without requiring full downloads.

🚀 Benchmark (on Lightning Studio, chunk size: 64MB)

10 random accesses:

🔹 v0.2.49: 20–22 seconds
🔹 v0.2.50: 5–6 seconds

The benchmark was designed to ensure enough separation between accesses, avoiding repeated reads from the same chunk.

Single item access:

🔹 v0.2.49: ~2 seconds
🔹 v0.2.50: ~0.83 seconds
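
To reproduce this kind of measurement on your own dataset, a small timing helper along the following lines may be useful. This is only a sketch: `spaced_indices` and `time_random_access` are hypothetical helper names (not part of litdata), and a plain Python list stands in for the dataset so the snippet runs anywhere. Evenly spaced indices are one simple way to keep accesses far apart, so that consecutive reads are unlikely to hit the same chunk.

```python
import time


def spaced_indices(num_items, num_accesses):
    """Return up to `num_accesses` evenly spaced indices in [0, num_items)."""
    if num_items <= 0 or num_accesses <= 0:
        return []
    step = max(num_items // num_accesses, 1)
    return list(range(0, num_items, step))[:num_accesses]


def time_random_access(dataset, indices):
    """Time `dataset[i]` for each index; return (total_seconds, per_access_seconds)."""
    per_access = []
    for i in indices:
        start = time.perf_counter()
        _ = dataset[i]  # the access being measured
        per_access.append(time.perf_counter() - start)
    return sum(per_access), per_access


# Stand-in dataset (assumption): any object supporting len() and indexing works here.
fake_dataset = list(range(1000))
indices = spaced_indices(len(fake_dataset), 10)
total, per_access = time_random_access(fake_dataset, indices)
print(indices)
print(f"{len(per_access)} accesses in {total:.6f}s")
```

Swapping `fake_dataset` for a real `StreamingDataset` should work as long as it supports `len()` and integer indexing, as the sample code in this section suggests.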

Sample code

import litdata as ld

uri = "gs://litdata-gcp-bucket/optimized_data"
ds = ld.StreamingDataset(uri, cache_dir="my_cache")

# Random access by index: watch `my_cache` while this runs; no chunks should be downloaded
for i in range(0, 1000, 100):
    print(i, ds[i])

# Full iteration: chunks are downloaded from this point on
for data in ds:
    print(data)

#631

🧩 Transform Support in StreamingDataset

You can now apply transforms to samples in StreamingDataset and CombinedStreamingDataset.

There are two supported ways to use it:

Pass a transform function when initializing the dataset:

from litdata import StreamingDataset
from torchvision import transforms

# Compose the preprocessing pipeline
torch_transform = transforms.Compose([
    transforms.Resize((256, 256)),       # Resize to 256x256
    transforms.ToTensor(),               # Convert to PyTorch tensor (C x H x W)
    transforms.Normalize(                # Normalize using ImageNet stats
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

def transform_fn(x, *args, **kwargs):
    """Apply the preprocessing pipeline to a single input image."""
    return torch_transform(x)

# data_dir, cache_dir, and shuffle are assumed to be defined by the caller
dataset = StreamingDataset(data_dir, cache_dir=str(cache_dir), shuffle=shuffle, transform=transform_fn)

Subclass and override the transform method:

class StreamingDatasetWithTransform(StreamingDataset):
    """A custom dataset class that inherits from StreamingDataset and applies a transform."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        self.torch_transform = transforms.Compose([
            transforms.Resize((256, 256)),       # Resize to 256x256
            transforms.ToTensor(),               # Convert to PyTorch tensor (C x H x W)
            transforms.Normalize(                # Normalize using ImageNet stats
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]
            )
        ])

    # Define your transform method
    def transform(self, x, *args, **kwargs):
        """A simple transform function."""
        return self.torch_transform(x)


dataset = StreamingDatasetWithTransform(data_dir, cache_dir=str(cache_dir), shuffle=shuffle)

This makes it easier to insert preprocessing logic directly into the streaming pipeline.
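
The two styles above can be illustrated without any dependencies. The following is a toy sketch of the hook pattern, not litdata's implementation: `ToyDataset` and `DoubledDataset` are hypothetical classes, and the choice to let a callable passed at construction take priority over the overridden method is an assumption made purely for this example.

```python
class ToyDataset:
    """Toy stand-in for a streaming dataset (NOT litdata's implementation)."""

    def __init__(self, items, transform=None):
        self.items = list(items)
        # In this sketch, a callable passed here takes priority over the method below.
        self._transform_fn = transform

    def transform(self, x):
        """Default hook: return the sample unchanged. Subclasses may override it."""
        return x

    def __len__(self):
        return len(self.items)

    def __getitem__(self, index):
        sample = self.items[index]
        if self._transform_fn is not None:
            return self._transform_fn(sample)
        return self.transform(sample)


class DoubledDataset(ToyDataset):
    """Overrides the hook instead of passing a callable."""

    def transform(self, x):
        return x * 2


via_argument = ToyDataset([1, 2, 3], transform=lambda x: x + 10)
via_subclass = DoubledDataset([1, 2, 3])
print([via_argument[i] for i in range(3)])  # [11, 12, 13]
print([via_subclass[i] for i in range(3)])  # [2, 4, 6]
```

Passing a callable keeps the dataset class untouched, while subclassing is convenient when the transform needs state set up in `__init__`.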

#618

📖 AWS S3 Streaming Docs (with `boto3` & `unsigned requests` Example)

The documentation now includes a clear example of how to stream datasets from AWS S3 using boto3, including support for unsigned requests. It also prioritizes boto3 in the list of options for better clarity.

from botocore import UNSIGNED
from botocore.config import Config
from litdata import StreamingDataset

storage_options = {
    "config": Config(
        retries={"max_attempts": 1000, "mode": "adaptive"},
        signature_version=UNSIGNED,  # send unsigned (anonymous) requests
    )
}

dataset = StreamingDataset(
    input_dir="s3://pl-flash-data/optimized_tiny_imagenet",
    storage_options=storage_options,
)

#628

📖 Batching Methods in `CombinedStreamingDataset`

The CombinedStreamingDataset supports two different batching methods through the batching_method parameter:

Stratified Batching (Default):
With batching_method="stratified" (the default), each batch contains samples from multiple datasets according to the specified weights:

# Default stratified batching - batches mix samples from all datasets
combined_dataset = CombinedStreamingDataset(
    datasets=[dataset1, dataset2], 
    batching_method="stratified"  # This is the default
)

Per-Stream Batching:
With batching_method="per_stream", each batch contains samples exclusively from a single dataset. This is useful when datasets have different shapes or structures:

# Per-stream batching - each batch contains samples from only one dataset
combined_dataset = CombinedStreamingDataset(
    datasets=[dataset1, dataset2], 
    batching_method="per_stream"
)
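
The difference between the two methods can be seen in a small, dependency-free sketch. This is a conceptual illustration only and does not reproduce litdata's sampling logic: `stratified_batches` and `per_stream_batches` are hypothetical functions, plain lists stand in for the datasets, and samples are drawn with replacement for simplicity.

```python
import random


def stratified_batches(streams, weights, batch_size, num_batches, seed=0):
    """Pick the source stream independently (by weight) for every sample in a batch."""
    rng = random.Random(seed)
    batches = []
    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            stream_id = rng.choices(range(len(streams)), weights=weights)[0]
            batch.append((stream_id, rng.choice(streams[stream_id])))
        batches.append(batch)
    return batches


def per_stream_batches(streams, weights, batch_size, num_batches, seed=0):
    """Pick ONE source stream (by weight) per batch and fill the whole batch from it."""
    rng = random.Random(seed)
    batches = []
    for _ in range(num_batches):
        stream_id = rng.choices(range(len(streams)), weights=weights)[0]
        batch = [(stream_id, rng.choice(streams[stream_id])) for _ in range(batch_size)]
        batches.append(batch)
    return batches


# Two toy "datasets" (assumption: lists of strings stand in for real streams).
streams = [["a0", "a1", "a2"], ["b0", "b1", "b2"]]
weights = [0.5, 0.5]
mixed = stratified_batches(streams, weights, batch_size=4, num_batches=3)
single = per_stream_batches(streams, weights, batch_size=4, num_batches=3)
for batch in single:
    # Each per-stream batch reports exactly one source stream id.
    print(sorted({stream_id for stream_id, _ in batch}))
```

Because every item in a per-stream batch comes from the same source, such batches can be collated even when the datasets yield samples of different shapes.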

#609

🐛 Bug Fixes

Fixed the tqdm progress bar breaking while optimizing a dataset.
(Before/after screenshots: v0.2.49 vs. v0.2.50)
#619

Suppressed multiple lightning-sdk warnings.
(Before/after screenshots: v0.2.49 vs. v0.2.50)
#633

Fixed FileNotFoundError in file locking for downloader and cache systems.
#615, #617

🧪 Testing & CI

Python 3.12 and 3.13 now supported in CI matrix
#589
Test durations now logged for debugging
#614
Added missing CI dependencies.
#634
Refactored large, slow tests to reduce CI runtime
#629, #632

📎 Minor Improvements

Updated bug report template for easier Lightning Studio reproduction
#611

📦 Dependency Updates

mosaicml-streaming: 0.8.1 → 0.11.0
#624
transformers: <4.50.0 → <4.53.0
#623
pytest: 8.3.* → 8.4.*
#625

🧑‍💻 Contributors

Thanks to everyone who contributed to this release!
Special thanks to @bhimrazy, @deependujha, @Borda, and @dependabot.

What's Changed

🕒 Add Test Duration Reporting to Pytest in CI by @bhimrazy in #614
Update bug report template with Lightning Studio sharing instructions by @bhimrazy in #611
docs: Add documentation for batching methods in CombinedStreamingDataset by @bhimrazy in #609
fix: suppress FileNotFoundError when acquiring file lock for count file by @bhimrazy in #615
chore: suppress FileNotFoundError for locks in downloader classes by @bhimrazy in #617
Add Dependabot for Pip & GitHub Actions by @Borda in #621
chore(deps): update pytest requirement from ==8.3.* to ==8.4.* by @dependabot in #625
chore(deps): bump mosaicml-streaming from 0.8.1 to 0.11.0 by @dependabot in #624
chore(deps): update transformers requirement from <4.50.0 to <4.53.0 by @dependabot in #623
chore(deps): bump the gha-updates group with 2 updates by @dependabot in #622
Feat: add transform support for StreamingDataset by @deependujha in #618
fix: breaking tqdm progress bar in optimizing dataset by @deependujha in #619
upd: Optimize test (test_dataset_for_text_tokens_with_large_num_chunks) to reduce time consumption by @bhimrazy in #629
docs: Update documentation for AWS S3 dataset streaming with boto3 (including unsigned requests) by @bhimrazy in #628
CI: update testing matrix for Python versions (3.12 & 3.13) by @bhimrazy in #589
fix: specify test dependencies required for CI by @deependujha in #634
fix: multiple lightning-sdk update warning by @deependujha in #633
Refactor dataset preparation fixture to avoid redundancy and limit test parametrization to reduce time by @bhimrazy in #632
feat: fast random access for streamingDataset without chunk downloading by @deependujha in #631
bump version 0.2.50 by @deependujha in #637

Full Changelog: v0.2.49...v0.2.50
