
litData v0.2.50: Fast Random Access & S3 Improvements πŸ§ͺ⚑️

@deependujha deependujha released this 27 Jun 15:04
· 87 commits to main since this release
af64e33

Lightning AI is excited to announce the release of litData v0.2.50, a lightweight and powerful streaming data library designed for fast AI model training.

This release focuses on improving the developer experience and performance for streamed datasets, with a particular focus on:

  • Faster random access support
  • Transform hooks for datasets
  • Better S3 interoperability
  • CI stability and performance improvements

πŸ‘‰ Check out the full changelog here: Compare v0.2.49...v0.2.50


πŸš€ Highlights

πŸ”„ Fast Random Access (No Chunk Download Needed)

You can now access samples randomly from remote datasets without downloading entire chunks, dramatically reducing I/O overhead during sparse reads.
This is especially useful for visualization tools or quickly inspecting your dataset without requiring full downloads.
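Conceptually, this works by consulting the chunk index and reading only the byte range of the requested sample, instead of fetching the whole chunk. A minimal sketch of the idea (hypothetical index layout, not litData's actual on-disk format):

```python
import io

# Hypothetical chunk: three serialized samples packed back to back
samples = [b"sample-0", b"sample-one", b"s2"]
chunk = b"".join(samples)

# Index mapping sample id -> (offset, length), built at optimization time
index, offset = {}, 0
for i, s in enumerate(samples):
    index[i] = (offset, len(s))
    offset += len(s)

def read_sample(remote, i):
    """Fetch only the bytes of sample i (think: a ranged GET to object storage)."""
    off, length = index[i]
    remote.seek(off)
    return remote.read(length)

print(read_sample(io.BytesIO(chunk), 1))  # b'sample-one' -- 10 bytes, not the whole chunk
```

With real object storage, the `seek`/`read` pair corresponds to a ranged request, so sparse index access never pulls full chunks into the cache.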

πŸš€ Benchmark (on Lightning Studio, chunk size: 64MB)

10 random accesses:

  • πŸ”Ή v0.2.49: 20–22 seconds
  • πŸ”Ή v0.2.50: 5–6 seconds

The benchmark was designed to ensure enough separation between accesses, avoiding repeated reads from the same chunk.

Single item access:

  • πŸ”Ή v0.2.49: ~2 seconds
  • πŸ”Ή v0.2.50: ~0.83 seconds

Sample code

import litdata as ld

uri = "gs://litdata-gcp-bucket/optimized_data"
ds = ld.StreamingDataset(uri, cache_dir="my_cache")

# When accessing by index, check `my_cache`: no chunks should be downloaded
for i in range(0, 1000, 100):
    print(i, ds[i])

# Sequential iteration still downloads chunks as usual
for data in ds:
    print(data)

#631


🧩 Transform Support in StreamingDataset

You can now apply transforms to samples in StreamingDataset and CombinedStreamingDataset.

There are two supported ways to use it:

  1. Pass a transform function when initializing the dataset:
from litdata import StreamingDataset
from torchvision import transforms

# Define a simple transform pipeline
torch_transform = transforms.Compose([
    transforms.Resize((256, 256)),       # Resize to 256x256
    transforms.ToTensor(),               # Convert to PyTorch tensor (C x H x W)
    transforms.Normalize(                # Normalize using ImageNet stats
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

def transform_fn(x, *args, **kwargs):
    """Apply the transform to each sample."""
    return torch_transform(x)

# Create the dataset with the transform attached
dataset = StreamingDataset(data_dir, cache_dir=str(cache_dir), shuffle=shuffle, transform=transform_fn)
  2. Subclass and override the transform method:
class StreamingDatasetWithTransform(StreamingDataset):
    """A custom dataset that inherits from StreamingDataset and applies a transform."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.torch_transform = transforms.Compose([
            transforms.Resize((256, 256)),       # Resize to 256x256
            transforms.ToTensor(),               # Convert to PyTorch tensor (C x H x W)
            transforms.Normalize(                # Normalize using ImageNet stats
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225],
            ),
        ])

    # Override the transform hook
    def transform(self, x, *args, **kwargs):
        """Apply the transform to each sample."""
        return self.torch_transform(x)


dataset = StreamingDatasetWithTransform(data_dir, cache_dir=str(cache_dir), shuffle=shuffle)

This makes it easier to insert preprocessing logic directly into the streaming pipeline.
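Under the hood, a transform hook amounts to calling the transform on each sample as it is read. A pure-Python sketch of the pattern (a simplified stand-in, not litData's actual implementation):

```python
class TransformingDataset:
    """Minimal stand-in showing how a transform hook wraps item access."""

    def __init__(self, items, transform=None):
        self.items = items
        # Fall back to the overridable identity method when no function is passed
        self.transform = transform or self._identity

    @staticmethod
    def _identity(x):
        return x

    def __getitem__(self, idx):
        # The transform runs on every sample as it is read
        return self.transform(self.items[idx])


ds = TransformingDataset([1, 2, 3], transform=lambda x: x * 10)
print(ds[1])  # 20
```

Both styles above reduce to this: either the constructor receives the callable, or a subclass supplies it as a method.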

#618


πŸ“– AWS S3 Streaming Docs (with boto3 & unsigned requests Example)

The documentation now includes a clear example of how to stream datasets from AWS S3 using boto3, including support for unsigned requests. It also prioritizes boto3 in the list of options for better clarity.

from botocore import UNSIGNED
from botocore.config import Config
from litdata import StreamingDataset

storage_options = {
    "config": Config(
        retries={"max_attempts": 1000, "mode": "adaptive"},
        signature_version=UNSIGNED,
    )
}

dataset = StreamingDataset(
    input_dir="s3://pl-flash-data/optimized_tiny_imagenet",
    storage_options=storage_options,
)

#628


πŸ“– Batching Methods in CombinedStreamingDataset

The CombinedStreamingDataset supports two different batching methods through the batching_method parameter:

Stratified Batching (Default):
With batching_method="stratified" (the default), each batch contains samples from multiple datasets according to the specified weights:

# Default stratified batching - batches mix samples from all datasets
combined_dataset = CombinedStreamingDataset(
    datasets=[dataset1, dataset2], 
    batching_method="stratified"  # This is the default
)

Per-Stream Batching:
With batching_method="per_stream", each batch contains samples exclusively from a single dataset. This is useful when datasets have different shapes or structures:

# Per-stream batching - each batch contains samples from only one dataset
combined_dataset = CombinedStreamingDataset(
    datasets=[dataset1, dataset2], 
    batching_method="per_stream"
)
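The difference between the two methods can be illustrated in plain Python: stratified batching draws each element's source stream from a weighted choice, while per-stream batching picks one stream per batch (a hypothetical simulation, not the library's actual scheduler):

```python
import random

random.seed(0)
# Two toy streams with disjoint value ranges so provenance is obvious
streams = {"a": list(range(100)), "b": list(range(100, 200))}
weights = [0.5, 0.5]

def stratified_batch(size=4):
    # Each element is drawn independently from a weighted choice of streams
    names = random.choices(list(streams), weights=weights, k=size)
    return [streams[n].pop(0) for n in names]

def per_stream_batch(size=4):
    # One stream is chosen per batch; every element comes from it
    name = random.choices(list(streams), weights=weights, k=1)[0]
    return [streams[name].pop(0) for _ in range(size)]

print(stratified_batch())   # may mix items from "a" and "b"
print(per_stream_batch())   # all items share one source stream
```

Per-stream batching avoids collation errors when the two datasets yield tensors of different shapes, since a batch never mixes sources.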

#609


πŸ› Bug Fixes

  • Fixed the breaking tqdm progress bar when optimizing datasets


    #619


  • Suppressed multiple lightning-sdk warnings.


    #633


  • Fixed FileNotFoundError in file locking for downloader and cache systems.
    #615, #617

πŸ§ͺ Testing & CI

  • Python 3.12 and 3.13 now supported in CI matrix
    #589
  • Test durations now logged for debugging
    #614
  • Added missing CI dependencies.
    #634
  • Refactored large, slow tests to reduce CI runtime
    #629, #632

πŸ“Ž Minor Improvements

  • Updated bug report template for easier Lightning Studio reproduction
    #611

πŸ“¦ Dependency Updates

  • mosaicml-streaming: 0.8.1 β†’ 0.11.0
    #624
  • transformers: <4.50.0 β†’ <4.53.0
    #623
  • pytest: 8.3.* β†’ 8.4.*
    #625

πŸ§‘β€πŸ’» Contributors

Thanks to everyone who contributed to this release!
Special thanks to @bhimrazy, @deependujha, @Borda, and @dependabot.


What's Changed

  • πŸ•’ Add Test Duration Reporting to Pytest in CI by @bhimrazy in #614
  • Update bug report template with Lightning Studio sharing instructions by @bhimrazy in #611
  • docs: Add documentation for batching methods in CombinedStreamingDataset by @bhimrazy in #609
  • fix: suppress FileNotFoundError when acquiring file lock for count file by @bhimrazy in #615
  • chore: suppress FileNotFoundError for locks in downloader classes by @bhimrazy in #617
  • Add Dependabot for Pip & GitHub Actions by @Borda in #621
  • chore(deps): update pytest requirement from ==8.3.* to ==8.4.* by @dependabot in #625
  • chore(deps): bump mosaicml-streaming from 0.8.1 to 0.11.0 by @dependabot in #624
  • chore(deps): update transformers requirement from <4.50.0 to <4.53.0 by @dependabot in #623
  • chore(deps): bump the gha-updates group with 2 updates by @dependabot in #622
  • Feat: add transform support for StreamingDataset by @deependujha in #618
  • fix: breaking tqdm progress bar in optimizing dataset by @deependujha in #619
  • upd: Optimize test (test_dataset_for_text_tokens_with_large_num_chunks) to reduce time consumption by @bhimrazy in #629
  • docs: Update documentation for AWS S3 dataset streaming with boto3 (including unsigned requests) by @bhimrazy in #628
  • CI: update testing matrix for Python versions (3.12 & 3.13) by @bhimrazy in #589
  • fix: specify test dependencies required for CI by @deependujha in #634
  • fix: multiple lightning-sdk update warning by @deependujha in #633
  • Refactor dataset preparation fixture to avoid redundancy and limit test parametrization to reduce time by @bhimrazy in #632
  • feat: fast random access for streamingDataset without chunk downloading by @deependujha in #631
  • bump version 0.2.50 by @deependujha in #637

Full Changelog: v0.2.49...v0.2.50