feat: fast random access for streamingDataset without chunk downloading #631

deependujha · 2025-06-17T11:19:19Z

Before submitting

Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
Did you read the contributor guideline, Pull Request section?
Did you make sure to update the docs?
Did you write any new necessary tests?

What does this PR do?

Fixes #14

Supports AWS & GCloud.
This PR only adds support for PyTreeLoader; encrypted and compressed datasets are not supported.

Visualize

import litdata as ld

uri = "gs://deependu-gcp-first-bucket/simple_data"
ds = ld.StreamingDataset(uri, cache_dir="my_cache")

# Random access: while this loop runs, check `my_cache`; it shouldn't download chunks.
for i in range(0, 1000, 100):
    print(i, ds[i])

# Sequential iteration: it should download chunks now.
for data in ds:
    print(data)
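
One way to check the claim in the comments above is to count the chunk files in the cache directory before and after each loop. This is only a minimal sketch: the helper name `count_chunk_files` is hypothetical, and the `.bin` suffix for chunk files is an assumption, not something stated in this PR.

```python
from pathlib import Path


def count_chunk_files(cache_dir: str, suffix: str = ".bin") -> int:
    """Count files under cache_dir whose name ends with suffix.

    The ".bin" default is an assumed chunk-file suffix; adjust as needed.
    """
    root = Path(cache_dir)
    if not root.exists():
        return 0
    return sum(1 for p in root.rglob(f"*{suffix}") if p.is_file())
```

Calling `count_chunk_files("my_cache")` after the indexed loop and again after the full iteration should show the count rising only in the second case, if the feature behaves as described.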

Benchmark

For the fast-random-access PR, chunk size: 64 MB (run on Lightning Studio):

main branch: 20-22 seconds for 10 random accesses
fast-random-access: 5-6 seconds for 10 random accesses

Note: I spaced the random indices far enough apart that the accesses do not all fall in the same chunk.

For a single item

0.83 seconds vs. 2 seconds
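
A measurement like the one above could be reproduced with a small timing harness. The sketch below is not the script used for these numbers; `time_random_access` is a hypothetical helper that works on any object supporting `len()` and integer indexing, and it spreads the indices across the dataset so they are unlikely to share a chunk.

```python
import random
import time
from typing import Any, Sequence


def time_random_access(dataset: Sequence[Any], num_samples: int = 10, seed: int = 0) -> float:
    """Return the seconds taken to read num_samples widely spaced indices."""
    if num_samples <= 0 or len(dataset) == 0:
        return 0.0
    # Evenly spaced indices, so accesses land in different regions of the dataset.
    stride = max(1, len(dataset) // num_samples)
    indices = list(range(0, len(dataset), stride))[:num_samples]
    # Shuffle so the access order is not sequential.
    random.Random(seed).shuffle(indices)
    start = time.perf_counter()
    for i in indices:
        _ = dataset[i]
    return time.perf_counter() - start
```

Running it once on each branch with the same dataset and seed gives comparable numbers; a cold cache directory is needed for a fair comparison.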

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

codecov · 2025-06-17T12:15:24Z

Codecov Report

Attention: Patch coverage is 61.38614% with 39 lines in your changes missing coverage. Please review.

Project coverage is 83%. Comparing base (82bf020) to head (29fefcd).
Report is 1 commit behind head on main.

Additional details and impacted files

@@         Coverage Diff         @@
##           main   #631   +/-   ##
===================================
- Coverage    83%    83%   -0%     
===================================
  Files        43     43           
  Lines      6611   6696   +85     
===================================
+ Hits       5507   5553   +46     
- Misses     1104   1143   +39



Copilot

Pull Request Overview

Adds fast random access for StreamingDataset by introducing a no_store flag and byte-range fetching to avoid downloading entire chunks.

Introduces no_store in BinaryReader, Cache, and StreamingDataset to conditionally fetch only requested item bytes.
Implements download_bytes support in S3Downloader and GCPDownloader, and adds read_item_bytes in BinaryReader.
Adds tests for reader.read_item_bytes and config.download_chunk_bytes_from_index.
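
The byte-range idea in this overview can be illustrated on a local file. The sketch below is not the PR's implementation: the chunk layout (an item count, then a table of absolute offsets, then the payloads) is a simplified assumption, and `write_toy_chunk`, `read_range`, and this `read_item_bytes` are hypothetical names. Against object storage, `read_range` would correspond to a ranged GET request instead of a file seek.

```python
import struct
from typing import List


def write_toy_chunk(path: str, items: List[bytes]) -> None:
    """Write an assumed layout: uint32 count, (count + 1) uint32 offsets, payloads."""
    header_size = 4 + 4 * (len(items) + 1)
    offsets = [header_size]
    for item in items:
        offsets.append(offsets[-1] + len(item))
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(items)))
        f.write(struct.pack(f"<{len(offsets)}I", *offsets))
        for item in items:
            f.write(item)


def read_range(path: str, start: int, length: int) -> bytes:
    """Stand-in for a ranged download: read `length` bytes starting at `start`."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(length)


def read_item_bytes(path: str, index: int) -> bytes:
    """Fetch one item with two small reads instead of loading the whole chunk."""
    # First read: the item's begin and end offsets (two consecutive uint32 values).
    begin, end = struct.unpack("<2I", read_range(path, 4 + 4 * index, 8))
    # Second read: only the bytes belonging to this item.
    return read_range(path, begin, end - begin)
```

The cost per item is then two small reads whose size is independent of the chunk size, which is consistent with the speedup reported for 64 MB chunks.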

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

Show a summary per file

| File | Description |
| --- | --- |
| tests/streaming/test_reader.py | Added `test_reader_read_bytes` to validate per-item byte reads. |
| tests/streaming/test_config.py | Added `test_config_download_chunk_bytes` for config byte reads. |
| src/litdata/streaming/reader.py | Added `no_store`, `read_item_bytes`, and refactored `read`. |
| src/litdata/streaming/item_loader.py | Added default and concrete `load_item_from_bytes` methods. |
| src/litdata/streaming/downloader.py | Added base `download_bytes`, S3 and GCP implementations. |
| src/litdata/streaming/dataset.py | Propagated `no_store` and added slice support in `__getitem__`. |
| src/litdata/streaming/config.py | Added `download_chunk_bytes_from_index` for local/remote files. |
| src/litdata/streaming/cache.py | Propagated `no_store` through `Cache`. |
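
The slice support listed for `__getitem__` can be sketched in a generic way. This is an illustration on a hypothetical `ToyDataset`, not the code added to `StreamingDataset`; it relies on the standard `slice.indices` method to resolve start, stop, and step against the dataset length.

```python
from typing import Any, List, Union


class ToyDataset:
    """Hypothetical in-memory dataset used only to illustrate slice handling."""

    def __init__(self, items: List[Any]) -> None:
        self._items = items

    def __len__(self) -> int:
        return len(self._items)

    def _load(self, index: int) -> Any:
        # In a streaming dataset this is where a single item would be fetched.
        return self._items[index]

    def __getitem__(self, index: Union[int, slice]) -> Any:
        if isinstance(index, slice):
            # slice.indices clamps start/stop/step to the dataset length.
            return [self._load(i) for i in range(*index.indices(len(self)))]
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError(index)
        return self._load(index)
```

With this pattern, `ds[1:4]` turns into three single-item loads, so each element of a slice can still be served by a per-item byte fetch.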

Comments suppressed due to low confidence (2)

tests/streaming/test_reader.py:7

The test uses os.path.join but os is not imported. Please add import os at the top of this file.

import pytest

src/litdata/streaming/item_loader.py:110

There are two load_item_from_bytes definitions in this class (one stub and one concrete). Consolidate into a single method to avoid unexpected overrides.

    def load_item_from_bytes(
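
The concern about duplicate definitions follows from how Python builds a class body: a second `def` with the same name rebinds the attribute, so the earlier definition is silently discarded. A minimal illustration on a hypothetical `Loader` class (not the PR's code):

```python
class Loader:
    def load_item_from_bytes(self, raw: bytes) -> bytes:
        # Stub defined first.
        raise NotImplementedError("stub")

    def load_item_from_bytes(self, raw: bytes) -> bytes:  # noqa: F811
        # Defined second with the same name: this one replaces the stub.
        return raw.upper()


# Only the second definition survives in the class namespace.
result = Loader().load_item_from_bytes(b"ab")
```

Linters typically flag this as a redefinition, which is why consolidating into a single method is the safer choice.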


bhimrazy

Looking good so far!

I’ve added a few comments and questions.
It would also be helpful to include benchmarks comparing a few individual item loads with and without on_demand_bytes in the description.


tchaton

really neat !

fast random access for s3 works

1fac781

deependujha requested review from justusschock, lantiga and tchaton as code owners June 17, 2025 11:19

deependujha added 4 commits June 17, 2025 17:06

no_chunk_download supports gcloud

d2d7588

update

fde6516

update

bd55a5c

update

6742150

Merge branch 'main' into feat/fast-random-access

8081927

deependujha requested a review from Copilot June 18, 2025 07:02

tchaton reviewed Jun 18, 2025

src/litdata/streaming/cache.py (outdated)

bhimrazy reviewed Jun 18, 2025

src/litdata/streaming/reader.py (outdated)

deependujha added 2 commits June 18, 2025 16:11

nitpick

ad0ceae

tests

08d5da1

deependujha requested a review from Copilot June 18, 2025 11:05


deependujha added 5 commits June 18, 2025 16:57

update

3cda9bb

update

d7aede1

update

08c90cc

update

0359e42

update

ff8b9cd

deependujha requested a review from Copilot June 24, 2025 12:46

Copilot AI reviewed Jun 24, 2025


Borda reviewed Jun 24, 2025

src/litdata/streaming/cache.py (outdated)

deependujha added 2 commits June 24, 2025 23:57

rename no_store to on_demand_bytes

2dd6e28

Merge branch 'main' into feat/fast-random-access

1f7ab0c

deependujha requested a review from bhimrazy June 24, 2025 21:04

bhimrazy reviewed Jun 25, 2025

src/litdata/streaming/config.py

src/litdata/streaming/dataset.py

src/litdata/streaming/dataset.py

src/litdata/streaming/downloader.py (outdated)

tchaton approved these changes Jun 25, 2025


deependujha added 3 commits June 26, 2025 00:53

Merge branch 'main' into feat/fast-random-access

0cf5a15

make sure client exists

9297fc8

fallback for download_bytes method

29fefcd

deependujha merged commit f03ddb7 into Lightning-AI:main Jun 27, 2025
35 checks passed

deependujha deleted the feat/fast-random-access branch June 27, 2025 09:34
