@deependujha deependujha commented Jun 17, 2025

Before submitting
  • Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #14

  • Supports AWS and GCloud.
  • This PR only adds support for PyTreeLoader; encrypted and compressed datasets are not supported.

Visualize

import litdata as ld

uri = "gs://deependu-gcp-first-bucket/simple_data"
ds = ld.StreamingDataset(uri, cache_dir="my_cache")

# While indexing individual items, watch `my_cache`: no chunks should be downloaded.
for i in range(0, 1000, 100):
    print(i, ds[i])

# Full iteration should download chunks now.
for data in ds:
    print(data)

Benchmark

For the fast-random-access PR, with a chunk size of 64 MB (run on a Lightning Studio):

  • main branch: 20-22 seconds for 10 random accesses
  • fast-random-access: 5-6 seconds for 10 random accesses

Note: I spaced the random-access indices far enough apart that they do not all fall in the same chunk.


  • For a single item: 0.83 seconds vs. 2 seconds
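
A rough way to reproduce this kind of measurement is sketched below. `time_random_access` is a hypothetical helper, not part of litdata, and a plain list stands in for the dataset, so the printed number says nothing about real chunk downloads. The index spacing mirrors the `range(0, 1000, 100)` used in the snippet above.

```python
import time


def time_random_access(dataset, indices):
    """Return the wall-clock seconds spent fetching `dataset[i]` for each index."""
    start = time.perf_counter()
    for i in indices:
        _ = dataset[i]  # for a streaming dataset, this is where the fetch would happen
    return time.perf_counter() - start


# Stand-in dataset (assumption): swap in a real StreamingDataset to benchmark it.
dataset = list(range(1000))

# Ten indices spaced 100 apart, so they are unlikely to share a chunk.
indices = list(range(0, 1000, 100))

elapsed = time_random_access(dataset, indices)
print(f"{len(indices)} random accesses took {elapsed:.6f} s")
```

Running the same helper on both branches, with an empty cache directory each time, should give comparable numbers.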

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

codecov bot commented Jun 17, 2025

Codecov Report

Attention: Patch coverage is 61.38614% with 39 lines in your changes missing coverage. Please review.

Project coverage is 83%. Comparing base (82bf020) to head (29fefcd).
Report is 1 commit behind head on main.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #631   +/-   ##
===================================
- Coverage    83%    83%   -0%     
===================================
  Files        43     43           
  Lines      6611   6696   +85     
===================================
+ Hits       5507   5553   +46     
- Misses     1104   1143   +39     
@deependujha deependujha requested a review from Copilot June 18, 2025 07:02

@deependujha deependujha requested a review from Copilot June 18, 2025 11:05

@deependujha deependujha requested a review from Copilot June 24, 2025 12:46

Copilot AI left a comment

Pull Request Overview

Adds fast random access for StreamingDataset by introducing a no_store flag and byte-range fetching to avoid downloading entire chunks.

  • Introduces no_store in BinaryReader, Cache, and StreamingDataset to conditionally fetch only requested item bytes.
  • Implements download_bytes support in S3Downloader and GCPDownloader, and adds read_item_bytes in BinaryReader.
  • Adds tests for reader.read_item_bytes and config.download_chunk_bytes_from_index.
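
The byte-range idea in this overview can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the code in this PR: the chunk layout (an item count followed by an offset table) is invented for the example, `write_chunk`, `read_range`, and `read_item_bytes` here are hypothetical helpers, and a local file read with `seek` stands in for a ranged request against S3 or GCS.

```python
import os
import struct
import tempfile


def write_chunk(path, items):
    """Write a toy chunk: uint32 item count, (count + 1) uint32 offsets, then payloads."""
    header_size = 4 + 4 * (len(items) + 1)
    offsets = [header_size]
    for item in items:
        offsets.append(offsets[-1] + len(item))
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(items)))
        f.write(struct.pack(f"<{len(offsets)}I", *offsets))
        for item in items:
            f.write(item)


def read_range(path, start, end):
    """Stand-in for a remote ranged read: return only the bytes in [start, end)."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)


def read_item_bytes(path, index):
    """Fetch one item with two small ranged reads instead of loading the whole chunk."""
    # The begin and end offsets of item `index` are adjacent entries in the offset table.
    table_pos = 4 + 4 * index
    begin, end = struct.unpack("<2I", read_range(path, table_pos, table_pos + 8))
    return read_range(path, begin, end)


items = [b"alpha", b"bravo-bravo", b"charlie"]
with tempfile.TemporaryDirectory() as tmp:
    chunk_path = os.path.join(tmp, "chunk-0.bin")
    write_chunk(chunk_path, items)
    second = read_item_bytes(chunk_path, 1)

print(second)  # b'bravo-bravo'
```

The point of the design is that the cost of a single lookup scales with the item size rather than the chunk size, which matters with 64 MB chunks.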

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

Summary per file

| File | Description |
| --- | --- |
| tests/streaming/test_reader.py | Added test_reader_read_bytes to validate per-item byte reads. |
| tests/streaming/test_config.py | Added test_config_download_chunk_bytes for config byte reads. |
| src/litdata/streaming/reader.py | Added no_store, read_item_bytes, and refactored read. |
| src/litdata/streaming/item_loader.py | Added default and concrete load_item_from_bytes methods. |
| src/litdata/streaming/downloader.py | Added base download_bytes, S3 and GCP implementations. |
| src/litdata/streaming/dataset.py | Propagated no_store and added slice support in __getitem__. |
| src/litdata/streaming/config.py | Added download_chunk_bytes_from_index for local/remote files. |
| src/litdata/streaming/cache.py | Propagated no_store through Cache. |

Comments suppressed due to low confidence (2)

tests/streaming/test_reader.py:7

  • The test uses os.path.join but os is not imported. Please add import os at the top of this file.
import pytest

src/litdata/streaming/item_loader.py:110

  • There are two load_item_from_bytes definitions in this class (one stub and one concrete). Consolidate into a single method to avoid unexpected overrides.
    def load_item_from_bytes(
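
For context on why a duplicate definition would matter: if two methods with the same name really do sit in the same class body, Python keeps only the later one, without any warning. The toy class below (hypothetical, unrelated to the actual item_loader.py code) shows the effect. It does not apply when the stub lives in a base class and the concrete method in a subclass, which is ordinary overriding.

```python
class ToyLoader:
    """Toy class (assumption) with the same method name defined twice in one body."""

    def load_item_from_bytes(self, raw):
        # Stub definition: this is silently replaced by the definition below.
        raise NotImplementedError

    def load_item_from_bytes(self, raw):  # noqa: F811 (intentional redefinition)
        # Concrete definition: the class attribute now points here.
        return raw.decode("utf-8")


loader = ToyLoader()
result = loader.load_item_from_bytes(b"hello")
print(result)  # hello
```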

@deependujha deependujha requested a review from bhimrazy June 24, 2025 21:04

@bhimrazy bhimrazy left a comment

Looking good so far!

I’ve added a few comments and questions.
It would also be helpful to include benchmarks comparing a few individual item loads with and without on_demand_bytes in the description.

@tchaton tchaton left a comment

Really neat!

@deependujha deependujha merged commit f03ddb7 into Lightning-AI:main Jun 27, 2025
35 checks passed
@deependujha deependujha deleted the feat/fast-random-access branch June 27, 2025 09:34
Development

Successfully merging this pull request may close these issues.

Fast random access for StreamingDataset

4 participants