chore: Add Benchmark Scripts and Performance Comparison of LitData vs FFCV for Streaming ImageNet #572

bhimrazy · 2025-04-27T21:38:08Z

What does this PR do?

Introduces benchmark scripts for streaming ImageNet using LitData and FFCV
Provides a performance comparison between the two frameworks

Benchmarks for LitData vs FFCV

Speed to stream ImageNet (1.2M images) from local disk with FFCV vs LitData:

Framework	Dataset Mode	Dataset Size @ 256px	Images / sec 1st Epoch (float32)	Images / sec 2nd Epoch (float32)
LitData	PIL RAW	168 GB	6647	6398
LitData	JPEG 90%	12 GB	6553	6537
FFCV (os_cache=True)	RAW	170 GB	7264	6699
FFCV (os_cache=False)	RAW	170 GB	7557	8170
FFCV (os_cache=True)	JPEG 90%	20 GB	7654	8052
FFCV (os_cache=False)	JPEG 90%	20 GB	8150	8607
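
To read the table in relative terms, each FFCV figure can be divided by the LitData figure for the same dataset mode. The sketch below does this for the second epoch, using the unrounded values from the [RESULT] lines in the benchmark logs. The dictionary layout and the `speedup_over_litdata` helper are illustrative and not part of the PR.

```python
# Second-epoch throughput (images/sec), copied from the [RESULT] log lines.
second_epoch = {
    ("LitData", "RAW"): 6398.02,
    ("LitData", "JPEG 90%"): 6536.99,
    ("FFCV os_cache=True", "RAW"): 6698.55,
    ("FFCV os_cache=False", "RAW"): 8169.70,
    ("FFCV os_cache=True", "JPEG 90%"): 8051.64,
    ("FFCV os_cache=False", "JPEG 90%"): 8607.16,
}


def speedup_over_litdata(results):
    """Divide each FFCV throughput by the LitData throughput for the same mode."""
    ratios = {}
    for (framework, mode), rate in results.items():
        if framework.startswith("FFCV"):
            ratios[(framework, mode)] = rate / results[("LitData", mode)]
    return ratios


for (framework, mode), ratio in speedup_over_litdata(second_epoch).items():
    print(f"{framework:<20} {mode:<9} {ratio:.2f}x")
```

These ratios describe this single run on one machine only; the logs show run-to-run variation of a few percent between epochs.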

Benchmark Logs

LitData | JPEG 90% | 12 GB

⚡~/litData/benchmarks aws s3 cp --recursive s3://grid-cloud-litng-ai-03/projects/01j12bk075bs7e771jhmgvz7eg/datasets/imagenet-1m-litdata/train_256_jpg_90 data/imagenet-1m-litdata/train_256_jpg_90

⚡~/litData/benchmarks python litdata/stream_imagenet.py --input_dir data/imagenet-1m-litdata/train_256_jpg_90 
[INFO] Running streaming benchmark with arguments: Namespace(input_dir='data/imagenet-1m-litdata/train_256_jpg_90', cache_dir='/cache/chunks', dtype='float32', batch_size=256, num_workers=32, drop_last=False, epochs=2, max_cache_size='200GB', use_pil=False, clear_cache=True)
Seed set to 42
[INFO] Clearing cache directory: /cache/chunks
[INFO] Initializing streaming dataset from: data/imagenet-1m-litdata/train_256_jpg_90
[INFO] Starting benchmark for 2 epoch(s) with batch size 256 and 32 workers.
Epoch 1/2: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5005/5005 [03:15<00:00, 25.60it/s]
[RESULT] Epoch 1: Streamed 1281167 samples in 195.51s (6552.99 images/sec)
Epoch 2/2: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5005/5005 [03:15<00:00, 25.54it/s]
[RESULT] Epoch 2: Streamed 1281167 samples in 195.99s (6536.99 images/sec)
[INFO] Clearing cache directory after benchmark: /cache/chunks
[INFO] Finished streaming benchmark.
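
The numbers in this log can be cross-checked with simple arithmetic: 1281167 samples at batch size 256 with drop_last=False give the 5005 iterations shown in the progress bar, and samples divided by elapsed seconds gives the reported rate. A small sketch (the helper names are illustrative, not taken from the benchmark script):

```python
import math

NUM_SAMPLES = 1281167  # samples per epoch, from the [RESULT] lines
BATCH_SIZE = 256       # from the logged arguments


def num_batches(num_samples, batch_size, drop_last=False):
    """Batches per epoch; the last partial batch is kept unless drop_last is set."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


def images_per_sec(num_samples, seconds):
    """Throughput as reported in the [RESULT] lines."""
    return num_samples / seconds


print(num_batches(NUM_SAMPLES, BATCH_SIZE))
print(round(images_per_sec(NUM_SAMPLES, 195.51)))
```

Because the log prints elapsed time rounded to two decimals, the recomputed rate agrees with the logged 6552.99 images/sec only to within a fraction of an image per second.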

LitData | RAW | 168 GB

⚡~/litData/benchmarks aws s3 cp --recursive s3://grid-cloud-litng-ai-03/projects/01j12bk075bs7e771jhmgvz7eg/datasets/imagenet-1m-litdata/train_256_raw_pil data/imagenet-1m-litdata/train_256_raw_pil

⚡~/litData/benchmarks python litdata/stream_imagenet.py --input_dir data/imagenet-1m-litdata/train_256_raw_pil --use_pil 
Seed set to 42
[INFO] Running streaming benchmark with arguments: Namespace(input_dir='data/imagenet-1m-litdata/train_256_raw_pil', cache_dir='/cache/chunks', dtype='float32', batch_size=256, num_workers=32, drop_last=False, epochs=2, max_cache_size='200GB', use_pil=True, clear_cache=True)
[INFO] Clearing cache directory: /cache/chunks
[INFO] Initializing streaming dataset from: data/imagenet-1m-litdata/train_256_raw_pil
[INFO] Starting benchmark for 2 epoch(s) with batch size 256 and 32 workers.
Epoch 1/2: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5005/5005 [03:12<00:00, 25.97it/s]
[RESULT] Epoch 1: Streamed 1281167 samples in 192.74s (6647.00 images/sec)
Epoch 2/2: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5005/5005 [03:20<00:00, 24.99it/s]
[RESULT] Epoch 2: Streamed 1281167 samples in 200.24s (6398.02 images/sec)
[INFO] Clearing cache directory after benchmark: /cache/chunks
[INFO] Finished streaming benchmark.
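
For readers who want to reproduce this kind of measurement with another loader, the per-epoch timing can be sketched as below. This is not the PR's stream_imagenet.py: `benchmark_epochs` and `dummy_loader` are hypothetical names, and the dummy loader only stands in for a real dataloader that yields batches.

```python
import time


def benchmark_epochs(make_loader, epochs=2):
    """Iterate make_loader() once per epoch; return (samples, seconds, rate) per epoch."""
    results = []
    for epoch in range(1, epochs + 1):
        num_samples = 0
        start = time.perf_counter()
        for batch in make_loader():
            num_samples += len(batch)  # count samples, not batches
        elapsed = time.perf_counter() - start
        rate = num_samples / elapsed if elapsed > 0 else float("inf")
        results.append((num_samples, elapsed, rate))
        print(
            f"[RESULT] Epoch {epoch}: Streamed {num_samples} samples "
            f"in {elapsed:.2f}s ({rate:.2f} images/sec)"
        )
    return results


def dummy_loader(num_samples=1000, batch_size=256):
    """Stand-in loader: yields lists of indices, keeping the last partial batch."""
    for start in range(0, num_samples, batch_size):
        yield list(range(start, min(start + batch_size, num_samples)))


results = benchmark_epochs(dummy_loader, epochs=2)
```

Timing the whole epoch with a monotonic clock keeps the measurement independent of progress-bar overhead estimates; a real benchmark would also need to control caching between runs, as the clear_cache and os_cache settings in these logs do.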

FFCV | JPEG 90% | 20 GB

⚡~/litData/benchmarks aws s3 cp s3://grid-cloud-litng-ai-03/projects/01j12bk075bs7e771jhmgvz7eg/datasets/imagenet-1m-ffcv/train_256_100.0_90.ffcv data/imagenet-1m-ffcv/train_256_100.0_90.ffcv

⚡~/litData/benchmarks python ffcv/stream_imagenet.py --cfg.data_path data/imagenet-1m-ffcv/train_256_100.0_90.ffcv --cfg.os_cache True 
┌ Arguments defined───────────────────────────────────────────────┐
│ Parameter       │ Value                                         │
├─────────────────┼───────────────────────────────────────────────┤
│ cfg.data_path   │ data/imagenet-1m-ffcv/train_256_100.0_90.ffcv │
│ cfg.batch_size  │ 256                                           │
│ cfg.num_workers │ 32                                            │
│ cfg.drop_last   │ False                                         │
│ cfg.epochs      │ 2                                             │
│ cfg.order       │ SEQUENTIAL                                    │
│ cfg.os_cache    │ True                                          │
│ cfg.normalize   │ False                                         │
└─────────────────┴───────────────────────────────────────────────┘
Seed set to 42
[INFO] Starting streaming benchmark...
Epoch 1/2: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5005/5005 [02:47<00:00, 29.90it/s]
[RESULT] Epoch 1: Streamed 1281167 samples in 167.40s (7653.53 images/sec)
Epoch 2/2: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5005/5005 [02:39<00:00, 31.45it/s]
[RESULT] Epoch 2: Streamed 1281167 samples in 159.12s (8051.64 images/sec)
[INFO] Finished streaming benchmark.

⚡~/litData/benchmarks python ffcv/stream_imagenet.py --cfg.data_path data/imagenet-1m-ffcv/train_256_100.0_90.ffcv
┌ Arguments defined───────────────────────────────────────────────┐
│ Parameter       │ Value                                         │
├─────────────────┼───────────────────────────────────────────────┤
│ cfg.data_path   │ data/imagenet-1m-ffcv/train_256_100.0_90.ffcv │
│ cfg.batch_size  │ 256                                           │
│ cfg.num_workers │ 32                                            │
│ cfg.drop_last   │ False                                         │
│ cfg.epochs      │ 2                                             │
│ cfg.order       │ SEQUENTIAL                                    │
│ cfg.os_cache    │ False                                         │
│ cfg.normalize   │ False                                         │
└─────────────────┴───────────────────────────────────────────────┘
Seed set to 42
[INFO] Starting streaming benchmark...
Epoch 1/2: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5005/5005 [02:37<00:00, 31.84it/s]
[RESULT] Epoch 1: Streamed 1281167 samples in 157.21s (8149.52 images/sec)
Epoch 2/2: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5005/5005 [02:28<00:00, 33.62it/s]
[RESULT] Epoch 2: Streamed 1281167 samples in 148.85s (8607.16 images/sec)
[INFO] Finished streaming benchmark.

FFCV | RAW | 170 GB

⚡~/litData/benchmarks aws s3 cp s3://grid-cloud-litng-ai-03/projects/01j12bk075bs7e771jhmgvz7eg/datasets/imagenet-1m-ffcv/train_256_0.0_100.ffcv data/imagenet-1m-ffcv/train_256_0.0_100.ffcv

⚡~/litData/benchmarks python ffcv/stream_imagenet.py --cfg.data_path data/imagenet-1m-ffcv/train_256_0.0_100.ffcv --cfg.os_cache True
┌ Arguments defined──────────────────────────────────────────────┐
│ Parameter       │ Value                                        │
├─────────────────┼──────────────────────────────────────────────┤
│ cfg.data_path   │ data/imagenet-1m-ffcv/train_256_0.0_100.ffcv │
│ cfg.batch_size  │ 256                                          │
│ cfg.num_workers │ 32                                           │
│ cfg.drop_last   │ False                                        │
│ cfg.epochs      │ 2                                            │
│ cfg.order       │ SEQUENTIAL                                   │
│ cfg.os_cache    │ True                                         │
│ cfg.normalize   │ False                                        │
└─────────────────┴──────────────────────────────────────────────┘
Seed set to 42
[INFO] Starting streaming benchmark...
Epoch 1/2: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5005/5005 [02:56<00:00, 28.38it/s]
[RESULT] Epoch 1: Streamed 1281167 samples in 176.37s (7263.97 images/sec)
Epoch 2/2: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5005/5005 [03:11<00:00, 26.17it/s]
[RESULT] Epoch 2: Streamed 1281167 samples in 191.26s (6698.55 images/sec)
[INFO] Finished streaming benchmark.

⚡~/litData/benchmarks python ffcv/stream_imagenet.py --cfg.data_path data/imagenet-1m-ffcv/train_256_0.0_100.ffcv
┌ Arguments defined──────────────────────────────────────────────┐
│ Parameter       │ Value                                        │
├─────────────────┼──────────────────────────────────────────────┤
│ cfg.data_path   │ data/imagenet-1m-ffcv/train_256_0.0_100.ffcv │
│ cfg.batch_size  │ 256                                          │
│ cfg.num_workers │ 32                                           │
│ cfg.drop_last   │ False                                        │
│ cfg.epochs      │ 2                                            │
│ cfg.order       │ SEQUENTIAL                                   │
│ cfg.os_cache    │ False                                        │
│ cfg.normalize   │ False                                        │
└─────────────────┴──────────────────────────────────────────────┘
Seed set to 42
[INFO] Starting streaming benchmark...
Epoch 1/2: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5005/5005 [02:49<00:00, 29.52it/s]
[RESULT] Epoch 1: Streamed 1281167 samples in 169.54s (7556.78 images/sec)
Epoch 2/2: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5005/5005 [02:36<00:00, 31.92it/s]
[RESULT] Epoch 2: Streamed 1281167 samples in 156.82s (8169.70 images/sec)
[INFO] Finished streaming benchmark.
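
Since every run prints its results in the same [RESULT] format, the figures for the summary table can be collected from saved logs instead of being copied by hand. A minimal sketch, assuming the log text has been captured as a string; `parse_results` is a hypothetical helper, not part of the PR:

```python
import re

# Matches lines such as:
# [RESULT] Epoch 1: Streamed 1281167 samples in 169.54s (7556.78 images/sec)
RESULT_RE = re.compile(
    r"\[RESULT\] Epoch (\d+): Streamed (\d+) samples in ([\d.]+)s "
    r"\(([\d.]+) images/sec\)"
)


def parse_results(log_text):
    """Extract one dict per [RESULT] line found in the log text."""
    rows = []
    for match in RESULT_RE.finditer(log_text):
        rows.append(
            {
                "epoch": int(match.group(1)),
                "samples": int(match.group(2)),
                "seconds": float(match.group(3)),
                "images_per_sec": float(match.group(4)),
            }
        )
    return rows


sample_log = """\
[INFO] Starting streaming benchmark...
[RESULT] Epoch 1: Streamed 1281167 samples in 169.54s (7556.78 images/sec)
[RESULT] Epoch 2: Streamed 1281167 samples in 156.82s (8169.70 images/sec)
[INFO] Finished streaming benchmark.
"""

for row in parse_results(sample_log):
    print(row)
```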

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

codecov · 2025-04-27T21:57:12Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79%. Comparing base (de1ccce) to head (d2bc95d).
Report is 1 commit behind head on main.

Additional details and impacted files

@@         Coverage Diff         @@
##           main   #572   +/-   ##
===================================
  Coverage    79%    79%           
===================================
  Files        41     41           
  Lines      6135   6135           
===================================
  Hits       4835   4835           
  Misses     1300   1300

tchaton

Can you add a folder with the files to do the benchmarks?

bhimrazy · 2025-05-13T18:44:37Z

> Can you add a folder with the files to do the benchmarks?

Sure @tchaton.

Copilot

Pull Request Overview

This PR introduces benchmark scripts for streaming and optimizing the ImageNet dataset using both LitData and FFCV, and updates documentation with usage instructions and a performance comparison.

Added LitData scripts: dataset optimization and streaming benchmarks.
Added FFCV scripts: dataset conversion, writing to FFCV format, streaming benchmarks, and installer.
Updated READMEs and main documentation with usage examples and a local performance comparison table.

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

File	Description
benchmarks/litdata/stream_imagenet.py	New LitData streaming benchmark script
benchmarks/litdata/optimize_imagenet.py	New LitData dataset optimization script
benchmarks/litdata/README.md	Usage docs for LitData scripts
benchmarks/ffcv/write_imagenet.py	Script to write datasets in FFCV format
benchmarks/ffcv/stream_imagenet.py	Script to stream and benchmark FFCV datasets
benchmarks/ffcv/convert_imagenet.py	Script to convert raw ImageNet synset folders
benchmarks/ffcv/install_ffcv.sh	Installer for FFCV dependencies
benchmarks/ffcv/README.md	Usage docs for FFCV scripts
benchmarks/README.md	Top-level benchmarks overview
README.md	Added local disk performance comparison table

Comments suppressed due to low confidence (1)

benchmarks/litdata/optimize_imagenet.py:103

When --resize is enabled but --resize_size is not provided, resize_size defaults to None and no resizing occurs silently. Consider validating that resize_size is provided when --resize is set and erroring otherwise.

parser.add_argument("--resize_size", type=int, nargs="+", default=None, help="Resize size: int for max dimension (aspect ratio preserved), or two ints for (width height)")
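
One way to implement the validation suggested above is to check the parsed arguments and call `parser.error`, which prints a usage message and exits with status 2. This is a sketch under assumptions: `--resize` is taken to be a boolean `store_true` flag (its definition is not shown in this thread), and `build_parser` and `parse_args` are hypothetical helper names rather than the script's actual structure.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch of resize argument validation")
    # Assumption: --resize is a simple on/off flag in the real script.
    parser.add_argument("--resize", action="store_true", help="Enable resizing")
    parser.add_argument(
        "--resize_size",
        type=int,
        nargs="+",
        default=None,
        help="Resize size: int for max dimension (aspect ratio preserved), "
        "or two ints for (width height)",
    )
    return parser


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Fail loudly instead of silently skipping the resize.
    if args.resize and args.resize_size is None:
        parser.error("--resize requires --resize_size")
    # nargs="+" accepts any count, so restrict it to one or two values.
    if args.resize_size is not None and len(args.resize_size) not in (1, 2):
        parser.error("--resize_size takes one or two integers")
    return args


args = parse_args(["--resize", "--resize_size", "256"])
print(args.resize_size)
```

With this check, `parse_args(["--resize"])` exits with an error message rather than running without resizing.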

tchaton

Real nice!

docs: Add performance comparison for streaming Imagenet dataset

fb4fa8d

bhimrazy requested review from tchaton, lantiga and justusschock as code owners April 27, 2025 21:38

Merge branch 'main' into docs/litdata-vs-ffcv-benchmarks

b4cb1c3

tchaton reviewed May 13, 2025

bhimrazy added 7 commits May 14, 2025 06:51

add optimize script for imagenet

2f504cb

update optimize script

bd8a087

fix: update print statements for clarity in optimize_imagenet script

7cc6df7

feat: add streaming benchmark for ImageNet dataset with support for JPEG and PIL formats

b636c00

update optimize script

abd3e15

add readme

05a24e3

Merge branch 'main' into docs/litdata-vs-ffcv-benchmarks

0bb5a61

bhimrazy marked this pull request as draft May 14, 2025 08:21

pre-commit-ci bot and others added 14 commits May 14, 2025 08:23

[pre-commit.ci] auto fixes from pre-commit.com hooks

9106307

for more information, see https://pre-commit.ci

update litdata optimize

bf78b3f

add script to convert dataset

d42be4f

add write and stream script

f3e9123

add install script

b6150c3

update to include dropt last

1bddc9d

update readme

df12160

add readme

611c256

update readme

efc4411

update readme

9a96239

update readme

b1d0b7d

update readme

0d78a16

[pre-commit.ci] auto fixes from pre-commit.com hooks

92dbc26

for more information, see https://pre-commit.ci

add missing docstring

be6b007

bhimrazy changed the title ~~[wip] docs: Add performance comparison for streaming Imagenet dataset~~ [wip] docs: Add benchmark scripts and performance comparison (litdata vs ffcv) for streaming Imagenet dataset May 15, 2025

bhimrazy changed the title ~~[wip] docs: Add benchmark scripts and performance comparison (litdata vs ffcv) for streaming Imagenet dataset~~ [WIP] Docs: Add Benchmark Scripts and Performance Comparison of LitData vs FFCV for Streaming ImageNet May 15, 2025

bhimrazy changed the title ~~[WIP] Docs: Add Benchmark Scripts and Performance Comparison of LitData vs FFCV for Streaming ImageNet~~ [wip] docs: Add Benchmark Scripts and Performance Comparison of LitData vs FFCV for Streaming ImageNet May 15, 2025

bhimrazy requested review from Copilot and grid-ai May 15, 2025 09:23

Copilot AI reviewed May 15, 2025

benchmarks/litdata/stream_imagenet.py (outdated, resolved)

benchmarks/ffcv/stream_imagenet.py (outdated, resolved)

benchmarks/ffcv/stream_imagenet.py (outdated, resolved)

benchmarks/ffcv/README.md (outdated, resolved)

bhimrazy and others added 10 commits May 17, 2025 05:13

update ffcv stream

d7b1b5a

update stream litdata

1c6adff

update

e27f29f

update stream for ffcv

1f4c5ba

update script

3804b5c

add gitignore

6da072d

update

09bfe23

update

34141b7

update

eea69e8

[pre-commit.ci] auto fixes from pre-commit.com hooks

d2bc95d

for more information, see https://pre-commit.ci

bhimrazy requested review from deependujha and tchaton May 17, 2025 05:13

bhimrazy changed the title ~~[wip] docs: Add Benchmark Scripts and Performance Comparison of LitData vs FFCV for Streaming ImageNet~~ docs: Add Benchmark Scripts and Performance Comparison of LitData vs FFCV for Streaming ImageNet May 17, 2025

bhimrazy marked this pull request as ready for review May 17, 2025 05:14

bhimrazy changed the title ~~docs: Add Benchmark Scripts and Performance Comparison of LitData vs FFCV for Streaming ImageNet~~ chore: Add Benchmark Scripts and Performance Comparison of LitData vs FFCV for Streaming ImageNet May 17, 2025

tchaton approved these changes May 17, 2025

deependujha reviewed May 17, 2025

benchmarks/litdata/stream_imagenet.py (resolved)

deependujha reviewed May 17, 2025

benchmarks/litdata/stream_imagenet.py (resolved)

deependujha reviewed May 17, 2025

benchmarks/litdata/optimize_imagenet.py (resolved)

deependujha approved these changes May 17, 2025

bhimrazy merged commit f4f0b7f into Lightning-AI:main May 17, 2025
42 checks passed

bhimrazy deleted the docs/litdata-vs-ffcv-benchmarks branch May 17, 2025 14:01

bhimrazy restored the docs/litdata-vs-ffcv-benchmarks branch May 25, 2025 17:30

bhimrazy deleted the docs/litdata-vs-ffcv-benchmarks branch May 25, 2025 17:57
