chore: Add Benchmark Scripts and Performance Comparison of LitData vs FFCV for Streaming ImageNet #572

Merged
bhimrazy merged 34 commits into Lightning-AI:main from docs/litdata-vs-ffcv-benchmarks on May 17, 2025

@bhimrazy (Collaborator) commented Apr 27, 2025

What does this PR do?

  • Introduces benchmark scripts for streaming ImageNet using LitData and FFCV
  • Provides a performance comparison between the two frameworks

Benchmarks for LitData vs FFCV

Speed to stream ImageNet 1.2M from local disk with FFCV vs LitData:

| Framework | Dataset Mode | Dataset Size @ 256px | Images / sec 1st Epoch (float32) | Images / sec 2nd Epoch (float32) |
|---|---|---|---|---|
| LitData | PIL RAW | 168 GB | 6647 | 6398 |
| LitData | JPEG 90% | 12 GB | 6553 | 6537 |
| ffcv (os_cache=True) | RAW | 170 GB | 7263 | 6698 |
| ffcv (os_cache=False) | RAW | 170 GB | 7556 | 8169 |
| ffcv (os_cache=True) | JPEG 90% | 20 GB | 7653 | 8051 |
| ffcv (os_cache=False) | JPEG 90% | 20 GB | 8149 | 8607 |

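
To make the gap between the frameworks easier to read, the second-epoch numbers from the table can be expressed as a fraction of the fastest configuration for each dataset mode. This is only a small sketch over the figures reported above; the dictionary keys and the helper name `relative_to_best` are illustrative and not part of the PR.

```python
# Second-epoch throughput (images/sec), copied from the comparison table.
second_epoch = {
    ("LitData", "RAW"): 6398,
    ("LitData", "JPEG 90%"): 6537,
    ("ffcv os_cache=True", "RAW"): 6698,
    ("ffcv os_cache=False", "RAW"): 8169,
    ("ffcv os_cache=True", "JPEG 90%"): 8051,
    ("ffcv os_cache=False", "JPEG 90%"): 8607,
}


def relative_to_best(results, mode):
    """Return each configuration's throughput as a fraction of the fastest one for `mode`."""
    rows = {fw: value for (fw, m), value in results.items() if m == mode}
    best = max(rows.values())
    return {fw: value / best for fw, value in rows.items()}


for mode in ("RAW", "JPEG 90%"):
    for framework, fraction in relative_to_best(second_epoch, mode).items():
        print(f"{mode:9s} {framework:20s} {fraction:.2f}")
```

On these numbers, LitData's second epoch lands at roughly three quarters of the fastest FFCV configuration in both modes; a single run per configuration is not enough to say how stable that ratio is.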
Benchmark Logs

LitData | JPEG 90% | 12 GB

~/litData/benchmarks aws s3 cp --recursive s3://grid-cloud-litng-ai-03/projects/01j12bk075bs7e771jhmgvz7eg/datasets/imagenet-1m-litdata/train_256_jpg_90 data/imagenet-1m-litdata/train_256_jpg_90
~/litData/benchmarks python litdata/stream_imagenet.py --input_dir data/imagenet-1m-litdata/train_256_jpg_90
[INFO] Running streaming benchmark with arguments: Namespace(input_dir='data/imagenet-1m-litdata/train_256_jpg_90', cache_dir='/cache/chunks', dtype='float32', batch_size=256, num_workers=32, drop_last=False, epochs=2, max_cache_size='200GB', use_pil=False, clear_cache=True)
Seed set to 42
[INFO] Clearing cache directory: /cache/chunks
[INFO] Initializing streaming dataset from: data/imagenet-1m-litdata/train_256_jpg_90
[INFO] Starting benchmark for 2 epoch(s) with batch size 256 and 32 workers.
Epoch 1/2: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5005/5005 [03:15<00:00, 25.60it/s]
[RESULT] Epoch 1: Streamed 1281167 samples in 195.51s (6552.99 images/sec)
Epoch 2/2: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5005/5005 [03:15<00:00, 25.54it/s]
[RESULT] Epoch 2: Streamed 1281167 samples in 195.99s (6536.99 images/sec)
[INFO] Clearing cache directory after benchmark: /cache/chunks
[INFO] Finished streaming benchmark.
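
The `[RESULT]` lines carry enough information to cross-check the reported throughput. The sketch below parses them with a regular expression written against the log format shown above (the pattern and the helper name `parse_results` are assumptions, not code from the PR) and recomputes images/sec from the sample count and elapsed time. It also checks that 1,281,167 samples at batch size 256 with `drop_last=False` give the 5005 iterations seen in the progress bar.

```python
import math
import re

# Pattern written to match the [RESULT] lines in the logs above.
RESULT_RE = re.compile(
    r"\[RESULT\] Epoch (\d+): Streamed (\d+) samples in ([\d.]+)s \(([\d.]+) images/sec\)"
)


def parse_results(log_text):
    """Extract per-epoch results and recompute throughput from samples / seconds."""
    rows = []
    for match in RESULT_RE.finditer(log_text):
        samples = int(match[2])
        seconds = float(match[3])
        rows.append(
            {
                "epoch": int(match[1]),
                "samples": samples,
                "seconds": seconds,
                "reported": float(match[4]),
                "recomputed": samples / seconds,
            }
        )
    return rows


log = """
[RESULT] Epoch 1: Streamed 1281167 samples in 195.51s (6552.99 images/sec)
[RESULT] Epoch 2: Streamed 1281167 samples in 195.99s (6536.99 images/sec)
"""

for row in parse_results(log):
    print(row["epoch"], f"{row['recomputed']:.2f}", row["reported"])

# With drop_last=False the final partial batch is kept, so the count rounds up.
print(math.ceil(1281167 / 256))
```

The recomputed values differ from the logged ones by a small fraction of an image per second, which is consistent with the elapsed time being rounded to two decimals in the log.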

LitData | RAW | 168 GB

~/litData/benchmarks aws s3 cp --recursive s3://grid-cloud-litng-ai-03/projects/01j12bk075bs7e771jhmgvz7eg/datasets/imagenet-1m-litdata/train_256_raw_pil data/imagenet-1m-litdata/train_256_raw_pil
~/litData/benchmarks python litdata/stream_imagenet.py --input_dir data/imagenet-1m-litdata/train_256_raw_pil --use_pil
Seed set to 42
[INFO] Running streaming benchmark with arguments: Namespace(input_dir='data/imagenet-1m-litdata/train_256_raw_pil', cache_dir='/cache/chunks', dtype='float32', batch_size=256, num_workers=32, drop_last=False, epochs=2, max_cache_size='200GB', use_pil=True, clear_cache=True)
[INFO] Clearing cache directory: /cache/chunks
[INFO] Initializing streaming dataset from: data/imagenet-1m-litdata/train_256_raw_pil
[INFO] Starting benchmark for 2 epoch(s) with batch size 256 and 32 workers.
Epoch 1/2: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5005/5005 [03:12<00:00, 25.97it/s]
[RESULT] Epoch 1: Streamed 1281167 samples in 192.74s (6647.00 images/sec)
Epoch 2/2: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5005/5005 [03:20<00:00, 24.99it/s]
[RESULT] Epoch 2: Streamed 1281167 samples in 200.24s (6398.02 images/sec)
[INFO] Clearing cache directory after benchmark: /cache/chunks
[INFO] Finished streaming benchmark.

FFCV | JPEG 90% | 20 GB

~/litData/benchmarks aws s3 cp s3://grid-cloud-litng-ai-03/projects/01j12bk075bs7e771jhmgvz7eg/datasets/imagenet-1m-ffcv/train_256_100.0_90.ffcv data/imagenet-1m-ffcv/train_256_100.0_90.ffcv
~/litData/benchmarks python ffcv/stream_imagenet.py --cfg.data_path data/imagenet-1m-ffcv/train_256_100.0_90.ffcv --cfg.os_cache True
┌Arguments defined┬───────────────────────────────────────────────┐
│ Parameter       │ Value                                         │
├─────────────────┼───────────────────────────────────────────────┤
│ cfg.data_path   │ data/imagenet-1m-ffcv/train_256_100.0_90.ffcv │
│ cfg.batch_size  │ 256                                           │
│ cfg.num_workers │ 32                                            │
│ cfg.drop_last   │ False                                         │
│ cfg.epochs      │ 2                                             │
│ cfg.order       │ SEQUENTIAL                                    │
│ cfg.os_cache    │ True                                          │
│ cfg.normalize   │ False                                         │
└─────────────────┴───────────────────────────────────────────────┘
Seed set to 42
[INFO] Starting streaming benchmark...
Epoch 1/2: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5005/5005 [02:47<00:00, 29.90it/s]
[RESULT] Epoch 1: Streamed 1281167 samples in 167.40s (7653.53 images/sec)
Epoch 2/2: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5005/5005 [02:39<00:00, 31.45it/s]
[RESULT] Epoch 2: Streamed 1281167 samples in 159.12s (8051.64 images/sec)
[INFO] Finished streaming benchmark.

⚡ ~/litData/benchmarks python ffcv/stream_imagenet.py --cfg.data_path data/imagenet-1m-ffcv/train_256_100.0_90.ffcv
┌Arguments defined┬───────────────────────────────────────────────┐
│ Parameter       │ Value                                         │
├─────────────────┼───────────────────────────────────────────────┤
│ cfg.data_path   │ data/imagenet-1m-ffcv/train_256_100.0_90.ffcv │
│ cfg.batch_size  │ 256                                           │
│ cfg.num_workers │ 32                                            │
│ cfg.drop_last   │ False                                         │
│ cfg.epochs      │ 2                                             │
│ cfg.order       │ SEQUENTIAL                                    │
│ cfg.os_cache    │ False                                         │
│ cfg.normalize   │ False                                         │
└─────────────────┴───────────────────────────────────────────────┘
Seed set to 42
[INFO] Starting streaming benchmark...
Epoch 1/2: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5005/5005 [02:37<00:00, 31.84it/s]
[RESULT] Epoch 1: Streamed 1281167 samples in 157.21s (8149.52 images/sec)
Epoch 2/2: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5005/5005 [02:28<00:00, 33.62it/s]
[RESULT] Epoch 2: Streamed 1281167 samples in 148.85s (8607.16 images/sec)
[INFO] Finished streaming benchmark.

FFCV | RAW | 170 GB

~/litData/benchmarks aws s3 cp s3://grid-cloud-litng-ai-03/projects/01j12bk075bs7e771jhmgvz7eg/datasets/imagenet-1m-ffcv/train_256_0.0_100.ffcv data/imagenet-1m-ffcv/train_256_0.0_100.ffcv
~/litData/benchmarks python ffcv/stream_imagenet.py --cfg.data_path data/imagenet-1m-ffcv/train_256_0.0_100.ffcv --cfg.os_cache True
┌Arguments defined┬──────────────────────────────────────────────┐
│ Parameter       │ Value                                        │
├─────────────────┼──────────────────────────────────────────────┤
│ cfg.data_path   │ data/imagenet-1m-ffcv/train_256_0.0_100.ffcv │
│ cfg.batch_size  │ 256                                          │
│ cfg.num_workers │ 32                                           │
│ cfg.drop_last   │ False                                        │
│ cfg.epochs      │ 2                                            │
│ cfg.order       │ SEQUENTIAL                                   │
│ cfg.os_cache    │ True                                         │
│ cfg.normalize   │ False                                        │
└─────────────────┴──────────────────────────────────────────────┘
Seed set to 42
[INFO] Starting streaming benchmark...
Epoch 1/2: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5005/5005 [02:56<00:00, 28.38it/s]
[RESULT] Epoch 1: Streamed 1281167 samples in 176.37s (7263.97 images/sec)
Epoch 2/2: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5005/5005 [03:11<00:00, 26.17it/s]
[RESULT] Epoch 2: Streamed 1281167 samples in 191.26s (6698.55 images/sec)
[INFO] Finished streaming benchmark.

⚡ ~/litData/benchmarks python ffcv/stream_imagenet.py --cfg.data_path data/imagenet-1m-ffcv/train_256_0.0_100.ffcv
┌Arguments defined┬──────────────────────────────────────────────┐
│ Parameter       │ Value                                        │
├─────────────────┼──────────────────────────────────────────────┤
│ cfg.data_path   │ data/imagenet-1m-ffcv/train_256_0.0_100.ffcv │
│ cfg.batch_size  │ 256                                          │
│ cfg.num_workers │ 32                                           │
│ cfg.drop_last   │ False                                        │
│ cfg.epochs      │ 2                                            │
│ cfg.order       │ SEQUENTIAL                                   │
│ cfg.os_cache    │ False                                        │
│ cfg.normalize   │ False                                        │
└─────────────────┴──────────────────────────────────────────────┘
Seed set to 42
[INFO] Starting streaming benchmark...
Epoch 1/2: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5005/5005 [02:49<00:00, 29.52it/s]
[RESULT] Epoch 1: Streamed 1281167 samples in 169.54s (7556.78 images/sec)
Epoch 2/2: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5005/5005 [02:36<00:00, 31.92it/s]
[RESULT] Epoch 2: Streamed 1281167 samples in 156.82s (8169.70 images/sec)
[INFO] Finished streaming benchmark.
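
For readers who want to see what a measurement of this shape involves, here is a minimal, framework-agnostic sketch of a per-epoch throughput loop. It is not the `stream_imagenet.py` script from this PR: the function name `measure_throughput` is hypothetical, and it assumes only that the loader is a re-iterable object yielding batches that support `len()`. The print format mirrors the `[RESULT]` lines in the logs above.

```python
import time


def measure_throughput(batches, epochs=2):
    """Iterate over `batches` once per epoch and report samples per second.

    `batches` must be re-iterable (for example a list or a data loader),
    and each batch must support len().
    """
    results = []
    for epoch in range(1, epochs + 1):
        samples = 0
        start = time.perf_counter()
        for batch in batches:
            samples += len(batch)
        elapsed = time.perf_counter() - start
        # Guard against a zero-length interval on very small inputs.
        rate = samples / elapsed if elapsed > 0 else float("inf")
        print(
            f"[RESULT] Epoch {epoch}: Streamed {samples} samples "
            f"in {elapsed:.2f}s ({rate:.2f} images/sec)"
        )
        results.append((samples, elapsed))
    return results


# Toy stand-in for a loader: three full batches of 256 and one partial batch.
toy_loader = [[0] * 256] * 3 + [[0] * 143]
measure_throughput(toy_loader, epochs=2)
```

A loop like this measures loading only, with no model step, which matches how the numbers above are labelled (images streamed per second).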

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

codecov bot commented Apr 27, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79%. Comparing base (de1ccce) to head (d2bc95d).
Report is 1 commits behind head on main.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #572   +/-   ##
===================================
  Coverage    79%    79%           
===================================
  Files        41     41           
  Lines      6135   6135           
===================================
  Hits       4835   4835           
  Misses     1300   1300           

@tchaton (Collaborator) left a comment

Can you add a folder with the files to do the benchmarks?

@bhimrazy (Collaborator, Author) replied, quoting @tchaton:

> Can you add a folder with the files to do the benchmarks?

Sure @tchaton.

@bhimrazy bhimrazy marked this pull request as draft May 14, 2025 08:21
@bhimrazy bhimrazy changed the title from "[wip] docs: Add performance comparison for streaming Imagenet dataset" to "[wip] docs: Add benchmark scripts and performance comparison (litdata vs ffcv) for streaming Imagenet dataset" May 15, 2025
@bhimrazy bhimrazy changed the title from "[wip] docs: Add benchmark scripts and performance comparison (litdata vs ffcv) for streaming Imagenet dataset" to "[WIP] Docs: Add Benchmark Scripts and Performance Comparison of LitData vs FFCV for Streaming ImageNet" May 15, 2025
@bhimrazy bhimrazy changed the title from "[WIP] Docs: Add Benchmark Scripts and Performance Comparison of LitData vs FFCV for Streaming ImageNet" to "[wip] docs: Add Benchmark Scripts and Performance Comparison of LitData vs FFCV for Streaming ImageNet" May 15, 2025
@bhimrazy bhimrazy requested review from Copilot and grid-ai May 15, 2025 09:23
@Copilot (Contributor) left a comment

Pull Request Overview

This PR introduces benchmark scripts for streaming and optimizing the ImageNet dataset using both LitData and FFCV, and updates documentation with usage instructions and a performance comparison.

  • Added LitData scripts: dataset optimization and streaming benchmarks.
  • Added FFCV scripts: dataset conversion, writing to FFCV format, streaming benchmarks, and installer.
  • Updated READMEs and main documentation with usage examples and a local performance comparison table.

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Summary per file:

| File | Description |
|---|---|
| benchmarks/litdata/stream_imagenet.py | New LitData streaming benchmark script |
| benchmarks/litdata/optimize_imagenet.py | New LitData dataset optimization script |
| benchmarks/litdata/README.md | Usage docs for LitData scripts |
| benchmarks/ffcv/write_imagenet.py | Script to write datasets in FFCV format |
| benchmarks/ffcv/stream_imagenet.py | Script to stream and benchmark FFCV datasets |
| benchmarks/ffcv/convert_imagenet.py | Script to convert raw ImageNet synset folders |
| benchmarks/ffcv/install_ffcv.sh | Installer for FFCV dependencies |
| benchmarks/ffcv/README.md | Usage docs for FFCV scripts |
| benchmarks/README.md | Top-level benchmarks overview |
| README.md | Added local disk performance comparison table |

Comments suppressed due to low confidence (1)

benchmarks/litdata/optimize_imagenet.py:103

  • When --resize is enabled but --resize_size is not provided, resize_size defaults to None and no resizing occurs silently. Consider validating that resize_size is provided when --resize is set and erroring otherwise.
parser.add_argument("--resize_size", type=int, nargs="+", default=None, help="Resize size: int for max dimension (aspect ratio preserved), or two ints for (width height)")
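
One way to act on the suppressed suggestion is to validate the two options together right after parsing. The sketch below is not the code in `optimize_imagenet.py`: it assumes `--resize` is a boolean `store_true` flag (the PR excerpt only shows `--resize_size`), and the wrapper name `parse_args` is hypothetical. It uses `parser.error`, which prints a usage message and exits with status 2.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Resize option validation sketch")
    # Assumption: --resize is a plain on/off flag in the real script.
    parser.add_argument("--resize", action="store_true", help="Enable resizing")
    parser.add_argument(
        "--resize_size",
        type=int,
        nargs="+",
        default=None,
        help="Resize size: int for max dimension (aspect ratio preserved), "
        "or two ints for (width height)",
    )
    args = parser.parse_args(argv)

    # Fail loudly instead of silently skipping the resize.
    if args.resize and args.resize_size is None:
        parser.error("--resize requires --resize_size")
    # The help text allows one or two integers; reject anything else.
    if args.resize_size is not None and len(args.resize_size) not in (1, 2):
        parser.error("--resize_size takes one or two integers")
    return args


args = parse_args(["--resize", "--resize_size", "256"])
print(args.resize, args.resize_size)
```

Calling `parse_args(["--resize"])` with this sketch exits with an error rather than continuing without resizing.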

@bhimrazy bhimrazy requested review from deependujha and tchaton May 17, 2025 05:13
@bhimrazy bhimrazy changed the title from "[wip] docs: Add Benchmark Scripts and Performance Comparison of LitData vs FFCV for Streaming ImageNet" to "docs: Add Benchmark Scripts and Performance Comparison of LitData vs FFCV for Streaming ImageNet" May 17, 2025
@bhimrazy bhimrazy marked this pull request as ready for review May 17, 2025 05:14
@bhimrazy bhimrazy changed the title from "docs: Add Benchmark Scripts and Performance Comparison of LitData vs FFCV for Streaming ImageNet" to "chore: Add Benchmark Scripts and Performance Comparison of LitData vs FFCV for Streaming ImageNet" May 17, 2025
@tchaton (Collaborator) left a comment

Real nice !

@bhimrazy bhimrazy merged commit f4f0b7f into Lightning-AI:main May 17, 2025
42 checks passed
@bhimrazy bhimrazy deleted the docs/litdata-vs-ffcv-benchmarks branch May 17, 2025 14:01
@bhimrazy bhimrazy restored the docs/litdata-vs-ffcv-benchmarks branch May 25, 2025 17:30
@bhimrazy bhimrazy deleted the docs/litdata-vs-ffcv-benchmarks branch May 25, 2025 17:57