
Conversation

@xinyuangui2 (Contributor) commented Jan 21, 2026

Summary

  • Add new s3_url data format that lists JPEG files from S3 and downloads images via map_batches

Changes

S3 URL Data Loading

  • New s3_url image format that:
    1. Lists JPEG files from s3://anyscale-imagenet/ILSVRC/Data/CLS-LOC using boto3
    2. Creates a Ray dataset from file records (path, class)
    3. Uses map_batches to download and process images from S3 URLs
  • This separates file listing from image downloading, enabling parallel downloads on CPU workers (a rough sketch of the flow is shown below)
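
A minimal sketch of this flow, assuming the class (wnid) is the parent directory name and files use the .JPEG suffix; helper names are placeholders rather than the exact benchmark code:

import io

import boto3
import numpy as np
import ray
from PIL import Image

BUCKET = "anyscale-imagenet"
PREFIX = "ILSVRC/Data/CLS-LOC"

def list_jpeg_records():
    # Step 1: list JPEG keys with boto3 (assumption: wnid = parent directory name).
    s3 = boto3.client("s3")
    records = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.upper().endswith(".JPEG"):
                records.append({"path": f"s3://{BUCKET}/{key}", "class": key.split("/")[-2]})
    return records

def download_and_decode(batch):
    # Step 3: download and decode inside map_batches; the resize stands in for the
    # benchmark's real transform.
    s3 = boto3.client("s3")
    images = []
    for url in batch["path"]:
        bucket, key = url[len("s3://"):].split("/", 1)
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        image = Image.open(io.BytesIO(body)).convert("RGB").resize((224, 224))
        images.append(np.asarray(image))
    batch["image"] = np.stack(images)
    return batch

# Step 2: build a Ray dataset from the (path, class) records, then download on CPU workers.
ds = ray.data.from_items(list_jpeg_records())
ds = ds.map_batches(download_and_decode)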

New Test Variations

  • Added s3_url variations to existing training_ingest_benchmark-task=image_classification test

Release test

Test                     Global throughput
skip_training.parquet    9717
skip_training.s3_url     5695

Test plan

  • Run name:^training_ingest_benchmark-task=image_classification\.skip_training\.s3_url$

@gemini-code-assist bot left a comment

Code Review

This pull request adds new release tests for training ingest benchmarks on heterogeneous clusters, with both fixed-size and autoscaling configurations. The changes include a new test definition in release/release_tests.yaml and two corresponding cluster compute configuration files. The configurations seem correct for testing heterogeneous setups. My main feedback is on the new test definition, which has significant duplication in the script commands across its variations. I've suggested a refactoring to improve maintainability.

@xinyuangui2 changed the title from "add autoscaling and heterogenous" to "add autoscaling and heterogenous + s3 url data factory" Jan 21, 2026
@xinyuangui2 changed the title from "add autoscaling and heterogenous + s3 url data factory" to "[Train][Data] Add heterogeneous cluster and S3 URL data loading benchmarks for training ingest" Jan 21, 2026
@xinyuangui2 added the go label (add ONLY when ready to merge, run all tests) Jan 22, 2026
@xinyuangui2 marked this pull request as ready for review January 22, 2026 18:10
@ray-gardener bot added the train (Ray Train Related Issue), data (Ray Data-related issues), and release-test (release test) labels Jan 22, 2026
@srinathk10 (Contributor) left a comment

LGTM

@srinathk10 (Contributor) commented:

@justinvyu Could you please also take a pass at this?

@srinathk10 requested a review from justinvyu January 22, 2026 21:25
@justinvyu (Contributor) left a comment

Thanks!

Comment on lines 197 to 222
for s3_url, wnid in zip(paths, classes):
    # Parse S3 URL: s3://bucket/key
    url_path = s3_url[5:] if s3_url.startswith("s3://") else s3_url
    bucket, key = url_path.split("/", 1)

    # Download image from S3
    response = s3_client.get_object(Bucket=bucket, Key=key)
    data = response["Body"].read()

    # Decode and transform image
    image_pil = Image.open(io.BytesIO(data)).convert("RGB")
    image_tensor = pil_to_tensor(image_pil) / 255.0
    processed_image = np.array(transform(image_tensor))
    processed_images.append(processed_image)
A contributor commented on this snippet:

this may be limiting throughput if we need to download one image at a time and then preprocess sequentially. is there an easy way to pipeline the downloading and transforming? can also do as a followup if needed, I just noticed the throughput is very low
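
One way to pipeline the two stages, sketched with the names from the snippet above (the thread count is a placeholder; too many threads can clash with Ray's own task parallelism):

from concurrent.futures import ThreadPoolExecutor

def fetch(s3_url):
    # I/O-bound step: parse the URL and pull the raw bytes from S3.
    bucket, key = s3_url[len("s3://"):].split("/", 1)
    return s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()

# Overlap downloads (threads, I/O-bound) with decode/transform (main thread, CPU-bound).
# max_workers=8 is an illustrative knob, not a recommended setting.
with ThreadPoolExecutor(max_workers=8) as pool:
    for data in pool.map(fetch, paths):
        image_pil = Image.open(io.BytesIO(data)).convert("RGB")
        image_tensor = pil_to_tensor(image_pil) / 255.0
        processed_images.append(np.array(transform(image_tensor)))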

The author (xinyuangui2) replied:

I think the low throughput is expected because downloading images inside map is inefficient in itself. I will add a note here.

@justinvyu (Contributor) commented:

Also, can you add some discussion about the results (are things as expected?) in the PR description?


@justinvyu (Contributor) left a comment

Can we use this download expression that Ray Data has instead of implementing our own downloading? This one should handle async fetching and streaming. cc @goutamvenkat-anyscale

https://docs.ray.io/en/master/data/transforming-data.html#expressions-alpha
https://docs.ray.io/en/latest/data/api/doc/ray.data.expressions.download.html#ray.data.expressions.download

This should also remove the need for the hardcoded DEFAULT_MAP_BATCHES_BATCH_SIZE which we don't want users to have to set.

The performance should be within some % of the read_parquet/read_images performance. The initial "url" dataset shouldn't add that much overhead. Let's get this skip_training.s3_url number to something within ~10% of the other variants. We can also split up this PR to add the heterogeneous cluster setups first.
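
A minimal sketch of what that could look like, assuming the with_column/download call shape shown in the linked API pages (it may differ across Ray versions; decode_and_transform stands in for the existing preprocessing):

import ray
from ray.data.expressions import download

# Build the URL dataset as before, then let the download expression handle async fetching.
ds = ray.data.from_items(records)               # records: [{"path": "s3://...", "class": ...}, ...]
ds = ds.with_column("bytes", download("path"))  # call shape assumed; see the linked API docs
ds = ds.map_batches(decode_and_transform)       # decode + transform the downloaded bytes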


@xinyuangui2 (Contributor, author) replied:

> Can we use this download expression that Ray Data has instead of implementing our own downloading? This one should handle async fetching and streaming. cc @goutamvenkat-anyscale
>
> https://docs.ray.io/en/master/data/transforming-data.html#expressions-alpha https://docs.ray.io/en/latest/data/api/doc/ray.data.expressions.download.html#ray.data.expressions.download
>
> This should also remove the need for the hardcoded DEFAULT_MAP_BATCHES_BATCH_SIZE which we don't want users to have to set.
>
> The performance should be within some % of the read_parquet/read_images performance. The initial "url" dataset shouldn't add that much overhead. Let's get this skip_training.s3_url number to something within ~10% of the other variants. We can also split up this PR to add the heterogeneous cluster setups first.

Ok I will split this PR into 2.

Add a new data loading approach that:
1. Lists JPEG files from S3 using boto3
2. Creates a Ray dataset from the file records
3. Uses map_batches to download and process images from S3

NOTE: This implementation downloads images sequentially within each
map_batches call. While concurrent downloads could improve throughput,
this risks spawning too many threads when combined with Ray's parallelism.
For production workloads, consider using Ray Data's native S3 reading.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
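
For comparison, the native reading path mentioned in the note is roughly the following sketch, pointed at the same bucket as above:

import ray

# Ray Data's built-in image reader handles listing, parallel fetching, and decoding.
ds = ray.data.read_images("s3://anyscale-imagenet/ILSVRC/Data/CLS-LOC", mode="RGB")
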
@xinyuangui2 force-pushed the add-autoscaling-release branch from defc55e to 5695dc7 on January 23, 2026 20:25
xinyuangui2 and others added 2 commits January 23, 2026 20:45
@cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@xinyuangui2 changed the title from "[Train][Data] Add heterogeneous cluster and S3 URL data loading benchmarks for training ingest" to "[Train][Data] Add S3 URL data loading benchmarks for training ingest" Jan 23, 2026
xinyuangui2 and others added 4 commits January 23, 2026 21:30
@xinyuangui2 (Contributor, author) commented:

@justinvyu Updated. With the download expression, throughput is much faster than the naive per-image download, but it is still slower than the read_parquet method.
