
Conversation

@xinyuangui2 (Contributor) commented Jan 21, 2026

Summary

  • Add new s3_url data format that lists JPEG files from S3 and downloads images via map_batches

Changes

S3 URL Data Loading

  • New s3_url image format that:
    1. Lists JPEG files from s3://anyscale-imagenet/ILSVRC/Data/CLS-LOC using boto3
    2. Creates a Ray dataset from file records (path, class)
    3. Uses map_batches to download and process images from S3 URLs
  • This separates file listing from image downloading, enabling parallel downloads on CPU workers (a rough sketch of the flow is shown below)
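
A minimal sketch of this flow, assuming the class (wnid) is the parent directory name and files use the .JPEG suffix; helper names are placeholders rather than the exact benchmark code:

import io

import boto3
import numpy as np
import ray
from PIL import Image

BUCKET = "anyscale-imagenet"
PREFIX = "ILSVRC/Data/CLS-LOC"

def list_jpeg_records():
    # Step 1: list JPEG keys with boto3 (assumption: wnid = parent directory name).
    s3 = boto3.client("s3")
    records = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.upper().endswith(".JPEG"):
                records.append({"path": f"s3://{BUCKET}/{key}", "class": key.split("/")[-2]})
    return records

def download_and_decode(batch):
    # Step 3: download and decode inside map_batches; the resize stands in for the
    # benchmark's real transform.
    s3 = boto3.client("s3")
    images = []
    for url in batch["path"]:
        bucket, key = url[len("s3://"):].split("/", 1)
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        image = Image.open(io.BytesIO(body)).convert("RGB").resize((224, 224))
        images.append(np.asarray(image))
    batch["image"] = np.stack(images)
    return batch

# Step 2: build a Ray dataset from the (path, class) records, then download on CPU workers.
ds = ray.data.from_items(list_jpeg_records())
ds = ds.map_batches(download_and_decode)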

New Test Variations

  • Added s3_url variations to existing training_ingest_benchmark-task=image_classification test

Release test

Test                     Global throughput
skip_training.parquet    9717
skip_training.s3_url     5695

Test plan

  • Run name:^training_ingest_benchmark-task=image_classification\.skip_training\.s3_url$

@gemini-code-assist bot left a comment

Code Review

This pull request adds new release tests for training ingest benchmarks on heterogeneous clusters, with both fixed-size and autoscaling configurations. The changes include a new test definition in release/release_tests.yaml and two corresponding cluster compute configuration files. The configurations seem correct for testing heterogeneous setups. My main feedback is on the new test definition, which has significant duplication in the script commands across its variations. I've suggested a refactoring to improve maintainability.

@xinyuangui2 changed the title from "add autoscaling and heterogenous" to "add autoscaling and heterogenous + s3 url data factory" Jan 21, 2026
@xinyuangui2 changed the title from "add autoscaling and heterogenous + s3 url data factory" to "[Train][Data] Add heterogeneous cluster and S3 URL data loading benchmarks for training ingest" Jan 21, 2026
@xinyuangui2 added the go label (add ONLY when ready to merge, run all tests) Jan 22, 2026
@xinyuangui2 marked this pull request as ready for review January 22, 2026 18:10
@ray-gardener bot added the train (Ray Train Related Issue), data (Ray Data-related issues), and release-test (release test) labels Jan 22, 2026
@srinathk10 (Contributor) left a comment

LGTM

@srinathk10 (Contributor) commented:

@justinvyu Could you please also take a pass at this?

@srinathk10 requested a review from justinvyu January 22, 2026 21:25
@justinvyu (Contributor) left a comment

Thanks!

Comment on lines 197 to 222
for s3_url, wnid in zip(paths, classes):
    # Parse S3 URL: s3://bucket/key
    url_path = s3_url[5:] if s3_url.startswith("s3://") else s3_url
    bucket, key = url_path.split("/", 1)

    # Download image from S3
    response = s3_client.get_object(Bucket=bucket, Key=key)
    data = response["Body"].read()

    # Decode and transform image
    image_pil = Image.open(io.BytesIO(data)).convert("RGB")
    image_tensor = pil_to_tensor(image_pil) / 255.0
    processed_image = np.array(transform(image_tensor))
    processed_images.append(processed_image)
A contributor commented on this snippet:

this may be limiting throughput if we need to download one image at a time and then preprocess sequentially. is there an easy way to pipeline the downloading and transforming? can also do as a followup if needed, I just noticed the throughput is very low
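
One way to pipeline the two stages, sketched with the names from the snippet above (the thread count is a placeholder; too many threads can clash with Ray's own task parallelism):

from concurrent.futures import ThreadPoolExecutor

def fetch(s3_url):
    # I/O-bound step: parse the URL and pull the raw bytes from S3.
    bucket, key = s3_url[len("s3://"):].split("/", 1)
    return s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()

# Overlap downloads (threads, I/O-bound) with decode/transform (main thread, CPU-bound).
# max_workers=8 is an illustrative knob, not a recommended setting.
with ThreadPoolExecutor(max_workers=8) as pool:
    for data in pool.map(fetch, paths):
        image_pil = Image.open(io.BytesIO(data)).convert("RGB")
        image_tensor = pil_to_tensor(image_pil) / 255.0
        processed_images.append(np.array(transform(image_tensor)))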

The author (xinyuangui2) replied:

I think the low throughput is expected because downloading images inside map is inefficient in itself. I will add a note here.

@justinvyu (Contributor) commented:

Also, can you add some discussion about the results (are things as expected?) in the PR description?


@justinvyu (Contributor) left a comment

Can we use this download expression that Ray Data has instead of implementing our own downloading? This one should handle async fetching and streaming. cc @goutamvenkat-anyscale

https://docs.ray.io/en/master/data/transforming-data.html#expressions-alpha
https://docs.ray.io/en/latest/data/api/doc/ray.data.expressions.download.html#ray.data.expressions.download

This should also remove the need for the hardcoded DEFAULT_MAP_BATCHES_BATCH_SIZE which we don't want users to have to set.

The performance should be within some % of the read_parquet/read_images performance. The initial "url" dataset shouldn't add that much overhead. Let's get this skip_training.s3_url number to something within ~10% of the other variants. We can also split up this PR to add the heterogeneous cluster setups first.
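
A minimal sketch of what that could look like, assuming the with_column/download call shape shown in the linked API pages (it may differ across Ray versions; decode_and_transform stands in for the existing preprocessing):

import ray
from ray.data.expressions import download

# Build the URL dataset as before, then let the download expression handle async fetching.
ds = ray.data.from_items(records)               # records: [{"path": "s3://...", "class": ...}, ...]
ds = ds.with_column("bytes", download("path"))  # call shape assumed; see the linked API docs
ds = ds.map_batches(decode_and_transform)       # decode + transform the downloaded bytes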


@xinyuangui2 (Contributor, author) replied:

> Can we use this download expression that Ray Data has instead of implementing our own downloading? This one should handle async fetching and streaming. cc @goutamvenkat-anyscale
>
> https://docs.ray.io/en/master/data/transforming-data.html#expressions-alpha https://docs.ray.io/en/latest/data/api/doc/ray.data.expressions.download.html#ray.data.expressions.download
>
> This should also remove the need for the hardcoded DEFAULT_MAP_BATCHES_BATCH_SIZE which we don't want users to have to set.
>
> The performance should be within some % of the read_parquet/read_images performance. The initial "url" dataset shouldn't add that much overhead. Let's get this skip_training.s3_url number to something within ~10% of the other variants. We can also split up this PR to add the heterogeneous cluster setups first.

Ok I will split this PR into 2.

Add a new data loading approach that:
1. Lists JPEG files from S3 using boto3
2. Creates a Ray dataset from the file records
3. Uses map_batches to download and process images from S3

NOTE: This implementation downloads images sequentially within each
map_batches call. While concurrent downloads could improve throughput,
this risks spawning too many threads when combined with Ray's parallelism.
For production workloads, consider using Ray Data's native S3 reading.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
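
For comparison, the native reading path mentioned in the note is roughly the following sketch, pointed at the same bucket as above:

import ray

# Ray Data's built-in image reader handles listing, parallel fetching, and decoding.
ds = ray.data.read_images("s3://anyscale-imagenet/ILSVRC/Data/CLS-LOC", mode="RGB")
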
@xinyuangui2 force-pushed the add-autoscaling-release branch from defc55e to 5695dc7 on January 23, 2026 20:25
xinyuangui2 and others added 2 commits January 23, 2026 20:45
@cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@xinyuangui2 changed the title from "[Train][Data] Add heterogeneous cluster and S3 URL data loading benchmarks for training ingest" to "[Train][Data] Add S3 URL data loading benchmarks for training ingest" Jan 23, 2026
xinyuangui2 and others added 4 commits January 23, 2026 21:30
@xinyuangui2 (Contributor, author) commented:

@justinvyu Updated. With the download expression, throughput is much faster than the naive per-image download, but it is still slower than the read_parquet method.
