[Train][Data] Add S3 URL data loading benchmarks for training ingest #60377
Conversation
Code Review
This pull request adds new release tests for training ingest benchmarks on heterogeneous clusters, with both fixed-size and autoscaling configurations. The changes include a new test definition in release/release_tests.yaml and two corresponding cluster compute configuration files. The configurations seem correct for testing heterogeneous setups. My main feedback is on the new test definition, which has significant duplication in the script commands across its variations. I've suggested a refactoring to improve maintainability.
srinathk10
left a comment
LGTM
@justinvyu Could you please also take a pass at this?
justinvyu
left a comment
Thanks!
release/train_tests/benchmark/image_classification/s3_url/factory.py
for s3_url, wnid in zip(paths, classes):
    # Parse S3 URL: s3://bucket/key
    url_path = s3_url[5:] if s3_url.startswith("s3://") else s3_url
    bucket, key = url_path.split("/", 1)

    # Download image from S3
    response = s3_client.get_object(Bucket=bucket, Key=key)
    data = response["Body"].read()

    # Decode and transform image
    image_pil = Image.open(io.BytesIO(data)).convert("RGB")
    image_tensor = pil_to_tensor(image_pil) / 255.0
    processed_image = np.array(transform(image_tensor))
    processed_images.append(processed_image)
this may be limiting throughput if we need to download one image at a time and then preprocess sequentially. is there an easy way to pipeline the downloading and transforming? can also do as a followup if needed, I just noticed the throughput is very low
I think the low throughput is expected because downloading images inside map is inefficient in itself. I will add a note here.
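One possible answer to the pipelining question above is to keep the sequential decode/transform loop but fetch each batch's objects concurrently with a small, bounded thread pool. This is only a sketch of that idea, not code from the PR: the `transform` callable and the `path`/`class` batch columns are assumptions based on the excerpt above, and `max_workers` would need tuning so the per-task thread count stays small relative to Ray's own parallelism.

```python
import io
from concurrent.futures import ThreadPoolExecutor

import boto3
import numpy as np
from PIL import Image
from torchvision.transforms.functional import pil_to_tensor


def download_and_preprocess_batch(batch, transform, max_workers=8):
    """Sketch: concurrent S3 downloads per batch, then sequential decode/transform."""
    s3_client = boto3.client("s3")  # boto3 clients can be shared across threads

    def fetch(s3_url):
        # Parse s3://bucket/key and return the raw object bytes.
        url_path = s3_url[5:] if s3_url.startswith("s3://") else s3_url
        bucket, key = url_path.split("/", 1)
        return s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()

    # I/O-bound step: download concurrently, preserving input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        raw_images = list(pool.map(fetch, batch["path"]))

    # CPU-bound step: decode and transform sequentially, as in the original loop.
    processed_images = []
    for data in raw_images:
        image_pil = Image.open(io.BytesIO(data)).convert("RGB")
        image_tensor = pil_to_tensor(image_pil) / 255.0
        processed_images.append(np.array(transform(image_tensor)))

    return {"image": np.array(processed_images), "class": batch["class"]}
```

An alternative way to pipeline is to split the download and the transform into two separate map_batches stages so Ray Data can overlap them across tasks.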
Also, can you add some discussion about the results (are things as expected?) in the PR description?
Can we use this download expression that Ray Data has instead of implementing our own downloading? This one should handle async fetching and streaming. cc @goutamvenkat-anyscale
https://docs.ray.io/en/master/data/transforming-data.html#expressions-alpha
https://docs.ray.io/en/latest/data/api/doc/ray.data.expressions.download.html#ray.data.expressions.download
This should also remove the need for the hardcoded DEFAULT_MAP_BATCHES_BATCH_SIZE which we don't want users to have to set.
The performance should be within some % of the read_parquet/read_images performance. The initial "url" dataset shouldn't add that much overhead. Let's get this skip_training.s3_url number to something within ~10% of the other variants. We can also split up this PR to add the heterogeneous cluster setups first.
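For reference, a minimal sketch of what the suggested expression-based read might look like, based on the linked docs. The expressions API is alpha, so `with_column`, the `download()` signature, and the `uri`/`bytes` column names here are assumptions to check against the documentation, not code from this PR; the S3 path is a placeholder under the bucket used in this benchmark.

```python
import io

import numpy as np
import ray
from PIL import Image
from ray.data.expressions import download

# Dataset of S3 URLs (placeholder path under the bucket used in this PR).
ds = ray.data.from_items(
    [{"uri": "s3://anyscale-imagenet/ILSVRC/Data/CLS-LOC/<wnid>/<image>.JPEG"}]
)

# Let Ray Data handle fetching the object bytes (async/streaming per the docs),
# instead of a hand-rolled boto3 loop inside map_batches.
ds = ds.with_column("bytes", download("uri"))


def decode(row):
    # Decode the downloaded bytes; the existing preprocessing would follow here.
    row["image"] = np.asarray(Image.open(io.BytesIO(row["bytes"])).convert("RGB"))
    return row


ds = ds.map(decode)
```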
Ok, I will split this PR into 2.
Add a new data loading approach that:
1. Lists JPEG files from S3 using boto3
2. Creates a Ray dataset from the file records
3. Uses map_batches to download and process images from S3

NOTE: This implementation downloads images sequentially within each map_batches call. While concurrent downloads could improve throughput, this risks spawning too many threads when combined with Ray's parallelism. For production workloads, consider using Ray Data's native S3 reading.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
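The three steps in that commit message correspond roughly to the shape sketched below. This is a hedged illustration rather than the PR's factory.py: the bucket and prefix come from the PR description, while `download_batch`, its column names, and the batch size are placeholders.

```python
import io

import boto3
import numpy as np
import ray
from PIL import Image

BUCKET = "anyscale-imagenet"        # bucket from the PR description
PREFIX = "ILSVRC/Data/CLS-LOC/"     # prefix from the PR description

# 1. List JPEG files from S3 using boto3 (paginated listing).
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
records = [
    {"path": f"s3://{BUCKET}/{obj['Key']}"}
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX)
    for obj in page.get("Contents", [])
    if obj["Key"].endswith(".JPEG")
]

# 2. Create a Ray dataset from the file records.
ds = ray.data.from_items(records)


# 3. Use map_batches to download and decode images from the S3 URLs
#    (sequential per batch, matching the NOTE in the commit message).
def download_batch(batch):
    client = boto3.client("s3")
    images = []
    for s3_url in batch["path"]:
        bucket, key = s3_url[len("s3://"):].split("/", 1)
        data = client.get_object(Bucket=bucket, Key=key)["Body"].read()
        images.append(np.asarray(Image.open(io.BytesIO(data)).convert("RGB")))
    # Images have varying shapes before preprocessing, hence dtype=object.
    return {"image": np.array(images, dtype=object), "path": batch["path"]}


ds = ds.map_batches(download_batch, batch_size=32)
```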
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Signed-off-by: xgui <xgui@anyscale.com>
@justinvyu Updated. With the transform, the throughput is much faster than the naive download, but it is still slower than the read_parquet method.
Summary
- Add an s3_url data format that lists JPEG files from S3 and downloads images via map_batches

Changes

S3 URL Data Loading
- New s3_url image format that:
  - Lists JPEG files from s3://anyscale-imagenet/ILSVRC/Data/CLS-LOC using boto3
  - Uses map_batches to download and process images from S3 URLs

New Test Variations
- Added s3_url variations to the existing training_ingest_benchmark-task=image_classification test

Release test
Test plan
name:^training_ingest_benchmark-task=image_classification\.skip_training\.s3_url$