@xinyuangui2

Summary

  • Add heterogeneous cluster configurations with dedicated CPU worker nodes for Ray Data operations alongside GPU worker nodes for training
  • Include both fixed-size and autoscaling variants for benchmarking different scaling behaviors

Cluster Configurations

Fixed-size (heterogenous_fixed_size_gpu_4x4_aws.yaml):

  • 4x GPU workers (g4dn.12xlarge) - always running
  • 4x CPU workers (r6i.4xlarge) - always running

Autoscaling (heterogenous_autoscaling_gpu_4x4_aws.yaml):

  • 4x GPU workers (g4dn.12xlarge) - always running
  • 0-4x CPU workers (r6i.4xlarge) - scales based on demand
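The two layouts above can be sketched as plain data to make the difference between them explicit. This is a minimal illustration, not the contents of the YAML files: the field names (`name`, `instance_type`, `min_workers`, `max_workers`), the group names, and the helper functions are assumptions. Only the instance types and worker counts come from the description.

```python
# Hypothetical in-memory view of the two cluster layouts described above.
# Field and group names are assumptions made for illustration; only the
# instance types and worker counts are taken from the PR description.

def worker_group(name, instance_type, min_workers, max_workers):
    """Build one worker group entry (hypothetical schema)."""
    return {
        "name": name,
        "instance_type": instance_type,
        "min_workers": min_workers,
        "max_workers": max_workers,
    }

# Fixed-size: both groups are pinned (min == max), so all nodes always run.
FIXED_SIZE = [
    worker_group("gpu_worker", "g4dn.12xlarge", 4, 4),
    worker_group("cpu_worker", "r6i.4xlarge", 4, 4),
]

# Autoscaling: GPU workers are pinned, CPU workers may scale from 0 to 4.
AUTOSCALING = [
    worker_group("gpu_worker", "g4dn.12xlarge", 4, 4),
    worker_group("cpu_worker", "r6i.4xlarge", 0, 4),
]

def worker_count_range(groups):
    """Return (min_total, max_total) worker nodes across all groups."""
    min_total = sum(g["min_workers"] for g in groups)
    max_total = sum(g["max_workers"] for g in groups)
    return (min_total, max_total)

print(worker_count_range(FIXED_SIZE))   # (8, 8)
print(worker_count_range(AUTOSCALING))  # (4, 8)
```

In this sketch the only difference between the two variants is the lower bound on the CPU group, which is what lets the autoscaling cluster start with GPU nodes only.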

Release test

Test                                              Global throughput
skip_training.parquet                             9717
heterogenous_fixed_size.skip_training.parquet     10893
heterogenous_autoscaling.skip_training.parquet    9986
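To put the throughput numbers above in relative terms, the short calculation below computes the percentage change of each heterogeneous run against `skip_training.parquet`. Treating that test as the baseline is an assumption made here for comparison, and the helper name `percent_change` is illustrative.

```python
# Global throughput values copied from the release test results above
# (units as reported there).
THROUGHPUT = {
    "skip_training.parquet": 9717,
    "heterogenous_fixed_size.skip_training.parquet": 10893,
    "heterogenous_autoscaling.skip_training.parquet": 9986,
}

# Assumption: the existing skip_training.parquet test is the baseline.
BASELINE_TEST = "skip_training.parquet"

def percent_change(baseline, value):
    """Relative change of value with respect to baseline, in percent."""
    return (value - baseline) / baseline * 100.0

baseline = THROUGHPUT[BASELINE_TEST]
for name, value in THROUGHPUT.items():
    if name != BASELINE_TEST:
        print(f"{name}: {percent_change(baseline, value):+.1f}%")
```

With these figures, the fixed-size variant comes out roughly 12.1% above the baseline and the autoscaling variant roughly 2.8% above it.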

Add fixed-size and autoscaling configurations with CPU worker nodes
for data loading alongside GPU worker nodes for training.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@gemini-code-assist (bot) left a comment


Code Review

The pull request introduces new benchmark configurations for heterogeneous clusters, including fixed-size and autoscaling variants. These new configurations will be valuable for benchmarking Ray Data operations alongside GPU worker nodes for training. However, there is a mismatch between the number of workers requested in the benchmark script and the maximum number of workers defined in the new cluster compute configurations.
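The kind of mismatch described in this review can be caught with a small pre-flight check. The sketch below is illustrative only: the function `find_worker_mismatch`, the dictionary schema, and the requested worker counts are hypothetical, and the review does not state the actual numbers involved. The `max_workers` value of 4 for the GPU group follows the PR description.

```python
# Hypothetical pre-flight check: compare the number of workers a benchmark
# script requests against the max_workers of a worker group in the cluster
# compute configuration. Schema and names are assumptions for illustration.

GPU_GROUPS = [
    {"name": "gpu_worker", "instance_type": "g4dn.12xlarge", "max_workers": 4},
]

def find_worker_mismatch(requested_workers, groups, group_name):
    """Return an error message if the request cannot be satisfied, else None."""
    for group in groups:
        if group["name"] == group_name:
            if requested_workers > group["max_workers"]:
                return (
                    f"requested {requested_workers} workers but group "
                    f"'{group_name}' allows at most {group['max_workers']}"
                )
            return None
    return f"no worker group named '{group_name}'"

# Hypothetical requests: 4 fits within the group, 8 does not.
print(find_worker_mismatch(4, GPU_GROUPS, "gpu_worker"))  # None
print(find_worker_mismatch(8, GPU_GROUPS, "gpu_worker"))
```

Running a check like this before launching the release test would surface the problem as a clear error rather than a job that waits for nodes the cluster can never provide.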

@xinyuangui2 added the go label (add ONLY when ready to merge, run all tests) on Jan 23, 2026
@xinyuangui2 requested a review from justinvyu on January 23, 2026 21:25

@justinvyu left a comment


Thanks. These need to be updated now that #60414 has landed.

@ray-gardener (bot) added the train (Ray Train Related Issue) and release-test (release test) labels on Jan 24, 2026
xinyuangui2 and others added 2 commits January 24, 2026 09:43
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>

@cursor (bot) left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

