Conversation

Sparks0219
Contributor

@Sparks0219 Sparks0219 commented Oct 8, 2025

Why are these changes needed?

Adds a nightly release test that exercises a data workflow with and without RPC fault injection. The test has a GCE variant and an AWS variant.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

dayshah and others added 2 commits October 7, 2025 15:55
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
@Sparks0219 Sparks0219 added the go add ONLY when ready to merge, run all tests label Oct 8, 2025
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
@Sparks0219 Sparks0219 marked this pull request as ready for review October 10, 2025 06:44
@Sparks0219 Sparks0219 requested review from dayshah and jjyao October 10, 2025 06:44
@Sparks0219
Contributor Author

Sparks0219 commented Oct 10, 2025

I'm not sure what the difference is between "aws" and "aws_perf" in the env section of the YAML.
I also picked n2-standard-8 because it's the GCE equivalent of m5.2xlarge, but the GCE variant does seem to take a bit longer (5 min) than the AWS variant. https://buildkite.com/ray-project/release/builds/62845#0199ccab-5b19-48da-8e64-ac2575ff5889

@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core release-test release test labels Oct 10, 2025
@dayshah dayshah requested review from a team and Copilot October 10, 2025 16:23
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds nightly release tests that model a data workflow for cross-availability-zone RPC fault-tolerance testing. The tests evaluate Ray's behavior with and without RPC failure injection to ensure robustness in distributed environments.

Key changes:

  • Added two new nightly tests with GCE and AWS variants for cross-AZ RPC testing
  • Enhanced map_benchmark.py to support configurable concurrency parameters
  • Created cluster configuration files for both GCE and AWS environments
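
As a rough illustration of what running the same workload "with and without RPC failure injection" means, here is a minimal, self-contained Python sketch. It is not Ray's injection mechanism and does not come from this PR: the names `InjectedRpcError`, `with_fault_injection`, `call_with_retries`, and `run_workload` are hypothetical, and the failure model (an independent random drop per call, retried by the caller) is an assumption made only for the example.

```python
import random


class InjectedRpcError(Exception):
    """Hypothetical error used to simulate a dropped RPC."""


def with_fault_injection(fn, failure_prob, rng):
    """Wrap fn so that each call fails with probability failure_prob."""

    def wrapper(*args, **kwargs):
        # random() returns a float in [0, 1), so failure_prob=0.0 never fails
        # and failure_prob=1.0 always fails.
        if rng.random() < failure_prob:
            raise InjectedRpcError("injected failure")
        return fn(*args, **kwargs)

    return wrapper


def call_with_retries(fn, *args, max_attempts=5, **kwargs):
    """Retry fn on injected failures; re-raise after max_attempts tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except InjectedRpcError:
            if attempt == max_attempts:
                raise


def run_workload(process, items, failure_prob=0.0, seed=0):
    """Apply process to every item, optionally through the flaky wrapper."""
    rng = random.Random(seed)  # seeded so a run is reproducible
    flaky = with_fault_injection(process, failure_prob, rng)
    return [call_with_retries(flaky, item, max_attempts=50) for item in items]
```

The property such a test pair checks is that the injected run produces the same results as the baseline run, only more slowly, because retries absorb the failures.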

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| release/release_data_tests.yaml | Adds two nightly tests for cross-AZ RPC fault tolerance with and without failure injection |
| release/nightly_tests/dataset/map_benchmark.py | Adds configurable concurrency parameter and refactors map_batches logic |
| release/nightly_tests/dataset/cross_az_250_350_compute_gce.yaml | GCE cluster configuration for cross-AZ testing with 250-350 workers |
| release/nightly_tests/dataset/cross_az_250_350_compute_aws.yaml | AWS cluster configuration for cross-AZ testing with 250-350 workers |

"output of the first run as input."
),
)
parser.add_argument(
Collaborator

why changes in this file?

Contributor

larger concurrency is needed for this workload
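
For readers without the diff open, a configurable concurrency option in an argparse-based benchmark script might look roughly like the sketch below. This is an assumption-laden illustration, not the code from map_benchmark.py: the flag name `--concurrency` and the helpers `parse_concurrency`, `build_parser`, and `map_batches_kwargs` are hypothetical. The only outside fact relied on is that Ray Data's `map_batches` takes a `concurrency` argument that can be an integer or a `(min, max)` tuple.

```python
import argparse


def parse_concurrency(values):
    """Turn one or two integers into an int or a (min, max) tuple."""
    if len(values) == 1:
        return values[0]
    if len(values) == 2:
        low, high = values
        if low > high:
            raise ValueError("min concurrency must not exceed max concurrency")
        return (low, high)
    raise ValueError("expected one or two integers for --concurrency")


def build_parser():
    parser = argparse.ArgumentParser()
    # Hypothetical flag: one value for a fixed size, two for a (min, max) range.
    parser.add_argument(
        "--concurrency",
        type=int,
        nargs="+",
        default=None,
        help="Fixed concurrency (one value) or min and max (two values).",
    )
    return parser


def map_batches_kwargs(args):
    """Build the keyword arguments that would be forwarded to map_batches."""
    kwargs = {}
    if args.concurrency is not None:
        kwargs["concurrency"] = parse_concurrency(args.concurrency)
    return kwargs
```

A script built this way would call something like `ds.map_batches(fn, **map_batches_kwargs(args))`, leaving the default behavior untouched when the flag is omitted.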

@Sparks0219 Sparks0219 changed the title [core] Nightly release test for nubank [core] Nightly release test with cross AZ fault injection Oct 10, 2025
@jjyao
Collaborator

jjyao commented Oct 10, 2025

cc @alexeykudinkin and @bveeramani for a review from the data side.

Signed-off-by: joshlee <joshlee@anyscale.com>
cursor[bot]

This comment was marked as outdated.

Signed-off-by: joshlee <joshlee@anyscale.com>
Collaborator

@edoakes edoakes left a comment

Nice @Sparks0219

Next let's build some ray core cross-az tests as well :)

min_workers: 250
max_workers: 350
flags:
  enable_multi_az_serve: true
Collaborator

this one is not relevant, it's for ray serve workloads

Contributor

@dayshah dayshah Oct 10, 2025

this is actually the hack we use to spread across AZs; just setting allow won't actually spread the nodes, it'll just "allow" it

Contributor Author

^

Collaborator

interesting 🤔

Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
Labels

  • core: Issues that should be addressed in Ray Core
  • go: add ONLY when ready to merge, run all tests
  • release-test: release test

5 participants