Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Basic end-2-end multi-node tests for GCS HA in CI. #25114

Merged
merged 23 commits into from
Jun 2, 2022

Conversation

fishbone
Copy link
Contributor

@fishbone fishbone commented May 24, 2022

Why are these changes needed?

In this PR we simulate the case where serve can continue to function even when GCS is down and the reconfig continue to work once GCS is back.

To make it close to the real-world case, the docker is used for isolation:

  • It starts a head node (0 cpus) and a worker node
  • It tried the basic function and make sure it's working
  • It kills GCS and make sure everything is working.
  • It starts GCS and make sure reconfig continues to work.

This is the basic cases for serve HA. We'll add more once we get better integrations.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@fishbone fishbone marked this pull request as ready for review May 25, 2022 05:14
@fishbone fishbone changed the title Gcs ha e2e [core] Basic end-2-end multi-node tests for GCS HA. May 25, 2022
@fishbone
Copy link
Contributor Author

This PR depends on #25131

@fishbone fishbone changed the title [core] Basic end-2-end multi-node tests for GCS HA. [core] Basic end-2-end multi-node tests for GCS HA in CI. May 25, 2022
Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be using the fake autoscaler infra instead of docker https://docs.ray.io/en/latest/ray-contribute/fake-autoscaler.html ?

@fishbone
Copy link
Contributor Author

fishbone commented May 25, 2022

@ericl I checked that one, multi-node test docker autoscaler. This one is inspired from that one.

The thing here is whether the node should be killed or started should be controlled by the test, not the autoscaler. For example, we need to kill (autoscaler one also has that) and also restart the head node.

Another thing is that, autoscaler requires the head node to be alive (maybe I'm wrong) and this is to test serve is working even the head node is down.

I think in the long term, we should also have another e2e test in kuberay if it's integrated there.



scripts = """
import ray
Copy link
Contributor

@scv119 scv119 May 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is it possible/worth to put the script into a separate proper type checked python file and load it into here?

@scv119
Copy link
Contributor

scv119 commented May 26, 2022

Also to eric's point; we see if can merge this with https://docs.ray.io/en/latest/ray-contribute/fake-autoscaler.html#using-ray-autoscaler-private-fake-multi-node-test-utils-dockercluster so we have a single way to run docker related tests.

@fishbone
Copy link
Contributor Author

@scv119 I don't think it's working. You can refer to this for my response.

For this test, we need the control of remove/add nodes. There it's controlled by the autoscaler.

Let me know if you have any new thoughts of this.

@scv119
Copy link
Contributor

scv119 commented May 26, 2022

The failed tests looks related. https://buildkite.com/ray-project/ray-builders-pr/builds/33416#0180fdd4-26d4-4525-b326-fd6f1b396cdc

Otherwise it's good to go!

@fishbone fishbone added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label May 26, 2022
@fishbone
Copy link
Contributor Author

fishbone commented Jun 1, 2022

It looks like when using bazel test it won't work due to file not found error. I'm investigating it.

@fishbone fishbone removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jun 1, 2022
@ericl ericl removed their assignment Jun 1, 2022
@fishbone fishbone merged commit cb1f08a into ray-project:master Jun 2, 2022
@fishbone fishbone deleted the gcs-ha-e2e branch June 2, 2022 02:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants