[core] Basic end-2-end multi-node tests for GCS HA in CI. #25114
This PR depends on #25131
Should this use the fake autoscaler infra instead of Docker? https://docs.ray.io/en/latest/ray-contribute/fake-autoscaler.html
@ericl I checked that one (the multi-node Docker autoscaler test); this one is inspired by it. The difference is that whether a node is killed or started should be controlled by the test, not the autoscaler. For example, we need to kill nodes (the autoscaler test also does that) and also restart the head node. Also, the autoscaler requires the head node to be alive (maybe I'm wrong), and this test checks that Serve keeps working even when the head node is down. In the long term, I think we should also have another e2e test in KubeRay if it's integrated there.
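The distinction drawn here, that the test itself decides when a node dies or comes back, can be sketched as a small wrapper around the Docker CLI. This is an illustration only, not the code in this PR: `NodeController`, the container name `head`, and the injectable `run` callable are hypothetical, and the only real commands assumed are `docker kill` and `docker start`.

```python
import subprocess
from typing import Callable, List, Optional, Sequence


class NodeController:
    """Hypothetical helper: the test explicitly kills and restarts containers."""

    def __init__(self, run: Optional[Callable[[Sequence[str]], object]] = None):
        # By default shell out to the Docker CLI; a fake runner can be injected.
        self._run = run or (lambda argv: subprocess.run(argv, check=True))
        self.history: List[List[str]] = []

    def _docker(self, action: str, container: str) -> None:
        argv = ["docker", action, container]
        self.history.append(argv)
        self._run(argv)

    def kill(self, container: str) -> None:
        self._docker("kill", container)

    def start(self, container: str) -> None:
        self._docker("start", container)


# Dry run with a fake runner, so no Docker daemon is needed.
calls: List[Sequence[str]] = []
ctl = NodeController(run=calls.append)
ctl.kill("head")   # "head" is an assumed container name
ctl.start("head")
```

Because the runner is injected, the kill/restart ordering can be checked without an autoscaler or a live head node.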
scripts = """
import ray
nit: is it possible/worthwhile to put the script into a separate, properly type-checked Python file and load it here?
Also, to Eric's point: can we see if this can be merged with https://docs.ray.io/en/latest/ray-contribute/fake-autoscaler.html#using-ray-autoscaler-private-fake-multi-node-test-utils-dockercluster so we have a single way to run Docker-related tests?
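For the nit about moving the inline script into its own file, one possible shape is sketched below. It is an assumption-laden example rather than what the PR does: `load_script` and the file name `serve_check.py` are hypothetical, and the demo writes a throwaway script to a temporary directory so the snippet is self-contained.

```python
import tempfile
from pathlib import Path


def load_script(path: Path) -> str:
    """Read a standalone Python script and check that it at least compiles."""
    source = path.read_text(encoding="utf-8")
    compile(source, str(path), "exec")  # raises SyntaxError early; does not run it
    return source


# Demo: in a real repo the file would live next to the test and be linted
# and type-checked like any other module.
with tempfile.TemporaryDirectory() as tmp:
    script_path = Path(tmp) / "serve_check.py"  # hypothetical file name
    script_path.write_text(
        "import json\nprint(json.dumps({'ok': True}))\n", encoding="utf-8"
    )
    scripts = load_script(script_path)
```

The loaded string can then be passed wherever the inline `scripts = """..."""` literal was used.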
The failed tests look related: https://buildkite.com/ray-project/ray-builders-pr/builds/33416#0180fdd4-26d4-4525-b326-fd6f1b396cdc Otherwise it's good to go!
It looks like this doesn't work under bazel test because of a file-not-found error. I'm investigating it.
Why are these changes needed?
In this PR we simulate the case where Serve continues to function even when GCS is down, and reconfiguration continues to work once GCS is back.
To keep this close to the real-world case, Docker is used for isolation:
These are the basic cases for Serve HA. We'll add more once we have better integrations.
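A test of this kind usually has to poll until the service recovers after the control plane returns. The sketch below shows one generic way to do that; it is not taken from this PR. `wait_until` and `FakeService` are hypothetical names, and the fake service merely stands in for a deployment that fails a few requests while GCS is down.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout_s: float = 10.0,
    interval_s: float = 0.1,
) -> bool:
    """Poll predicate until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if predicate():
                return True
        except Exception:
            pass  # treat errors as "not ready yet" and retry
        time.sleep(interval_s)
    return False


class FakeService:
    """Stand-in for a service that fails a few calls, then recovers."""

    def __init__(self, failures: int):
        self.remaining = failures

    def healthy(self) -> bool:
        if self.remaining > 0:
            self.remaining -= 1
            raise ConnectionError("control plane unavailable")
        return True


svc = FakeService(failures=3)
recovered = wait_until(svc.healthy, timeout_s=2.0, interval_s=0.01)
```

Using a deadline instead of a fixed retry count keeps the check bounded even when each probe takes a variable amount of time.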
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.