[core] Basic end-2-end multi-node tests for GCS HA in CI. #25114

fishbone · 2022-05-24T02:10:49Z

Why are these changes needed?

In this PR we simulate the case where Serve continues to function even when GCS is down, and reconfiguration continues to work once GCS is back.

To stay close to a real-world deployment, Docker is used for isolation. The test does the following:

- Starts a head node (0 CPUs) and a worker node.
- Exercises the basic functionality and makes sure it works.
- Kills GCS and makes sure everything keeps working.
- Starts GCS again and makes sure reconfiguration continues to work.
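
The sequence above can be sketched as a small driver. This only illustrates the step order: `FakeCluster` and `run_scenario` are hypothetical names, and the in-memory stand-in replaces the real Docker containers and Ray calls used by the test. It assumes that running replicas keep serving without GCS and that reconfiguration needs GCS to be up.

```python
class FakeCluster:
    """In-memory stand-in for the head and worker containers (hypothetical)."""

    def __init__(self):
        self.gcs_alive = False
        self.replicas = 0

    def start_head_and_worker(self):
        self.gcs_alive = True

    def deploy(self, replicas):
        # Assumption: changing the deployment config needs a live GCS.
        if not self.gcs_alive:
            raise RuntimeError("GCS is down; cannot reconfigure")
        self.replicas = replicas

    def request(self):
        # Assumption: replicas that are already running serve without GCS.
        if self.replicas == 0:
            raise RuntimeError("no replicas running")
        return "ok"

    def kill_gcs(self):
        self.gcs_alive = False

    def start_gcs(self):
        self.gcs_alive = True


def run_scenario(cluster):
    """Run the four steps in order and record what each check observed."""
    log = []
    cluster.start_head_and_worker()
    cluster.deploy(replicas=1)
    log.append(("baseline", cluster.request()))
    cluster.kill_gcs()
    log.append(("gcs_down", cluster.request()))
    cluster.start_gcs()
    cluster.deploy(replicas=2)  # reconfigure after GCS is back
    log.append(("gcs_restored", cluster.request()))
    return log
```

Calling `run_scenario(FakeCluster())` returns one entry per checkpoint, which keeps the assertion for each phase separate from the step that drives it.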

These are the basic cases for Serve HA. We'll add more once we get better integrations.

Related issue number

Checks

- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

fishbone · 2022-05-25T05:20:39Z

This PR depends on #25131

ericl

Should this be using the fake autoscaler infra instead of docker https://docs.ray.io/en/latest/ray-contribute/fake-autoscaler.html ?

fishbone · 2022-05-25T06:44:43Z

@ericl I looked at that one (the multi-node Docker autoscaler test); this test is inspired by it.

The point here is that the test, not the autoscaler, should control when a node is killed or started. For example, we need to kill the head node (the autoscaler test can do that too) and also restart it.

Another point is that the autoscaler requires the head node to be alive (maybe I'm wrong), whereas this test checks that Serve keeps working even when the head node is down.

In the long term, I think we should also have another e2e test in KubeRay if it's integrated there.
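
To illustrate what test-controlled node lifecycle could look like, here is a minimal sketch that only builds Docker CLI argument lists and does not run them. The container names, image tag, and network name are hypothetical, and it assumes the head node is reachable by container name on port 6379. It is not the code from this PR.

```python
HEAD = "gcs-ha-head"      # hypothetical container name
WORKER = "gcs-ha-worker"  # hypothetical container name
IMAGE = "ray-test-image"  # hypothetical image tag
NETWORK = "gcs-ha-net"    # hypothetical Docker network


def run_node_cmd(name, ray_args):
    """Build the `docker run` argv that starts a detached Ray node container."""
    return ["docker", "run", "-d", "--name", name, "--network", NETWORK,
            IMAGE, "ray", "start", *ray_args, "--block"]


def head_cmd():
    # 0 CPUs on the head node, as in the PR description.
    return run_node_cmd(HEAD, ["--head", "--num-cpus=0"])


def worker_cmd():
    # Assumption: the head is reachable by container name on port 6379.
    return run_node_cmd(WORKER, [f"--address={HEAD}:6379"])


def kill_cmd(name):
    """The test decides when a node dies."""
    return ["docker", "kill", name]


def restart_cmd(name):
    """`docker start` brings a killed container back up."""
    return ["docker", "start", name]
```

A test could pass each list to `subprocess.run(cmd, check=True)` at the point where it wants the node to start, die, or come back. Any extra setup the real test needs is omitted here.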

scv119 · 2022-05-26T01:02:14Z

python/ray/tests/test_gcs_ha_e2e.py

+
+
+scripts = """
+import ray


nit: is it possible (and worthwhile) to put the script into a separate, properly type-checked Python file and load it here?
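
One way this suggestion could look, as a minimal sketch: keep the script in its own `.py` file so linters and type checkers can see it, and have the test read it back as a string. `load_script` is a hypothetical helper, not something in this PR, and the file location is left to the caller.

```python
import ast
from pathlib import Path


def load_script(path):
    """Read a Python script from disk and fail early if it does not parse."""
    source = Path(path).read_text()
    # Raises SyntaxError before the script is ever shipped to a container.
    ast.parse(source, filename=str(path))
    return source
```

The returned string can then be used wherever the inline `scripts = """..."""` literal was used. The test runner has to make the file available next to the test for this to work.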

scv119 · 2022-05-26T01:08:00Z

Also, to Eric's point: let's see if we can merge this with https://docs.ray.io/en/latest/ray-contribute/fake-autoscaler.html#using-ray-autoscaler-private-fake-multi-node-test-utils-dockercluster so we have a single way to run Docker-related tests.

fishbone · 2022-05-26T04:16:11Z

@scv119 I don't think that would work. You can refer to this for my response.

For this test, we need control over removing and adding nodes. There, that is controlled by the autoscaler.

Let me know if you have any new thoughts on this.

scv119 · 2022-05-26T05:04:28Z

The failed tests look related: https://buildkite.com/ray-project/ray-builders-pr/builds/33416#0180fdd4-26d4-4525-b326-fd6f1b396cdc

Otherwise it's good to go!

fishbone · 2022-06-01T01:09:51Z

It looks like the test doesn't work under bazel test because of a file-not-found error. I'm investigating.

fishbone added 3 commits May 24, 2022 05:33

fix

4437be1

fix

17a0eff

up

9a0d625

fishbone force-pushed the gcs-ha-e2e branch from b51265a to 9a0d625 May 24, 2022 21:11

fishbone added 10 commits May 24, 2022 21:11

add e2e

4133281

fix

68868da

fix

1bae1bf

Merge remote-tracking branch 'upstream/master' into gcs-ha-e2e

65dd9e6

make test ready

55098e3

update

5b64636

fix comment

c996b6b

Merge branch 'actor-migrates' into gcs-ha-e2e

57f93c9

up

fb2d95c

up

94bddee

fishbone marked this pull request as ready for review May 25, 2022 05:14

fishbone changed the title ~~Gcs ha e2e~~ [core] Basic end-2-end multi-node tests for GCS HA. May 25, 2022

add docker build

6dd430a

fishbone assigned ericl, scv119 and mwtian May 25, 2022

fishbone changed the title ~~[core] Basic end-2-end multi-node tests for GCS HA.~~ [core] Basic end-2-end multi-node tests for GCS HA in CI. May 25, 2022

ericl reviewed May 25, 2022

fishbone added 3 commits May 26, 2022 00:14

fix

9e3688e

Merge remote-tracking branch 'upstream/master' into gcs-ha-e2e

e45cd4e

lint

215f157

scv119 reviewed May 26, 2022

mwtian reviewed May 26, 2022

fishbone added the @author-action-required label May 26, 2022

fishbone added 4 commits May 26, 2022 23:32

Merge remote-tracking branch 'upstream/master' into gcs-ha-e2e

1355094

fix comment

a3cde55

format

4541444

Merge remote-tracking branch 'upstream/master' into gcs-ha-e2e

1cef3d5

fishbone added 2 commits June 1, 2022 02:26

fix

e84c0b9

Merge remote-tracking branch 'upstream/master' into gcs-ha-e2e

4a148fc

fishbone removed the @author-action-required label Jun 1, 2022

ericl removed their assignment Jun 1, 2022

scv119 approved these changes Jun 2, 2022

fishbone merged commit cb1f08a into ray-project:master Jun 2, 2022

fishbone deleted the gcs-ha-e2e branch June 2, 2022 02:41
