[Core][GCS FT] Two alive head nodes will be present after the original head node recovers from a failure #39122

kevin85421 · 2023-08-30T19:05:40Z

What happened + What you expected to happen

KubeRay has several tests for GCS fault tolerance. While running them, I observed that the Ray cluster has 2 alive head Pods for ~10 seconds after the head node recovers from a failure. As a result, ray list nodes may temporarily show more than one "ALIVE" head node. This can cause the Serve controller to conclude that it has not been scheduled on the head node, at which point it raises an exception. To avoid this, I added retry logic to test_ray_serve_1 as a workaround: it waits until only 1 head node is alive.
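For readers who want a concrete picture of that kind of retry, here is a minimal sketch. It is not the code in test_ray_serve_1. The function name `wait_for_single_alive_head`, the injected `list_nodes` callable, and the `is_head_node` / `state` field names are all assumptions made for illustration, loosely modeled on what ray list nodes reports.

```python
import time
from typing import Callable, Iterable, Mapping


def wait_for_single_alive_head(
    list_nodes: Callable[[], Iterable[Mapping[str, object]]],
    timeout_s: float = 30.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until exactly one head node is ALIVE; return the number of polls.

    `list_nodes` is a hypothetical callable that returns one mapping per node.
    The keys "is_head_node" and "state" are assumed for this sketch.
    """
    deadline = time.monotonic() + timeout_s
    polls = 0
    while True:
        polls += 1
        alive_heads = [
            node
            for node in list_nodes()
            if node.get("is_head_node") and node.get("state") == "ALIVE"
        ]
        if len(alive_heads) == 1:
            return polls
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"expected 1 alive head node, found {len(alive_heads)}"
            )
        # The stale head entry is expected to disappear shortly, so back off
        # briefly and poll again.
        sleep(interval_s)
```

In a real test, `list_nodes` would wrap whatever the test already uses to query node state. Injecting it keeps the polling logic independent of any particular Ray API.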

Versions / Dependencies

2.6.3 & nightly

I believe this is not a new behavior.

Reproduction script

See ray-project/kuberay#1364 for more details.

Issue Severity

Medium: It is a significant difficulty but I can work around it.


kevin85421 · 2023-08-30T19:06:47Z

cc @iycheng @edoakes

rkooo567 · 2023-10-25T22:33:01Z

cc @vitsai has this been fixed?

vitsai · 2023-10-25T22:53:28Z

This one is not a release blocker, right?

rkooo567 · 2023-10-26T05:25:52Z

Oh I thought it was fixed by the cluster id thing?

kevin85421 mentioned this issue Aug 30, 2023

[Umbrella] GCS fault tolerance on KubeRay ray-project/kuberay#1033

Open

22 tasks

jjyao assigned jonathan-anyscale Sep 9, 2023

jonathan-anyscale mentioned this issue Oct 31, 2023

[gcs ft] Mark old head node as dead after pod restart #40838

Merged

8 tasks

rkooo567 closed this as completed in #40838 Nov 14, 2023
