[Core][GCS FT] Two alive head nodes will be present after the original head node recovers from a failure #39122

Closed
Tracked by #1033
kevin85421 opened this issue Aug 30, 2023 · 4 comments · Fixed by #40838
Assignees
Labels
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
core-gcs: Ray core global control storage.
P1: Issue that should be fixed within a few weeks
serve: Ray Serve Related Issue

Comments

@kevin85421
Member

kevin85421 commented Aug 30, 2023

What happened + What you expected to happen

KubeRay has several tests for GCS fault tolerance. While running them, I observed that after the head node recovers from a failure, the Ray cluster has 2 alive head Pods for ~10 seconds. As a result, ray list nodes may temporarily show more than one "ALIVE" head node. This can lead the Serve controller to believe it has not been scheduled on the head node, and it then raises an exception. As a workaround, I added retry logic to test_ray_serve_1 that waits until only one head node is alive.
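
For illustration, the retry workaround could look roughly like the sketch below. This is not the actual code in test_ray_serve_1: the helper names `count_alive_heads` and `wait_for_single_alive_head` are hypothetical, and it assumes each node record exposes `is_head_node` and `state` fields similar to what ray list nodes reports. The node source is passed in as a callable so the sketch does not depend on any particular Ray API.

```python
import time
from typing import Callable, Iterable, Mapping


def count_alive_heads(nodes: Iterable[Mapping[str, object]]) -> int:
    """Count node records that are head nodes and currently ALIVE.

    Assumes (hypothetically) that each record has `is_head_node` and `state` keys.
    """
    return sum(
        1 for node in nodes
        if node.get("is_head_node") and node.get("state") == "ALIVE"
    )


def wait_for_single_alive_head(
    fetch_nodes: Callable[[], Iterable[Mapping[str, object]]],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `fetch_nodes` until exactly one ALIVE head node is reported.

    Returns True once the condition holds, or False if `timeout_s` elapses first.
    """
    deadline = clock() + timeout_s
    while True:
        if count_alive_heads(fetch_nodes()) == 1:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

In a real test, `fetch_nodes` would wrap whatever returns the current node list for the cluster, and the timeout would need to comfortably exceed the ~10 second window described above.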

[Screenshot: Screen Shot 2023-08-30 at 11 57 17 AM]

Versions / Dependencies

2.6.3 & nightly

I believe this is not a new behavior.

Reproduction script

See ray-project/kuberay#1364 for more details.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@kevin85421 added the bug, triage, serve, core, core-gcs, and P1 labels and removed the triage label on Aug 30, 2023
@kevin85421
Member Author

cc @iycheng @edoakes

@rkooo567
Contributor

cc @vitsai has this been fixed?

@vitsai
Contributor

vitsai commented Oct 25, 2023

This one is not a release blocker, right?

@rkooo567
Contributor

Oh I thought it was fixed by the cluster id thing?


4 participants