kevin85421 opened this issue on Aug 30, 2023 · 4 comments · Fixed by #40838
What happened + What you expected to happen
KubeRay has several tests for GCS fault tolerance, and I observed that two head Pods remain alive for ~10 seconds in the Ray cluster after the head node recovers from a failure. As a result, `ray list nodes` may temporarily show more than one "ALIVE" head node in the cluster. This can cause the Serve controller to believe it has not been scheduled on the head node, and it raises an exception. To work around this, I added retry logic in `test_ray_serve_1` to wait until only one head node is alive.
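For reference, the workaround is essentially a polling loop of the following shape. This is a minimal sketch rather than the actual test code; it assumes the `ray.util.state.list_nodes` API and its `state` / `is_head_node` fields (the module path differs slightly across Ray versions), and the helper name is hypothetical.

```python
import time

from ray.util.state import list_nodes  # ray.experimental.state.api in older releases


def wait_for_single_alive_head(timeout_s: float = 60.0, interval_s: float = 2.0) -> None:
    """Block until exactly one head node is reported ALIVE, or raise on timeout."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        # Count head nodes that the GCS still reports as ALIVE; shortly after
        # recovery this can briefly include the old, dead head Pod.
        alive_heads = [
            n for n in list_nodes()
            if n.is_head_node and n.state == "ALIVE"
        ]
        if len(alive_heads) == 1:
            return
        time.sleep(interval_s)
    raise TimeoutError("More than one ALIVE head node after recovery")
```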
Versions / Dependencies
2.6.3 & nightly
I believe this is not a new behavior.
Reproduction script
See ray-project/kuberay#1364 for more details.
Issue Severity
Medium: It is a significant difficulty but I can work around it.