[RayService][Bug] Serve Service May Select Pods That Are Actually Unready for Serving Traffic #1856
Why are these changes needed?
In short, when using RayService, the Serve Service may select the head Pod even if the Serve app has not been submitted or is not running. This PR fixes that.
Detailed Reason:
In order to support "RayCluster with Serve", the serving logic currently works as follows:
- The Serve Service selects Pods with the label ray.io/serve=true.
- The RayCluster controller sets the head Pod's label to ray.io/serve=true regardless.
- The RayService controller checks the health of the HTTP proxy on the head Pod and updates the label ray.io/serve accordingly. As for worker Pods, their traffic readiness is determined by the readiness probe. See PR 1808 for more details.

As shown above, the RayCluster controller sets the label ray.io/serve=true regardless, in order to support "RayCluster with Serve". This causes the Serve Service to select the head Pod even if there is no running Serve app or the HTTP proxy is not healthy. The problem is only corrected once the RayService controller detects it and changes the label to ray.io/serve=false.
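The selection behavior described above can be sketched with a small model of equality-based label matching. This is only an illustration, not KubeRay code: the label key ray.io/serve and its values come from this PR, while the ray.io/cluster label value, the helper names, and the simplified reconcile step are assumptions made for the example. It also ignores Pod readiness, which additionally affects real Service endpoints.

```python
# Illustrative model only; not the KubeRay implementation.
SERVE_LABEL = "ray.io/serve"


def selects(selector: dict, pod_labels: dict) -> bool:
    """Equality-based matching: every selector pair must be present on the Pod."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


def new_head_pod_labels(initial_serve_value: str) -> dict:
    """Hypothetical helper: labels on a freshly created head Pod."""
    return {"ray.io/cluster": "demo", SERVE_LABEL: initial_serve_value}


def reconcile_serve_label(pod_labels: dict, proxy_healthy: bool) -> dict:
    """Hypothetical stand-in for the RayService controller's health-based update."""
    updated = dict(pod_labels)
    updated[SERVE_LABEL] = "true" if proxy_healthy else "false"
    return updated


# Selector assumed for the Serve Service managed by RayService.
serve_service_selector = {"ray.io/cluster": "demo", SERVE_LABEL: "true"}

# Head Pod created with the label already set to "true": it is selected
# immediately, before any Serve app has been submitted.
selected_at_creation_true = selects(serve_service_selector, new_head_pod_labels("true"))

# Head Pod created with the label set to "false": it is not selected
# until the controller observes a healthy HTTP proxy and flips the label.
head_labels = new_head_pod_labels("false")
selected_at_creation_false = selects(serve_service_selector, head_labels)
selected_after_healthy = selects(
    serve_service_selector, reconcile_serve_label(head_labels, proxy_healthy=True)
)

print(selected_at_creation_true, selected_at_creation_false, selected_after_healthy)
```

In this model the only difference between the two cases is the initial label value: starting from "false" leaves the decision to select the head Pod to the health-based update.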
Fix:
This PR sets ray.io/serve=false on the head Pod when it is created, which avoids the problem.

For people interested in "RayCluster with Serve":
This PR removes the ray.io/serve=true selection label from the Serve Service created by "RayCluster with Serve". This does not harm its functionality. As described in #1672, that setup uses only a RayCluster for serving. However, as mentioned above (see also PR 1808 for a better understanding), it is the RayService controller that checks the health of the HTTP proxy on the head Pod. Therefore, "RayCluster with Serve" does not actually offer high availability: even if the HTTP proxy in the head Pod is unhealthy, the label ray.io/serve will not change to false, so there is no need for this selection label in this case.

Related issue number
Checks