Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RayService][Bug] Serve Service May Select Pods That Are Actually Unready for Serving Traffic #1856

Merged
merged 6 commits into from
Jan 25, 2024

Conversation

Yicheng-Lu-llll
Copy link
Contributor

@Yicheng-Lu-llll Yicheng-Lu-llll commented Jan 20, 2024

Why are these changes needed?

In short, when using Rayservice, the Serve Service may select the head Pod even if the serve app has not been submitted or is not running. This PR fixes it.

Detailed Reason:

In order to support "Raycluster with serve”, the Serving logic is now below:

  1. Serve Service is created and selects only the Pods that have the label ray.io/serve=true.
  2. The Raycluster controller creates Ray Pods with the label ray.io/serve=true regardless.
  3. The Rayservice controller continuously checks if head Pod has a healthy HTTP proxy to serve traffic and manage ray.io/serve accordingly. As for worker Pods, their traffic readiness is determined by the readiness probe. See PR 1808 for more details.

As shown above , the Raycluster controller sets the label ray.io/serve=true regardless in order to support "Raycluster with serve". This causes the Serve Service to select the head Pod even if there is no running serve app or the HTTP proxy is not healthy. It will only get fixed after the Rayservice controller detects and changes the label to ray.io/serve=false.

Fix:

This PR set ray.io/serve=false for head Pod when creating to avoid the problem.

For people interested in "Raycluster with serve":

This PR removes the ray.io/serve=true selection label for the Serve Service created by 'ray cluster with serve'. This does not harm its functionality. As described in #1672, it exclusively uses Raycluster for serving. However, as mentioned above (see also PR 1808 for a better understanding), it is the Rayservice controller that checks the HTTP Proxy's health for Head Pod. Therefore, the 'Raycluster with serve' does not actually offer high availability. Even if an HTTP Proxy is unhealthy in the head Pod, the label ray.io/serve will not change to false, so, there is no need to have this selection label in this case.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>
@kevin85421 kevin85421 self-requested a review January 22, 2024 17:37
@kevin85421 kevin85421 self-assigned this Jan 22, 2024
// TODO (kevin85421): In the current RayService implementation, we only add this label to a Pod after
// it passes the health check. The other option is to use the readiness probe to control it. This
// logic always add the label to the Pod no matter whether it is ready or not.
podTemplateSpec.Labels[utils.RayClusterServingServiceLabelKey] = utils.EnableRayClusterServingServiceTrue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discuss offline: Removing the label would result in the worker Pods lacking this label, preventing them from attaching to the K8s service managed by the RayService.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pointing out! I have correct it in 6de9391. Below is the test:

kubectl apply -f /home/ubuntu/kuberay/ray-operator/config/samples/ray_v1alpha1_rayservice.yaml
kubectl describe rayservice | grep "Service Status"
# Service Status:  Running
kubectl describe svc rayservice-sample-serve-svc
# Endpoints:         10.244.0.6:8000,10.244.0.7:8000
kubectl describe $(kubectl get pods -o=name | grep worker) | grep "ray.io/serve"
# ray.io/serve=true
kubectl describe $(kubectl get pods -o=name | grep head) | grep "ray.io/serve"
# ray.io/serve=true

Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>
Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>
Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>
@kevin85421 kevin85421 merged commit 2e8f532 into ray-project:master Jan 25, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants