
[Bug][GCS FT] Worker pods crash unexpectedly when gcs_server on head pod is killed #1036

Merged (8 commits into ray-project:master) on Apr 27, 2023

Conversation

@kevin85421 (Member) commented Apr 18, 2023

Why are these changes needed?

Major changes

When we kill the GCS server process (pkill gcs_server) on the head Pod, the Raylet will exit the process if it cannot reconnect to the GCS server within RAY_gcs_rpc_server_reconnect_timeout_s (code). By default, the value is 60s.

In #634, the RAY_gcs_rpc_server_reconnect_timeout_s values for both head and workers are set to 60 seconds. This means that the head and workers will crash almost simultaneously, 60 seconds after the GCS server process is terminated. Therefore, we need to ensure that the RAY_gcs_rpc_server_reconnect_timeout_s for the workers is greater than the time it takes for a new GCS server process to become available after the old one is terminated. That is,

WORKER_RECONNECTION_TIMEOUT > HEAD_RECONNECTION_TIMEOUT + (time to restart the head Pod) + (time for the new GCS server process to become ready)

This PR injects the RAY_gcs_rpc_server_reconnect_timeout_s environment variable into worker Pods with a default value of 600s when GCS FT is enabled. The head will still crash 60 seconds after the GCS server process is terminated, but the new GCS server process will become ready well within 600 seconds, so the workers will not crash.
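The core of the change, then, is a guarded append of one environment variable. Below is a minimal sketch of that idea in Go, assuming a hypothetical helper injectWorkerGcsReconnectTimeout and reusing the envVarExists pattern discussed later in this review; it is illustrative, not the actual pod.go code:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

const (
	// Names taken from this PR: the env var Ray reads and KubeRay's default for workers.
	rayGcsReconnectTimeoutEnvName        = "RAY_gcs_rpc_server_reconnect_timeout_s"
	defaultWorkerRayGcsReconnectTimeoutS = "600"
)

// envVarExists reports whether an env var with the given name is already set.
// Ranging over a nil or empty slice is safe and never enters the loop body.
func envVarExists(name string, envVars []corev1.EnvVar) bool {
	for _, env := range envVars {
		if env.Name == name {
			return true
		}
	}
	return false
}

// injectWorkerGcsReconnectTimeout (hypothetical helper) appends the 600s default
// to a worker container unless the user has already configured the variable.
func injectWorkerGcsReconnectTimeout(container *corev1.Container) {
	if !envVarExists(rayGcsReconnectTimeoutEnvName, container.Env) {
		container.Env = append(container.Env, corev1.EnvVar{
			Name:  rayGcsReconnectTimeoutEnvName,
			Value: defaultWorkerRayGcsReconnectTimeoutS,
		})
	}
}

func main() {
	worker := corev1.Container{Name: "ray-worker"}
	injectWorkerGcsReconnectTimeout(&worker)
	fmt.Println(worker.Env) // prints the injected RAY_gcs_rpc_server_reconnect_timeout_s=600 entry
}

Because the append is guarded by envVarExists, a value supplied by the user in the RayCluster spec always takes precedence over the 600s default.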

Minor changes

Related issue number

Closes #634

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
# Build the KubeRay image (controller:latest)
helm install kuberay-operator kuberay/kuberay-operator --version 0.5.0 --set image.repository=controller,image.tag=latest

# Create a RayCluster with GCS FT
kubectl apply -f ray-cluster.external-redis.yaml

# This env var should not be set on the head Pod by default
kubectl describe pod $YOUR_HEAD_POD | grep RAY_gcs_rpc_server_reconnect_timeout_s

# The env var should be injected into worker Pods by default.
kubectl describe pod $YOUR_WORKER_POD | grep RAY_gcs_rpc_server_reconnect_timeout_s
# RAY_gcs_rpc_server_reconnect_timeout_s:  600

# Kill the GCS server process on the head Pod
kubectl exec -it $YOUR_HEAD_POD -- pkill gcs_server

# Expected behavior: head Pod will crash after 60 seconds, and workers will not crash.
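To see why the head and workers behave so differently after pkill gcs_server, it helps to model the reconnect timeout as a retry deadline: the Raylet keeps probing GCS and exits the process only when the deadline passes. The Go sketch below is a scaled-down conceptual model (Ray's real logic is implemented in its C++ core; waitForGCS and probe are stand-ins), not the actual implementation:

package main

import (
	"fmt"
	"time"
)

// waitForGCS retries probe until it succeeds or the reconnect timeout expires.
// It models RAY_gcs_rpc_server_reconnect_timeout_s: a 60s head timeout gives up
// before a ~120s GCS recovery, while a 600s worker timeout comfortably outlasts it.
func waitForGCS(probe func() bool, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if probe() {
			return true // reconnected: the node keeps running
		}
		time.Sleep(100 * time.Millisecond)
	}
	return false // timed out: the Raylet would exit the process here
}

func main() {
	// Pretend the new GCS server becomes reachable 1 second from now
	// (scaled down from ~120s so the demo finishes quickly).
	gcsReadyAt := time.Now().Add(1 * time.Second)
	probe := func() bool { return time.Now().After(gcsReadyAt) }

	fmt.Println("head (0.5s timeout):", waitForGCS(probe, 500*time.Millisecond)) // false -> crashes
	fmt.Println("worker (5s timeout):", waitForGCS(probe, 5*time.Second))        // true  -> survives
}

With a 60s deadline, the head gives up long before the replacement GCS server (typically around 120s away) is reachable; with a 600s deadline, the workers simply keep retrying until it is.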

[Screenshot: Screen Shot 2023-04-19 at 1 15 19 PM]

[Screenshot: Screen Shot 2023-04-19 at 1 23 27 PM]

@kevin85421 kevin85421 changed the title [WIP] [Bug] Worker pods crash unexpectedly when gcs_server on head pod is killed Apr 19, 2023
@kevin85421 kevin85421 marked this pull request as ready for review April 19, 2023 21:37
@kevin85421 kevin85421 changed the title [Bug] Worker pods crash unexpectedly when gcs_server on head pod is killed [Bug][GCS FT] Worker pods crash unexpectedly when gcs_server on head pod is killed Apr 19, 2023
@architkulkarni architkulkarni requested a review from fishbone April 19, 2023 22:32
@architkulkarni (Contributor)

@iycheng do you mind just checking the PR description and confirming that the solution makes sense from the Ray side?

@architkulkarni (Contributor) left a comment:

Looks good to me

@@ -88,7 +89,8 @@ const (
 	RAYCLUSTER_DEFAULT_REQUEUE_SECONDS = 300
 
 	// Ray core default configurations
-	DefaultRedisPassword = "5241590000000000"
+	DefaultRedisPassword                 = "5241590000000000"
+	DefaultWorkerRayGcsReconnectTimeoutS = "600"
Contributor:

Here, or somewhere else in the code, can we add a brief comment explaining why we picked this value? It's explained well in the PR description but it's helpful to have it in a code comment

Member Author:

Added comment in d0991ff

@@ -92,6 +92,8 @@ spec:
   # RAY_REDIS_ADDRESS can force ray to use external redis
   - name: RAY_REDIS_ADDRESS
     value: redis:6379
+  - name: RAY_gcs_rpc_server_reconnect_timeout_s
+    value: "20"
Contributor:

Do we intend to check this value somewhere? I didn't see it in the test

Member Author:

This is used to accelerate the test. Some tests in compatibility-test.py will kill the GCS server process on the head Pod and wait until the new one is available. By default, the head will crash after 60 seconds. With this configuration, it will crash after 20 seconds.

Contributor:

Oh right. This is another place where a code comment would be helpful

Member Author:

Added comment in d0991ff

if len(envVars) == 0 {
	return false
}

for _, env := range envVars {
Contributor:

Doesn't this break if the envVars slice is nil? Previously, len(envVars) == 0 protected against that, if I'm not wrong.

Member Author:

If the slice is nil or empty, the range loop body is never entered, so the explicit length check isn't needed.

package main

import "fmt"

// envVarExists reports whether fruitName is present in fruits.
// Ranging over a nil or empty slice never enters the loop body.
func envVarExists(fruitName string, fruits []string) bool {
	for _, fruit := range fruits {
		fmt.Println("Enter the loop")
		if fruitName == fruit {
			return true
		}
	}
	return false
}

func main() {
	fruits := []string{"apple", "banana", "cherry", "date"}
	fmt.Println(envVarExists("banana", fruits))     // true
	fmt.Println(envVarExists("aaaaaa", fruits))     // false
	fmt.Println(envVarExists("aaaaaa", []string{})) // false
}

# STDOUT
Enter the loop
Enter the loop
true
Enter the loop
Enter the loop
Enter the loop
Enter the loop
false
false

Member Author:

You can try this on https://go.dev/play/

@kevin85421 (Member Author)

@architkulkarni would you mind taking a look at this CI failure (https://github.com/ray-project/kuberay/actions/runs/4758810193/jobs/8457448052?pr=1036)? This seems to be the second time I've seen this error message. Thanks!

@kevin85421 (Member Author)

@wilsonwang371 @iycheng would you mind taking a look at this PR? You only need to review pod.go and pod_test.go; it should take about 10 minutes. Thanks!

// RAY_GCS_RPC_SERVER_RECONNECT_TIMEOUT_S to 600s. If the worker cannot reconnect to GCS within
// 600s, the Raylet will exit the process. By default, the value is 60s, so the head node will
// crash if the GCS server is down for more than 60s. Typically, the new GCS server will be available
// in 120 seconds, so we set the timeout to 600s to avoid the worker nodes crashing.
Collaborator:

Why do we need to set it to 600? Would 120 or a similar value work?

Member Author:

120 seconds works in my local Kind cluster, but it may not work in a production Kubernetes cluster. Users can easily change the value by setting the environment variable themselves. In addition, Pods on public clouds can experience network disconnections lasting several minutes. That's why I decided to select a higher timeout.

Member Author:

Gentle ping @wilsonwang371

Does this make sense to you?

@kevin85421 (Member Author)

Merging this PR because it blocks #1055. Feel free to add more comments on this PR, and I will open another PR to address them.

@kevin85421 kevin85421 merged commit 2019b4b into ray-project:master Apr 27, 2023
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…pod is killed (ray-project#1036)

Worker pods crash unexpectedly when gcs_server on head pod is killed