
[Bug][GCS FT] Worker pods crash unexpectedly when gcs_server on head pod is killed #1036

Merged (8 commits into ray-project:master) on Apr 27, 2023

Conversation

@kevin85421 (Member) commented Apr 18, 2023

Why are these changes needed?

Major changes

When we kill the GCS server process (pkill gcs_server) on the head Pod, the Raylet will exit the process if it cannot reconnect to the GCS server within RAY_gcs_rpc_server_reconnect_timeout_s (code). By default, the value is 60s.

In #634, the RAY_gcs_rpc_server_reconnect_timeout_s values for both head and workers are set to 60 seconds. This means that the head and workers will crash almost simultaneously, 60 seconds after the GCS server process is terminated. Therefore, we need to ensure that the RAY_gcs_rpc_server_reconnect_timeout_s for the workers is greater than the time it takes for a new GCS server process to become available after the old one is terminated. That is,

WORKER_RECONNECTION_TIMEOUT > HEAD_RECONNECTION_TIMEOUT + (time to restart the head Pod) + (time for the new GCS server process to become ready)

This PR injects the RAY_gcs_rpc_server_reconnect_timeout_s environment variable into worker Pods with a default value of 600s when GCS FT is enabled. The head will still crash 60 seconds after the GCS server process is terminated, but the new GCS server process will become ready well within 600 seconds, so the workers will not crash.
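The core of the change, then, is a guarded append of one environment variable. Below is a minimal sketch of that idea in Go, assuming a hypothetical helper injectWorkerGcsReconnectTimeout and reusing the envVarExists pattern discussed later in this review; it is illustrative, not the actual pod.go code:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

const (
	// Names taken from this PR: the env var Ray reads and KubeRay's default for workers.
	rayGcsReconnectTimeoutEnvName        = "RAY_gcs_rpc_server_reconnect_timeout_s"
	defaultWorkerRayGcsReconnectTimeoutS = "600"
)

// envVarExists reports whether an env var with the given name is already set.
// Ranging over a nil or empty slice is safe and never enters the loop body.
func envVarExists(name string, envVars []corev1.EnvVar) bool {
	for _, env := range envVars {
		if env.Name == name {
			return true
		}
	}
	return false
}

// injectWorkerGcsReconnectTimeout (hypothetical helper) appends the 600s default
// to a worker container unless the user has already configured the variable.
func injectWorkerGcsReconnectTimeout(container *corev1.Container) {
	if !envVarExists(rayGcsReconnectTimeoutEnvName, container.Env) {
		container.Env = append(container.Env, corev1.EnvVar{
			Name:  rayGcsReconnectTimeoutEnvName,
			Value: defaultWorkerRayGcsReconnectTimeoutS,
		})
	}
}

func main() {
	worker := corev1.Container{Name: "ray-worker"}
	injectWorkerGcsReconnectTimeout(&worker)
	fmt.Println(worker.Env) // prints the injected RAY_gcs_rpc_server_reconnect_timeout_s=600 entry
}

Because the append is guarded by envVarExists, a value supplied by the user in the RayCluster spec always takes precedence over the 600s default.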

Minor changes

Related issue number

Closes #634

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(
# Build the KubeRay image (controller:latest)
helm install kuberay-operator kuberay/kuberay-operator --version 0.5.0 --set image.repository=controller,image.tag=latest

# Create a RayCluster with GCS FT
kubectl apply -f ray-cluster.external-redis.yaml

# This env var should not be set on the head Pod by default
kubectl describe pod $YOUR_HEAD_POD | grep RAY_gcs_rpc_server_reconnect_timeout_s

# The env var should be injected into worker Pods by default.
kubectl describe pod $YOUR_WORKER_POD | grep RAY_gcs_rpc_server_reconnect_timeout_s
# RAY_gcs_rpc_server_reconnect_timeout_s:  600

# Kill the GCS server process on the head Pod
kubectl exec -it $YOUR_HEAD_POD -- pkill gcs_server

# Expected behavior: head Pod will crash after 60 seconds, and workers will not crash.
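To see why the head and workers behave so differently after pkill gcs_server, it helps to model the reconnect timeout as a retry deadline: the Raylet keeps probing GCS and exits the process only when the deadline passes. The Go sketch below is a scaled-down conceptual model (Ray's real logic is implemented in its C++ core; waitForGCS and probe are stand-ins), not the actual implementation:

package main

import (
	"fmt"
	"time"
)

// waitForGCS retries probe until it succeeds or the reconnect timeout expires.
// It models RAY_gcs_rpc_server_reconnect_timeout_s: a 60s head timeout gives up
// before a ~120s GCS recovery, while a 600s worker timeout comfortably outlasts it.
func waitForGCS(probe func() bool, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if probe() {
			return true // reconnected: the node keeps running
		}
		time.Sleep(100 * time.Millisecond)
	}
	return false // timed out: the Raylet would exit the process here
}

func main() {
	// Pretend the new GCS server becomes reachable 1 second from now
	// (scaled down from ~120s so the demo finishes quickly).
	gcsReadyAt := time.Now().Add(1 * time.Second)
	probe := func() bool { return time.Now().After(gcsReadyAt) }

	fmt.Println("head (0.5s timeout):", waitForGCS(probe, 500*time.Millisecond)) // false -> crashes
	fmt.Println("worker (5s timeout):", waitForGCS(probe, 5*time.Second))        // true  -> survives
}

With a 60s deadline, the head gives up long before the replacement GCS server (typically around 120s away) is reachable; with a 600s deadline, the workers simply keep retrying until it is.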

[Screenshot: Screen Shot 2023-04-19 at 1 15 19 PM]

[Screenshot: Screen Shot 2023-04-19 at 1 23 27 PM]

@kevin85421 kevin85421 changed the title [WIP] [Bug] Worker pods crash unexpectedly when gcs_server on head pod is killed Apr 19, 2023
@kevin85421 kevin85421 marked this pull request as ready for review April 19, 2023 21:37
@kevin85421 kevin85421 changed the title [Bug] Worker pods crash unexpectedly when gcs_server on head pod is killed [Bug][GCS FT] Worker pods crash unexpectedly when gcs_server on head pod is killed Apr 19, 2023
@architkulkarni architkulkarni requested a review from fishbone April 19, 2023 22:32
@architkulkarni (Contributor)

@iycheng do you mind just checking the PR description and confirming that the solution makes sense from the Ray side?

@architkulkarni (Contributor) left a comment:

Looks good to me

@@ -88,7 +89,8 @@ const (
 	RAYCLUSTER_DEFAULT_REQUEUE_SECONDS = 300
 
 	// Ray core default configurations
-	DefaultRedisPassword = "5241590000000000"
+	DefaultRedisPassword                 = "5241590000000000"
+	DefaultWorkerRayGcsReconnectTimeoutS = "600"
Contributor:

Here, or somewhere else in the code, can we add a brief comment explaining why we picked this value? It's explained well in the PR description but it's helpful to have it in a code comment

Member Author:

Added comment in d0991ff

@@ -92,6 +92,8 @@ spec:
   # RAY_REDIS_ADDRESS can force ray to use external redis
   - name: RAY_REDIS_ADDRESS
     value: redis:6379
+  - name: RAY_gcs_rpc_server_reconnect_timeout_s
+    value: "20"
Contributor:

Do we intend to check this value somewhere? I didn't see it in the test

Member Author:

This is used to accelerate the test. Some tests in compatibility-test.py will kill the GCS server process on the head Pod and wait until the new one is available. By default, the head will crash after 60 seconds. With this configuration, it will crash after 20 seconds.

Contributor:

Oh right. This is another place where a code comment would be helpful

Member Author:

Added comment in d0991ff

if len(envVars) == 0 {
	return false
}

for _, env := range envVars {
Contributor:

Doesn't this break if the envVars slice is nil? Previously, len(envVars) == 0 protected against that, if I'm not wrong.

Member Author:

If the slice is nil or empty, the range loop body is never entered, so the explicit length check isn't needed.

package main

import "fmt"

// envVarExists reports whether fruitName is present in fruits.
// Ranging over a nil or empty slice never enters the loop body.
func envVarExists(fruitName string, fruits []string) bool {
	for _, fruit := range fruits {
		fmt.Println("Enter the loop")
		if fruitName == fruit {
			return true
		}
	}
	return false
}

func main() {
	fruits := []string{"apple", "banana", "cherry", "date"}
	fmt.Println(envVarExists("banana", fruits))     // true
	fmt.Println(envVarExists("aaaaaa", fruits))     // false
	fmt.Println(envVarExists("aaaaaa", []string{})) // false
}

# STDOUT
Enter the loop
Enter the loop
true
Enter the loop
Enter the loop
Enter the loop
Enter the loop
false
false

Member Author:

You can try this on https://go.dev/play/

@kevin85421 (Member Author)

@architkulkarni would you mind taking a look at this CI failure (https://github.com/ray-project/kuberay/actions/runs/4758810193/jobs/8457448052?pr=1036)? This seems to be the second time I've seen this error message. Thanks!

@kevin85421 (Member Author)

@wilsonwang371 @iycheng would you mind taking a look at this PR? You only need to review pod.go and pod_test.go; it should take about 10 minutes. Thanks!

// RAY_GCS_RPC_SERVER_RECONNECT_TIMEOUT_S to 600s. If the worker cannot reconnect to GCS within
// 600s, the Raylet will exit the process. By default, the value is 60s, so the head node will
// crash if the GCS server is down for more than 60s. Typically, the new GCS server will be available
// in 120 seconds, so we set the timeout to 600s to avoid the worker nodes crashing.
Collaborator:

Why do we need to set it to 600? Would 120 or a similar value work?

Member Author:

120 seconds works in my local Kind cluster, but it may not work in a production Kubernetes cluster. Users can easily change the value by setting the environment variable themselves. In addition, Pods on public clouds can experience network disconnections lasting several minutes. That's why I decided to select a higher timeout.

Member Author:

Gentle ping @wilsonwang371

Does this make sense to you?

@kevin85421 (Member Author)

Merging this PR because it blocks #1055. Feel free to add more comments on this PR, and I will open another PR to address them.

@kevin85421 kevin85421 merged commit 2019b4b into ray-project:master Apr 27, 2023
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…pod is killed (ray-project#1036)

Worker pods crash unexpectedly when gcs_server on head pod is killed