[Bug][GCS FT] Worker pods crash unexpectedly when gcs_server on head pod is killed #1036
Conversation
@iycheng do you mind just checking the PR description and confirming that the solution makes sense from the Ray side?
Looks good to me
```
@@ -88,7 +89,8 @@ const (
	RAYCLUSTER_DEFAULT_REQUEUE_SECONDS = 300

	// Ray core default configurations
	DefaultRedisPassword = "5241590000000000"
	DefaultWorkerRayGcsReconnectTimeoutS = "600"
```
Here, or somewhere else in the code, can we add a brief comment explaining why we picked this value? It's explained well in the PR description but it's helpful to have it in a code comment
Added comment in d0991ff
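For context, the sketch below is not the actual KubeRay code; it only illustrates how such a default could be injected into a worker container when the user has not already set the variable. The helper name, the main function, and the use of k8s.io/api/core/v1 types are assumptions here, and building it requires the k8s.io/api dependency.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// DefaultWorkerRayGcsReconnectTimeoutS mirrors the constant added in this diff.
const DefaultWorkerRayGcsReconnectTimeoutS = "600"

// setWorkerGcsReconnectTimeout is a hypothetical helper: it appends the default
// only if the user has not already provided the environment variable.
func setWorkerGcsReconnectTimeout(container *corev1.Container) {
	for _, env := range container.Env {
		if env.Name == "RAY_gcs_rpc_server_reconnect_timeout_s" {
			return // respect the user-provided value
		}
	}
	container.Env = append(container.Env, corev1.EnvVar{
		Name:  "RAY_gcs_rpc_server_reconnect_timeout_s",
		Value: DefaultWorkerRayGcsReconnectTimeoutS,
	})
}

func main() {
	// Worker without the variable: the 600s default is injected.
	plain := &corev1.Container{Name: "ray-worker"}
	setWorkerGcsReconnectTimeout(plain)
	fmt.Println(plain.Env[0].Value) // 600

	// Worker with an explicit override: the user value is kept.
	custom := &corev1.Container{Name: "ray-worker", Env: []corev1.EnvVar{
		{Name: "RAY_gcs_rpc_server_reconnect_timeout_s", Value: "120"},
	}}
	setWorkerGcsReconnectTimeout(custom)
	fmt.Println(custom.Env[0].Value) // 120
}
```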
```
@@ -92,6 +92,8 @@ spec:
  # RAY_REDIS_ADDRESS can force ray to use external redis
  - name: RAY_REDIS_ADDRESS
    value: redis:6379
  - name: RAY_gcs_rpc_server_reconnect_timeout_s
    value: "20"
```
Do we intend to check this value somewhere? I didn't see it in the test
This is used to accelerate the test. Some tests in compatibility-test.py will kill the GCS server process on the head Pod and wait until the new one is available. By default, the head will crash after 60 seconds. With this configuration, it will crash after 20 seconds.
Oh right. This is another place where a code comment would be helpful
Added comment in d0991ff
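For readers unfamiliar with the test flow: the real logic lives in compatibility-test.py (Python). The sketch below is only a rough Go illustration of the "kill gcs_server, then wait until the head container is ready again" idea; the client-go usage, Pod name, container name, and namespace are all assumptions.

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (illustrative setup).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// After killing gcs_server on the head Pod, poll the Pod until its head
	// container reports Ready again. With RAY_gcs_rpc_server_reconnect_timeout_s=20
	// the container crashes and restarts sooner, so this wait finishes faster
	// than with the 60s default. Pod/container names below are hypothetical.
	for {
		pod, err := clientset.CoreV1().Pods("default").Get(
			context.TODO(), "raycluster-external-redis-head", metav1.GetOptions{})
		if err == nil {
			for _, cs := range pod.Status.ContainerStatuses {
				if cs.Name == "ray-head" && cs.Ready {
					fmt.Println("head container is ready again")
					return
				}
			}
		}
		time.Sleep(5 * time.Second)
	}
}
```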
```
	if len(envVars) == 0 {
		return false
	}

	for _, env := range envVars {
```
Doesn't this break if the envVars slice is nil? Previously, the len(envVars) == 0 check protected against that, if I'm not wrong.
If the slice is nil or empty, the loop body is never entered, so the check is not needed.
```go
package main

import "fmt"

func envVarExists(fruitName string, fruits []string) bool {
	for _, fruit := range fruits {
		fmt.Println("Enter the loop")
		if fruitName == fruit {
			return true
		}
	}
	return false
}

func main() {
	fruits := []string{"apple", "banana", "cherry", "date"}
	fmt.Println(envVarExists("banana", fruits))     // true
	fmt.Println(envVarExists("aaaaaa", fruits))     // false
	fmt.Println(envVarExists("aaaaaa", []string{})) // false
}
```
```
# STDOUT
Enter the loop
Enter the loop
true
Enter the loop
Enter the loop
Enter the loop
Enter the loop
false
false
```
You can try this on https://go.dev/play/
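For the nil case that the question asked about: ranging over a nil slice in Go also performs zero iterations, so no explicit length check is needed there either. A quick variant of the same playground example (the nil-slice demo is an addition, not part of the original comment):

```go
package main

import "fmt"

func envVarExists(name string, envVars []string) bool {
	for _, env := range envVars {
		if name == env {
			return true
		}
	}
	return false
}

func main() {
	var envVars []string                    // nil slice, never initialized
	fmt.Println(envVars == nil)             // true
	fmt.Println(envVarExists("x", envVars)) // false, no panic
}
```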
@architkulkarni would you mind taking a look at this CI failure (https://github.com/ray-project/kuberay/actions/runs/4758810193/jobs/8457448052?pr=1036)? This seems to be the second time I have seen this error message. Thanks!
@wilsonwang371 @iycheng would you mind taking a look at this PR? You only need to take a look at
```go
// RAY_GCS_RPC_SERVER_RECONNECT_TIMEOUT_S to 600s. If the worker cannot reconnect to GCS within
// 600s, the Raylet will exit the process. By default, the value is 60s, so the head node will
// crash if the GCS server is down for more than 60s. Typically, the new GCS server will be available
// in 120 seconds, so we set the timeout to 600s to avoid the worker nodes crashing.
```
Why do we need to set it to 600? Will 120 or similar values work?
120 seconds works in my local Kind cluster, but it may not work in a production Kubernetes cluster. Users can easily change the value by setting an environment variable. In addition, it is possible for some Pods on public clouds to experience network disconnections for several minutes. That's why I decided to select a higher timeout.
Gentle ping @wilsonwang371
Does this make sense to you?
Merging this PR because it blocks #1055. Feel free to add more comments on this PR, and I will open another PR to address them.
Worker pods crash unexpectedly when gcs_server on head pod is killed (ray-project#1036)
Why are these changes needed?
Major changes
When we kill the GCS server process (pkill gcs_server) on the head Pod, a Raylet (?) will exit the process if it cannot connect to the GCS server within RAY_gcs_rpc_server_reconnect_timeout_s (code). By default, the value is 60s.

In #634, the RAY_gcs_rpc_server_reconnect_timeout_s values for both the head and the workers are set to 60 seconds. This means that the head and workers will crash almost simultaneously, 60 seconds after the GCS server process is terminated. Therefore, we need to ensure that the RAY_gcs_rpc_server_reconnect_timeout_s for the workers is greater than the time it takes for a new GCS server process to become available after the old one is terminated. That is, RAY_gcs_rpc_server_reconnect_timeout_s (workers) > time for a new GCS server to become available.

This PR injects the RAY_gcs_rpc_server_reconnect_timeout_s environment variable into workers with a default value of 600s if GCS FT is enabled. Hence, the head will crash 60 seconds after the process is terminated, the new GCS server process will be ready before 600 seconds, and the workers will not crash.
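To make the timing argument concrete, here is a tiny Go snippet that only restates the numbers from this PR; the constant names are illustrative and not actual KubeRay identifiers.

```go
package main

import "fmt"

func main() {
	const (
		headReconnectTimeoutS   = 60  // Ray default: the head's Raylet exits after 60s without GCS
		workerReconnectTimeoutS = 600 // default injected by this PR when GCS FT is enabled
		typicalGcsRecoveryS     = 120 // rough time observed for a new GCS server to become available
	)

	// The head crashes and restarts along with the new GCS server, while the
	// workers must outlast the whole recovery window.
	if workerReconnectTimeoutS > typicalGcsRecoveryS {
		fmt.Println("workers survive the GCS server restart")
	} else {
		fmt.Println("workers crash before the new GCS server is ready")
	}
}
```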
Minor changes

- Remove the rayCluster label: the label was already removed from the sample YAML files by "Revise sample configs, increase memory requests, update Ray versions" (#761). This PR removes it from the unit tests.
- compatibility-test.py: refactoring only.
- ray-cluster.ray-ft.yaml.template: set RAY_gcs_rpc_server_reconnect_timeout_s to make the compatibility test faster.

Related issue number
Closes #634
Checks