What happened + What you expected to happen
I am attempting to add end-to-end fault tolerance to my Ray cluster using instructions from these sources:
https://docs.ray.io/en/latest/serve/production-guide/fault-tolerance.html#step-2-add-redis-info-to-rayservice
https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/kuberay-gcs-ft.html#kuberay-external-storage-namespace-example
And using this file as a guide:
https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml
The simple example provided in the docs works, but as soon as I switch to an external Redis source, the cluster doesn't start up in a healthy state, and the only logs I can find reference a GCS issue. I would expect that after switching from a Redis pod on the head node to an external Redis service, the rest of the Ray cluster would work the same and nothing would change.
Versions / Dependencies
The versions being used are noted in the YAMLs below, but I'll list them here too:
Cluster API version: v1alpha1
KubeRay version: 1.0.0-rc.0
Ray version: 2.7.1
Reproduction script
This is the working version:
kind: ConfigMap
apiVersion: v1
metadata:
  name: redis-config
  labels:
    app: redis
data:
  redis.conf: |-
    dir /data
    port 6379
    bind 0.0.0.0
    appendonly yes
    protected-mode no
    requirepass 5241590000000000
    pidfile /data/redis-6379.pid
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  labels:
    app: redis
spec:
  type: ClusterIP
  ports:
    - name: redis
      port: 6379
  selector:
    app: redis
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:5.0.8
          command:
            - "sh"
            - "-c"
            - "redis-server /usr/local/etc/redis/redis.conf"
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: config
              mountPath: /usr/local/etc/redis/redis.conf
              subPath: redis.conf
      volumes:
        - name: config
          configMap:
            name: redis-config
---
# Redis password
apiVersion: v1
kind: Secret
metadata:
  name: redis-password-secret
type: Opaque
data:
  # echo -n "5241590000000000" | base64
  password: NTI0MTU5MDAwMDAwMDAwMA==
---
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  annotations:
    ray.io/ft-enabled: "true" # enable Ray GCS FT
    # In most cases, you don't need to set `ray.io/external-storage-namespace` because KubeRay will
    # automatically set it to the UID of RayCluster. Only modify this annotation if you fully understand
    # the behaviors of the Ray GCS FT and RayService to avoid misconfiguration.
    # [Example]:
    # ray.io/external-storage-namespace: "my-raycluster-storage"
  name: raycluster-same-machine-redis
spec:
  rayVersion: '2.7.0'
  headGroupSpec:
    # The `rayStartParams` are used to configure the `ray start` command.
    # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
    # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
    rayStartParams:
      # Setting "num-cpus: 0" to avoid any Ray actors or tasks being scheduled on the Ray head Pod.
      num-cpus: "0"
      # redis-password should match "requirepass" in redis.conf in the ConfigMap above.
      # Ray 2.3.0 changes the default redis password from "5241590000000000" to "".
      redis-password: $REDIS_PASSWORD
    # Pod template
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.7.0
            resources:
              limits:
                cpu: "1"
              requests:
                cpu: "1"
            env:
              # Ray will read the RAY_REDIS_ADDRESS environment variable to establish
              # a connection with the Redis server. In this instance, we use the "redis"
              # Kubernetes ClusterIP service name, also created by this YAML, as the
              # connection point to the Redis server.
              - name: RAY_REDIS_ADDRESS
                value: redis:6379
              # This environment variable is used in the `rayStartParams` above.
              - name: REDIS_PASSWORD
                valueFrom:
                  secretKeyRef:
                    name: redis-password-secret
                    key: password
            ports:
              - containerPort: 6379
                name: redis
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
            volumeMounts:
              - mountPath: /tmp/ray
                name: ray-logs
              - mountPath: /home/ray/samples
                name: ray-example-configmap
        volumes:
          - name: ray-logs
            emptyDir: {}
          - name: ray-example-configmap
            configMap:
              name: ray-example
              defaultMode: 0777
              items:
                - key: detached_actor.py
                  path: detached_actor.py
                - key: increment_counter.py
                  path: increment_counter.py
  workerGroupSpecs:
    # the pod replicas in this group typed worker
    - replicas: 1
      minReplicas: 1
      maxReplicas: 10
      groupName: small-group
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams: {}
      # Pod template
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.7.0
              volumeMounts:
                - mountPath: /tmp/ray
                  name: ray-logs
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "1"
          volumes:
            - name: ray-logs
              emptyDir: {}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-example
data:
  detached_actor.py: |
    import ray

    @ray.remote(num_cpus=1)
    class Counter:
        def __init__(self):
            self.value = 0

        def increment(self):
            self.value += 1
            return self.value

    ray.init(namespace="default_namespace")
    Counter.options(name="counter_actor", lifetime="detached").remote()
  increment_counter.py: |
    import ray

    ray.init(namespace="default_namespace")
    counter = ray.get_actor("counter_actor")
    print(ray.get(counter.increment.remote()))
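For anyone reproducing this, a minimal way to apply and sanity-check the working version is sketched below. The filename is just an example, the ray.io/cluster label selector is the one KubeRay adds to the pods it creates as far as I can tell, and the Redis password is the one from the ConfigMap/Secret above.

# Apply all of the manifests above (saved here as ray-cluster-working.yaml, an arbitrary filename).
kubectl apply -f ray-cluster-working.yaml

# Watch the head and worker pods come up.
kubectl get pods -l ray.io/cluster=raycluster-same-machine-redis -w

# Confirm the GCS has written its metadata into the in-cluster Redis.
kubectl exec deployment/redis -- redis-cli -a 5241590000000000 KEYS '*'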
And this is the version that doesn't work:
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  annotations:
    ray.io/ft-enabled: "true" # enable Ray GCS FT
    # In most cases, you don't need to set `ray.io/external-storage-namespace` because KubeRay will
    # automatically set it to the UID of RayCluster. Only modify this annotation if you fully understand
    # the behaviors of the Ray GCS FT and RayService to avoid misconfiguration.
    # [Example]:
    # ray.io/external-storage-namespace: "my-raycluster-storage"
  name: raycluster-external-redis-v2
spec:
  rayVersion: '2.7.0'
  headGroupSpec:
    # The `rayStartParams` are used to configure the `ray start` command.
    # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
    # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
    rayStartParams:
      # Setting "num-cpus: 0" to avoid any Ray actors or tasks being scheduled on the Ray head Pod.
      num-cpus: "0"
      # redis-password should match "requirepass" in redis.conf in the ConfigMap above.
      # Ray 2.3.0 changes the default redis password from "5241590000000000" to "".
      redis-password: $REDIS_PASSWORD
    # Pod template
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.7.0
            resources:
              limits:
                cpu: "1"
              requests:
                cpu: "1"
            env:
              # Ray will read the RAY_REDIS_ADDRESS environment variable to establish
              # a connection with the Redis server.
              - name: RAY_REDIS_ADDRESS
                value: our_external_redis_address:6379
              # This environment variable is used in the `rayStartParams` above.
              - name: REDIS_PASSWORD
                valueFrom:
                  secretKeyRef:
                    name: repo
                    key: redis_passwd
            ports:
              - containerPort: 6379
                name: redis
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
            volumeMounts:
              - mountPath: /tmp/ray
                name: ray-logs
              - mountPath: /home/ray/samples
                name: ray-example-ha-external-configmap
        volumes:
          - name: ray-logs
            emptyDir: {}
          - name: ray-example-ha-external-configmap
            configMap:
              name: ray-example-ha-external
              defaultMode: 0777
              items:
                - key: detached_actor.py
                  path: detached_actor.py
                - key: increment_counter.py
                  path: increment_counter.py
  workerGroupSpecs:
    # the pod replicas in this group typed worker
    - replicas: 1
      minReplicas: 1
      maxReplicas: 10
      groupName: small-group
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams: {}
      # Pod template
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.7.0
              volumeMounts:
                - mountPath: /tmp/ray
                  name: ray-logs
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "1"
          volumes:
            - name: ray-logs
              emptyDir: {}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-example-ha-external
data:
  detached_actor.py: |
    import ray

    @ray.remote(num_cpus=1)
    class Counter:
        def __init__(self):
            self.value = 0

        def increment(self):
            self.value += 1
            return self.value

    ray.init(namespace="default_namespace")
    Counter.options(name="counter_actor", lifetime="detached").remote()
  increment_counter.py: |
    import ray

    ray.init(namespace="default_namespace")
    counter = ray.get_actor("counter_actor")
    print(ray.get(counter.increment.remote()))
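One difference from the working example worth noting: the password now comes from a pre-existing Secret (name repo, key redis_passwd) rather than one defined in the same file. A quick way to verify that Secret holds what ray start expects looks like this (sketch only, assuming the Secret lives in the same namespace as the cluster):

# Decode the referenced key and compare it against the external Redis password.
kubectl get secret repo -o jsonpath='{.data.redis_passwd}' | base64 -d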
I have previously connected to the head node and confirmed that the Redis address and password environment variables are passed through correctly.
I've also been told by our DevOps team that the external Redis is being used by several other teams and services, so it is definitely running.
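Concretely, checks along these lines can be run from the head pod (the ray.io/node-type=head label selector is an assumption about what KubeRay sets on head pods; our_external_redis_address stands in for our real endpoint):

# Grab the head pod name.
HEAD_POD=$(kubectl get pods -l ray.io/node-type=head -o name | head -n 1)

# Confirm the Redis address and password were injected into the head pod's environment.
kubectl exec "$HEAD_POD" -- env | grep -E 'RAY_REDIS_ADDRESS|REDIS_PASSWORD'

# Raw TCP reachability probe from inside the head pod; this only proves the port is open,
# it says nothing about AUTH, TLS, or cluster mode.
kubectl exec "$HEAD_POD" -- bash -c 'timeout 3 bash -c "</dev/tcp/our_external_redis_address/6379" && echo open || echo closed'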
We use Argo to synchronize and apply deployment changes. On Argo, the only errors we see are for the worker group, and they look like this:
1 seconds elapsed: Waiting for GCS to be ready.
8 seconds elapsed: Waiting for GCS to be ready.
15 seconds elapsed: Waiting for GCS to be ready.
21 seconds elapsed: Waiting for GCS to be ready.
28 seconds elapsed: Waiting for GCS to be ready.
34 seconds elapsed: Waiting for GCS to be ready.
41 seconds elapsed: Waiting for GCS to be ready.
47 seconds elapsed: Waiting for GCS to be ready.
54 seconds elapsed: Waiting for GCS to be ready.
61 seconds elapsed: Waiting for GCS to be ready.
67 seconds elapsed: Waiting for GCS to be ready.
74 seconds elapsed: Waiting for GCS to be ready.
80 seconds elapsed: Waiting for GCS to be ready.
87 seconds elapsed: Waiting for GCS to be ready.
94 seconds elapsed: Waiting for GCS to be ready.
100 seconds elapsed: Waiting for GCS to be ready.
107 seconds elapsed: Waiting for GCS to be ready.
113 seconds elapsed: Waiting for GCS to be ready.
120 seconds elapsed: Waiting for GCS to be ready.
Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 2929, in ray._raylet.check_health
  File "python/ray/_raylet.pyx", line 455, in ray._raylet.check_status
ray.exceptions.RpcError: failed to connect to all addresses; last error: UNKNOWN: ipv4:172.20.115.241:6379: Failed to connect to remote host: Connection refused
126 seconds elapsed: Still waiting for GCS to be ready. For troubleshooting, refer to the FAQ at https://github.com/ray-project/kuberay/blob/master/docs/guidance/FAQ.md.
Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 2929, in ray._raylet.check_health
  File "python/ray/_raylet.pyx", line 455, in ray._raylet.check_status
ray.exceptions.RpcError: failed to connect to all addresses; last error: UNKNOWN: ipv4:172.20.115.241:6379: Failed to connect to remote host: Connection refused
133 seconds elapsed: Still waiting for GCS to be ready. For troubleshooting, refer to the FAQ at https://github.com/ray-project/kuberay/blob/master/docs/guidance/FAQ.md
If I attempt to use kubectl commands to get logs for the cluster or the pods, there is no further information.
When I check the /tmp/ray folder on the head pod, it is empty.
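For reference, these are the kinds of commands I mean (HEAD_POD as in the earlier snippet; the cluster name matches the failing manifest above):

# Cluster-level status and events.
kubectl describe raycluster raycluster-external-redis-v2

# Container logs from the head pod (ray-head is the container name from the manifest).
kubectl logs "$HEAD_POD" -c ray-head

# The Ray log directory on the head pod, which comes back empty.
kubectl exec "$HEAD_POD" -- ls -la /tmp/ray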
Issue Severity
High: It blocks me from completing my task.
Thanks for the quick response @alexeykudinkin.
For Redis, we are using an AWS-managed instance, so we don't have full configuration details.
Our infra team has informed me that we are using one of the default parameter groups, default.redis5.0.cluster.on. I think details about this can be found here.
Additionally, we have enabled Cluster mode and Encryption in transit.
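In case it is relevant: since encryption in transit is enabled on that Redis, a plain-TCP probe like the one earlier only tells part of the story. A TLS-aware check from inside the head pod might look like the sketch below; it assumes a redis-cli build with TLS support (Redis 6+) has been installed in the pod, which I don't believe the rayproject/ray image ships by default.

# Hypothetical TLS-aware ping against the external endpoint; REDIS_PASSWORD is already
# set in the head pod's environment by the manifest above.
kubectl exec "$HEAD_POD" -- bash -c \
  'redis-cli --tls -h our_external_redis_address -p 6379 -a "$REDIS_PASSWORD" ping'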