
Fault Tolerance not working with external redis #40562

Closed
Tracked by #1033
ashwindcruz opened this issue Oct 23, 2023 · 4 comments
Labels
bug Something that is supposed to be working; but isn't P1 Issue that should be fixed within a few weeks serve Ray Serve Related Issue

Comments

@ashwindcruz

What happened + What you expected to happen

I am attempting to add end-to-end fault tolerance to my ray cluster using instructions from these sources:
https://docs.ray.io/en/latest/serve/production-guide/fault-tolerance.html#step-2-add-redis-info-to-rayservice
https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/kuberay-gcs-ft.html#kuberay-external-storage-namespace-example
And using this file as a guide:
https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.external-redis.yaml

The simple example provided in the docs works, but as soon as I switch to an external Redis source, the cluster doesn't start up in a healthy state, and the only logs I can find reference a GCS issue. I would expect that after switching from a Redis pod on the head node to an external Redis service, the rest of the Ray service would work the same and nothing would change.

Versions / Dependencies

The versions being used are noted in the YAMLs below, but I'll list them here too:

  • Cluster API version: v1alpha1
  • KubeRay version: 1.0.0-rc.0
  • Ray version: 2.7.1

Reproduction script

This is the working version:

kind: ConfigMap
apiVersion: v1
metadata:
  name: redis-config
  labels:
    app: redis
data:
  redis.conf: |-
    dir /data
    port 6379
    bind 0.0.0.0
    appendonly yes
    protected-mode no
    requirepass 5241590000000000
    pidfile /data/redis-6379.pid
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  labels:
    app: redis
spec:
  type: ClusterIP
  ports:
    - name: redis
      port: 6379
  selector:
    app: redis
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:5.0.8
          command:
            - "sh"
            - "-c"
            - "redis-server /usr/local/etc/redis/redis.conf"
          ports:
            - containerPort: 6379
          volumeMounts:
            - name: config
              mountPath: /usr/local/etc/redis/redis.conf
              subPath: redis.conf
      volumes:
        - name: config
          configMap:
            name: redis-config
---
# Redis password
apiVersion: v1
kind: Secret
metadata:
  name: redis-password-secret
type: Opaque
data:
  # echo -n "5241590000000000" | base64
  password: NTI0MTU5MDAwMDAwMDAwMA==
---
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  annotations:
    ray.io/ft-enabled: "true" # enable Ray GCS FT
    # In most cases, you don't need to set `ray.io/external-storage-namespace` because KubeRay will
    # automatically set it to the UID of RayCluster. Only modify this annotation if you fully understand
    # the behaviors of the Ray GCS FT and RayService to avoid misconfiguration.
    # [Example]:
    # ray.io/external-storage-namespace: "my-raycluster-storage"
  name: raycluster-same-machine-redis
spec:
  rayVersion: '2.7.0'
  headGroupSpec:
    # The `rayStartParams` are used to configure the `ray start` command.
    # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
    # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
    rayStartParams:
      # Setting "num-cpus: 0" to avoid any Ray actors or tasks being scheduled on the Ray head Pod.
      num-cpus: "0"
      # redis-password should match "requirepass" in redis.conf in the ConfigMap above.
      # Ray 2.3.0 changes the default redis password from "5241590000000000" to "".
      redis-password: $REDIS_PASSWORD
    # Pod template
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.7.0
            resources:
              limits:
                cpu: "1"
              requests:
                cpu: "1"
            env:
              # Ray will read the RAY_REDIS_ADDRESS environment variable to establish
              # a connection with the Redis server. In this instance, we use the "redis"
              # Kubernetes ClusterIP service name, also created by this YAML, as the
              # connection point to the Redis server.
              - name: RAY_REDIS_ADDRESS
                value: redis:6379
              # This environment variable is used in the `rayStartParams` above.
              - name: REDIS_PASSWORD
                valueFrom:
                  secretKeyRef:
                    name: redis-password-secret
                    key: password
            ports:
              - containerPort: 6379
                name: redis
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
            volumeMounts:
              - mountPath: /tmp/ray
                name: ray-logs
              - mountPath: /home/ray/samples
                name: ray-example-configmap
        volumes:
          - name: ray-logs
            emptyDir: {}
          - name: ray-example-configmap
            configMap:
              name: ray-example
              defaultMode: 0777
              items:
                - key: detached_actor.py
                  path: detached_actor.py
                - key: increment_counter.py
                  path: increment_counter.py
  workerGroupSpecs:
    # the pod replicas in this group typed worker
    - replicas: 1
      minReplicas: 1
      maxReplicas: 10
      groupName: small-group
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams: {}
      # Pod template
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.7.0
              volumeMounts:
                - mountPath: /tmp/ray
                  name: ray-logs
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "1"
          volumes:
            - name: ray-logs
              emptyDir: {}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-example
data:
  detached_actor.py: |
    import ray

    @ray.remote(num_cpus=1)
    class Counter:
      def __init__(self):
          self.value = 0

      def increment(self):
          self.value += 1
          return self.value

    ray.init(namespace="default_namespace")
    Counter.options(name="counter_actor", lifetime="detached").remote()
  increment_counter.py: |
    import ray

    ray.init(namespace="default_namespace")
    counter = ray.get_actor("counter_actor")
    print(ray.get(counter.increment.remote()))
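One sanity check worth doing on the working manifest above: confirm that the Secret's base64 payload really matches the requirepass value in redis.conf, since a mismatch produces similar startup failures. A quick illustrative Python check (not part of the deployment):

```python
import base64

# The Secret's data.password field must be the base64 encoding of the
# plaintext "requirepass" value from redis.conf in the ConfigMap above.
plaintext = b"5241590000000000"
encoded = base64.b64encode(plaintext).decode()
print(encoded)  # NTI0MTU5MDAwMDAwMDAwMA== (matches the Secret)
```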

And this is the version that doesn't work:

apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  annotations:
    ray.io/ft-enabled: "true" # enable Ray GCS FT
    # In most cases, you don't need to set `ray.io/external-storage-namespace` because KubeRay will
    # automatically set it to the UID of RayCluster. Only modify this annotation if you fully understand
    # the behaviors of the Ray GCS FT and RayService to avoid misconfiguration.
    # [Example]:
    # ray.io/external-storage-namespace: "my-raycluster-storage"
  name: raycluster-external-redis-v2
spec:
  rayVersion: '2.7.0'
  headGroupSpec:
    # The `rayStartParams` are used to configure the `ray start` command.
    # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
    # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
    rayStartParams:
      # Setting "num-cpus: 0" to avoid any Ray actors or tasks being scheduled on the Ray head Pod.
      num-cpus: "0"
      # redis-password should match "requirepass" in redis.conf in the ConfigMap above.
      # Ray 2.3.0 changes the default redis password from "5241590000000000" to "".
      redis-password: $REDIS_PASSWORD
    # Pod template
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.7.0
            resources:
              limits:
                cpu: "1"
              requests:
                cpu: "1"
            env:
              # Ray will read the RAY_REDIS_ADDRESS environment variable to establish
              # a connection with the Redis server.
              - name: RAY_REDIS_ADDRESS
                value: our_external_redis_address:6379
              # This environment variable is used in the `rayStartParams` above.
              - name: REDIS_PASSWORD
                valueFrom:
                  secretKeyRef:
                    name: repo
                    key: redis_passwd
            ports:
              - containerPort: 6379
                name: redis
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
            volumeMounts:
              - mountPath: /tmp/ray
                name: ray-logs
              - mountPath: /home/ray/samples
                name: ray-example-ha-external-configmap
        volumes:
          - name: ray-logs
            emptyDir: {}
          - name: ray-example-ha-external-configmap
            configMap:
              name: ray-example-ha-external
              defaultMode: 0777
              items:
                - key: detached_actor.py
                  path: detached_actor.py
                - key: increment_counter.py
                  path: increment_counter.py
  workerGroupSpecs:
    # the pod replicas in this group typed worker
    - replicas: 1
      minReplicas: 1
      maxReplicas: 10
      groupName: small-group
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams: {}
      # Pod template
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.7.0
              volumeMounts:
                - mountPath: /tmp/ray
                  name: ray-logs
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "1"
          volumes:
            - name: ray-logs
              emptyDir: {}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-example-ha-external
data:
  detached_actor.py: |
    import ray

    @ray.remote(num_cpus=1)
    class Counter:
      def __init__(self):
          self.value = 0

      def increment(self):
          self.value += 1
          return self.value

    ray.init(namespace="default_namespace")
    Counter.options(name="counter_actor", lifetime="detached").remote()
  increment_counter.py: |
    import ray

    ray.init(namespace="default_namespace")
    counter = ray.get_actor("counter_actor")
    print(ray.get(counter.increment.remote()))

I have previously connected to the head node and confirmed that the Redis address and password environment variables are passed through correctly.
I've also been told by our DevOps team that the external Redis is used by several other teams and services, so it is definitely running.

We use Argo to synchronize and apply deployment changes. On Argo, the only errors we seem to get are for the worker group, and we see this:

1 seconds elapsed: Waiting for GCS to be ready.
8 seconds elapsed: Waiting for GCS to be ready.
15 seconds elapsed: Waiting for GCS to be ready.
21 seconds elapsed: Waiting for GCS to be ready.
28 seconds elapsed: Waiting for GCS to be ready.
34 seconds elapsed: Waiting for GCS to be ready.
41 seconds elapsed: Waiting for GCS to be ready.
47 seconds elapsed: Waiting for GCS to be ready.
54 seconds elapsed: Waiting for GCS to be ready.
61 seconds elapsed: Waiting for GCS to be ready.
67 seconds elapsed: Waiting for GCS to be ready.
74 seconds elapsed: Waiting for GCS to be ready.
80 seconds elapsed: Waiting for GCS to be ready.
87 seconds elapsed: Waiting for GCS to be ready.
94 seconds elapsed: Waiting for GCS to be ready.
100 seconds elapsed: Waiting for GCS to be ready.
107 seconds elapsed: Waiting for GCS to be ready.
113 seconds elapsed: Waiting for GCS to be ready.
120 seconds elapsed: Waiting for GCS to be ready.
Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 2929, in ray._raylet.check_health
  File "python/ray/_raylet.pyx", line 455, in ray._raylet.check_status
ray.exceptions.RpcError: failed to connect to all addresses; last error: UNKNOWN: ipv4:172.20.115.241:6379: Failed to connect to remote host: Connection refused
126 seconds elapsed: Still waiting for GCS to be ready. For troubleshooting, refer to the FAQ at https://github.com/ray-project/kuberay/blob/master/docs/guidance/FAQ.md.
Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 2929, in ray._raylet.check_health
  File "python/ray/_raylet.pyx", line 455, in ray._raylet.check_status
ray.exceptions.RpcError: failed to connect to all addresses; last error: UNKNOWN: ipv4:172.20.115.241:6379: Failed to connect to remote host: Connection refused
133 seconds elapsed: Still waiting for GCS to be ready. For troubleshooting, refer to the FAQ at https://github.com/ray-project/kuberay/blob/master/docs/guidance/FAQ.md

If I attempt to use kubectl commands to get logs for the cluster or the pods, there is no further information.
When I check the /tmp/ray folder on the head pod, it is empty.
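Since the traceback shows a plain "Connection refused" on port 6379, it helps to separate basic network reachability from Redis auth or protocol problems. A minimal sketch of the kind of check that can be run from the head pod (this is just an illustrative TCP probe mirroring the host:port format of RAY_REDIS_ADDRESS, not Ray's own health check):

```python
import socket

def redis_tcp_reachable(address: str, timeout: float = 3.0) -> bool:
    """Try a bare TCP connection to the host:port in RAY_REDIS_ADDRESS.

    Only tells you whether the port accepts connections; it says nothing
    about authentication, TLS, or Redis cluster mode.
    """
    host, _, port = address.rpartition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns False with a connection-refused error, the failure is below the Redis protocol layer: a wrong address, a service that isn't routing, or a server that only listens on a TLS endpoint.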

Issue Severity

High: It blocks me from completing my task.

@ashwindcruz ashwindcruz added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 23, 2023
@anyscalesam anyscalesam added the serve Ray Serve Related Issue label Oct 23, 2023
@anyscalesam
Contributor

Starting troubleshooting from the top of the stack at the Libraries layer first; @alexeykudinkin can you please take a look?

@alexeykudinkin
Contributor

@ashwindcruz can you paste your external Redis configuration?

@ashwindcruz
Author

ashwindcruz commented Oct 25, 2023

Thanks for the quick response @alexeykudinkin .
For Redis, we are using a managed service via AWS, so we don't have full configuration details.
Our infra team has informed me that we are using one of the default parameter groups, default.redis5.0.cluster.on. I think details about this can be found here.
Additionally, we have enabled Cluster mode and Encryption in transit.
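For what it's worth, with encryption in transit enabled, a client speaking plain TCP to port 6379 would see exactly this kind of connection failure, and many Redis clients distinguish TLS endpoints via the rediss:// URL scheme rather than redis://. A tiny illustrative helper (not Ray's API; whether Ray's GCS supports a TLS or cluster-mode Redis endpoint here is exactly what's in question):

```python
def redis_url(host: str, port: int, tls: bool, password: str = "") -> str:
    """Build a Redis connection URL; the rediss:// scheme indicates TLS."""
    scheme = "rediss" if tls else "redis"
    auth = f":{password}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}"

# A TLS endpoint (hostname is a placeholder from the manifest above):
print(redis_url("our_external_redis_address", 6379, tls=True))
# rediss://our_external_redis_address:6379
```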

@sihanwang41 sihanwang41 added P1 Issue that should be fixed within a few weeks and removed serve Ray Serve Related Issue labels Oct 30, 2023
@anyscalesam anyscalesam added serve Ray Serve Related Issue and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 30, 2023
@anyscalesam
Contributor

@ashwindcruz is this issue still occurring for you?
