[Bug] [RayService] Cannot move from single app to multi-app without downtime if using GCS FT #1297

Closed
1 of 2 tasks
smit-kiri opened this issue Aug 7, 2023 · 7 comments
Labels: bug, rayservice

smit-kiri commented Aug 7, 2023

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

I'm trying to move all our workloads from a single-application to a multi-application RayService with the release of KubeRay v0.6.0, and it does not seem possible to do so without downtime if we're using GCS FT. I see the following error:

ray.serve.exceptions.RayServeException: You are trying to deploy a multi-application config, however a single-application 
config has been deployed to the current Serve instance already. Mixing single-app and multi-app is not allowed. Please either
redeploy using the single-application config format `ServeApplicationSchema`, or shutdown and restart Serve to submit a
multi-app config of format `ServeDeploySchema`. If you are using the REST API, you can submit a multi-app config to the
the multi-app API endpoint `/api/serve/applications/`.

Reproduction script

Single application:

demo.py
import time

from ray import serve
from ray.serve.drivers import DAGDriver


@serve.deployment(name="model1")
class Model1:
    def __init__(self):
        # Simulate an init method
        time.sleep(60)

    async def __call__(self, http_request):
        data: dict = await http_request.json()
        data["model"] = "model1"
        return data


@serve.deployment(name="model2")
class Model2:
    def __init__(self):
        # Simulate an init method
        time.sleep(60)

    async def __call__(self, http_request):
        data: dict = await http_request.json()
        data["model"] = "model2"
        return data


driver = DAGDriver.bind({"/model1": Model1.bind(), "/model2": Model2.bind()})  # type: ignore
Dockerfile
FROM rayproject/ray:2.6.1-py310 as common

ENV WORKING_DIR /home/ray/models

WORKDIR ${WORKING_DIR}

ADD ./demo.py ${WORKING_DIR}
rayservice_config.yaml
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-sample
  annotations:
    ray.io/ft-enabled: 'true'
    ray.io/external-storage-namespace: rayservice-sample
spec:
  serviceUnhealthySecondThreshold: 900 # Config for the health check threshold for service. Default value is 60.
  deploymentUnhealthySecondThreshold: 900 # Config for the health check threshold for deployments. Default value is 60.
  serveConfig:
    importPath: demo:driver
    deployments:
    - name: model1
      numReplicas: 1
    - name: model2
      numReplicas: 1

  rayClusterConfig:
    rayVersion: 2.6.1   # should match the Ray version in the image of the containers
    enableInTreeAutoscaling: true
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # the following params are used to complete the ray start: ray start --head --block --redis-port=6379 ...
      rayStartParams:
        port: '6379' # should match container port named gcs-server
        dashboard-host: 0.0.0.0
      #pod template
      template:
        spec:
          containers:
          - name: ray-head
            image: DOCKER_IMAGE_URL
            imagePullPolicy: Always

            env:
            - name: RAY_LOG_TO_STDERR
              value: '1'
            - name: RAY_REDIS_ADDRESS
              value: redis://xxxxx.ng.0001.use1.cache.amazonaws.com:6379
            resources:
              limits:
                cpu: 2
                memory: 8Gi
              requests:
                cpu: 2
                memory: 8Gi
            ports:
            - containerPort: 6379
              name: gcs-server
            - containerPort: 8265     # Ray dashboard
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
    workerGroupSpecs:
      # the Pod replicas in this worker group
    - replicas: 1
      minReplicas: 1
      maxReplicas: 15
        # logical group name; for this example it is small-group, but it can also be a functional name
      groupName: small-group
      rayStartParams: {}
        #pod template
      template:
        spec:
          containers:
          - name: ray-worker     # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
            image: DOCKER_IMAGE_URL
            imagePullPolicy: Always
            lifecycle:
              preStop:
                exec:
                  command: [/bin/sh, -c, ray stop]
            resources:
              limits:
                cpu: 2
                memory: 8Gi
              requests:
                cpu: 2
                memory: 8Gi

Multi-application

demo1.py
import time

from ray import serve


@serve.deployment(name="model1")
class Model1:
    def __init__(self):
        # Simulate an init method
        time.sleep(60)

    async def __call__(self, http_request):
        # Another dummy change
        data: dict = await http_request.json()
        data["model"] = "model1"
        return data


model1 = Model1.bind()  # type: ignore
demo2.py
import time

from ray import serve


@serve.deployment(name="model2")
class Model2:
    def __init__(self):
        # Simulate an init method
        time.sleep(60)

    async def __call__(self, http_request):
        # Dummy change
        data: dict = await http_request.json()
        data["model"] = "model2"
        return data


model2 = Model2.bind()  # type: ignore
Dockerfile
FROM rayproject/ray:2.6.1-py310 as common

ENV WORKING_DIR /home/ray/models

WORKDIR ${WORKING_DIR}

ADD ./model_deployments/demo1.py ${WORKING_DIR}
ADD ./model_deployments/demo2.py ${WORKING_DIR}
rayservice_config.yaml
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-sample
  annotations:
    ray.io/ft-enabled: 'true'
    ray.io/external-storage-namespace: rayservice-sample
spec:
  serviceUnhealthySecondThreshold: 900 # Config for the health check threshold for service. Default value is 60.
  deploymentUnhealthySecondThreshold: 900 # Config for the health check threshold for deployments. Default value is 60.
  serveConfigV2:
    applications:
      - name: app1
        route_prefix: "/model1"
        import_path: "demo1:model1"
        deployments:
          - name: "model1"
            num_replicas: 1
    
      - name: app2
        route_prefix: "/model2"
        import_path: "demo2:model2"
        deployments:
          - name: "model2"
            num_replicas: 1

  rayClusterConfig:
    rayVersion: 2.6.1   # should match the Ray version in the image of the containers
    enableInTreeAutoscaling: true
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # the following params are used to complete the ray start: ray start --head --block --redis-port=6379 ...
      rayStartParams:
        port: '6379' # should match container port named gcs-server
        dashboard-host: 0.0.0.0
      #pod template
      template:
        spec:
          containers:
          - name: ray-head
            image: DOCKER_IMAGE_URL
            imagePullPolicy: Always

            env:
            - name: RAY_LOG_TO_STDERR
              value: '1'
            - name: RAY_REDIS_ADDRESS
              value: redis://xxxxx.ng.0001.use1.cache.amazonaws.com:6379
            resources:
              limits:
                cpu: 2
                memory: 8Gi
              requests:
                cpu: 2
                memory: 8Gi
            ports:
            - containerPort: 6379
              name: gcs-server
            - containerPort: 8265     # Ray dashboard
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
    workerGroupSpecs:
      # the Pod replicas in this worker group
    - replicas: 1
      minReplicas: 1
      maxReplicas: 15
        # logical group name; for this example it is small-group, but it can also be a functional name
      groupName: small-group
      rayStartParams: {}
        #pod template
      template:
        spec:
          containers:
          - name: ray-worker     # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
            image: DOCKER_IMAGE_URL
            imagePullPolicy: Always
            lifecycle:
              preStop:
                exec:
                  command: [/bin/sh, -c, ray stop]
            resources:
              limits:
                cpu: 2
                memory: 8Gi
              requests:
                cpu: 2
                memory: 8Gi

Deploy the single-application config first, then try to deploy the multi-application config. You should see the error above.

Anything else

A workaround here is:

  • Deploy the multi-app config without GCS FT.
  • Reboot the Redis instance.
  • Add GCS FT back in again.

If you don't reboot the Redis instance, you run into the same error again when trying to add GCS FT back in.
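
For reference, here is a minimal sketch of the annotation change this workaround implies, based on the metadata section of the rayservice_config.yaml above (only the GCS FT annotations change; everything else stays as shown in the repro):

# First deploy the multi-app config with GCS FT disabled by dropping the annotations:
metadata:
  name: rayservice-sample
  # annotations:
  #   ray.io/ft-enabled: 'true'
  #   ray.io/external-storage-namespace: rayservice-sample

# After rebooting the Redis instance, re-add the annotations unchanged:
metadata:
  name: rayservice-sample
  annotations:
    ray.io/ft-enabled: 'true'
    ray.io/external-storage-namespace: rayservice-sample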

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@smit-kiri smit-kiri added the bug Something isn't working label Aug 7, 2023
@smit-kiri smit-kiri changed the title [Bug] Cannot move from single app to multi-app without downtime if using GCS FT [Bug] [RayService] Cannot move from single app to multi-app without downtime if using GCS FT Aug 7, 2023
@smit-kiri
Author

This might be related to setting ray.io/external-storage-namespace explicitly

@kevin85421 kevin85421 removed the gcs ft label Aug 24, 2023
@kevin85421
Member

Thanks @smit-kiri for reporting this issue! This does not seem to be related to GCS FT, as far as I can tell. I can reproduce the issue as follows:

  • Create a RayService using serveConfig (Ray Serve API V1) with this YAML.

  • Comment out serveConfig and uncomment serveConfigV2 in the YAML. Use kubectl apply to update the RayService.

  • Check KubeRay operator logs

    2023-08-24T07:23:44.201Z	ERROR	controllers.RayService	fail to update deployment	{"error": "UpdateDeployments fail: 400 Bad Request \u001b[36mray::ServeController.deploy_apps()\u001b[39m (pid=302, ip=10.244.0.6, actor_id=26b13b037565cbf4d5afcb5701000000, repr=<ray.serve.controller.ServeController object at 0x7f50291c2490>)\n  File \"/home/ray/anaconda3/lib/python3.7/concurrent/futures/_base.py\", line 428, in result\n    return self.__get_result()\n  File \"/home/ray/anaconda3/lib/python3.7/concurrent/futures/_base.py\", line 384, in __get_result\n    raise self._exception\n  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/controller.py\", line 538, in deploy_apps\n    \"You are trying to deploy a multi-application config, however \"\nray.serve.exceptions.RayServeException: You are trying to deploy a multi-application config, however a single-application config has been deployed to the current Serve instance already. Mixing single-app and multi-app is not allowed. Please either redeploy using the single-application config format `ServeApplicationSchema`, or shutdown and restart Serve to submit a multi-app config of format `ServeDeploySchema`. If you are using the REST API, you can submit a multi-app config to the the multi-app API endpoint `/api/serve/applications/`."}
    

Ray Serve seems not to allow in-place upgrades between API V1 (single-app) and API V2 (multi-app). A workaround is to update not only serveConfig / serveConfigV2 but also rayVersion, which has no effect when the Ray version is 2.0.0 or later, for example by setting it to 2.100.0. This will trigger the preparation of a new RayCluster instead of an in-place update.
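
A minimal sketch of that workaround against the multi-app rayservice_config.yaml above; the serveConfigV2 section is omitted here, and 2.100.0 is an arbitrary placeholder since the value is ignored for Ray 2.0.0 or later:

spec:
  # serveConfigV2: switch to the multi-app Serve config here, as in the YAML above
  rayClusterConfig:
    # Changing rayVersion modifies the RayCluster spec, so KubeRay prepares a new
    # RayCluster and switches traffic to it (zero-downtime upgrade) instead of
    # pushing the new Serve config to the existing cluster in place.
    rayVersion: 2.100.0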

@smit-kiri
Author

Thanks @kevin85421 !
I was able to get around it by setting a different ray.io/external-storage-namespace
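
A minimal sketch of that change, assuming the multi-app manifest above; the new namespace value "rayservice-sample-v2" is just an illustrative name:

metadata:
  name: rayservice-sample
  annotations:
    ray.io/ft-enabled: 'true'
    # A value different from the one used by the single-app deployment,
    # presumably so the new cluster does not restore the old single-app
    # Serve state from Redis.
    ray.io/external-storage-namespace: rayservice-sample-v2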

@kevin85421
Member

> Thanks @kevin85421 ! I was able to get around it by setting a different ray.io/external-storage-namespace

Cool. I am still a bit confused. Do you only update serveConfig / serveConfigV2, or do you also update other fields? In the former case, it will only update the serve configurations in-place, while the latter case will trigger a zero-downtime upgrade. In my understanding, the former case will always report the exception ray.serve.exceptions.RayServeException whenever you upgrade from API V1 to API V2. If you trigger a zero-downtime upgrade, the different ray.io/external-storage-namespace solution makes sense to me.

@kevin85421 kevin85421 self-assigned this Aug 24, 2023
@smit-kiri
Author

We triggered a zero-downtime upgrade by updating the Docker image.
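
A minimal sketch of such a change in the head group of the rayClusterConfig above (the :v2 tag is hypothetical; the same image bump would be applied to the worker group):

headGroupSpec:
  template:
    spec:
      containers:
      - name: ray-head
        # A new image reference changes the RayCluster spec, so KubeRay prepares
        # a new cluster and performs a zero-downtime upgrade instead of an
        # in-place Serve config update.
        image: DOCKER_IMAGE_URL:v2
        imagePullPolicy: Always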

@kevin85421
Member

Updated the doc: ray-project/ray@ec19d15.

@kevin85421
Member

ray-project/ray#38647 has been merged. Closing this issue.
