[Bug] [RayService] Cannot move from single app to multi-app without downtime if using GCS FT #1297

Closed
1 of 2 tasks
smit-kiri opened this issue Aug 7, 2023 · 7 comments
Labels: bug, rayservice

smit-kiri commented Aug 7, 2023

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

I'm trying to move all our workloads from a single-application to a multi-application RayService with the release of KubeRay v0.6.0, and it does not seem possible to do so without downtime if we're using GCS FT. I see the following error:

ray.serve.exceptions.RayServeException: You are trying to deploy a multi-application config, however a single-application 
config has been deployed to the current Serve instance already. Mixing single-app and multi-app is not allowed. Please either
redeploy using the single-application config format `ServeApplicationSchema`, or shutdown and restart Serve to submit a
multi-app config of format `ServeDeploySchema`. If you are using the REST API, you can submit a multi-app config to the
the multi-app API endpoint `/api/serve/applications/`.

Reproduction script

Single application:

demo.py
import time

from ray import serve
from ray.serve.drivers import DAGDriver


@serve.deployment(name="model1")
class Model1:
    def __init__(self):
        # Simulate an init method
        time.sleep(60)

    async def __call__(self, http_request):
        data: dict = await http_request.json()
        data["model"] = "model1"
        return data


@serve.deployment(name="model2")
class Model2:
    def __init__(self):
        # Simulate an init method
        time.sleep(60)

    async def __call__(self, http_request):
        data: dict = await http_request.json()
        data["model"] = "model2"
        return data


driver = DAGDriver.bind({"/model1": Model1.bind(), "/model2": Model2.bind()})  # type: ignore
Dockerfile
FROM rayproject/ray:2.6.1-py310 as common

ENV WORKING_DIR /home/ray/models

WORKDIR ${WORKING_DIR}

ADD ./demo.py ${WORKING_DIR}
rayservice_config.yaml
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-sample
  annotations:
    ray.io/ft-enabled: 'true'
    ray.io/external-storage-namespace: rayservice-sample
spec:
  serviceUnhealthySecondThreshold: 900 # Config for the health check threshold for service. Default value is 60.
  deploymentUnhealthySecondThreshold: 900 # Config for the health check threshold for deployments. Default value is 60.
  serveConfig:
    importPath: demo:driver
    deployments:
    - name: model1
      numReplicas: 1
    - name: model2
      numReplicas: 1

  rayClusterConfig:
    rayVersion: 2.6.1   # should match the Ray version in the image of the containers
    enableInTreeAutoscaling: true
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # the following params are used to complete the ray start: ray start --head --block --redis-port=6379 ...
      rayStartParams:
        port: '6379' # should match container port named gcs-server
        dashboard-host: 0.0.0.0
      #pod template
      template:
        spec:
          containers:
          - name: ray-head
            image: DOCKER_IMAGE_URL
            imagePullPolicy: Always

            env:
            - name: RAY_LOG_TO_STDERR
              value: '1'
            - name: RAY_REDIS_ADDRESS
              value: redis://xxxxx.ng.0001.use1.cache.amazonaws.com:6379
            resources:
              limits:
                cpu: 2
                memory: 8Gi
              requests:
                cpu: 2
                memory: 8Gi
            ports:
            - containerPort: 6379
              name: gcs-server
            - containerPort: 8265     # Ray dashboard
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
    workerGroupSpecs:
      # the Pod replicas in this worker group
    - replicas: 1
      minReplicas: 1
      maxReplicas: 15
        # logical group name; for this example it is small-group, but it can also be a functional name
      groupName: small-group
      rayStartParams: {}
        #pod template
      template:
        spec:
          containers:
          - name: ray-worker     # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
            image: DOCKER_IMAGE_URL
            imagePullPolicy: Always
            lifecycle:
              preStop:
                exec:
                  command: [/bin/sh, -c, ray stop]
            resources:
              limits:
                cpu: 2
                memory: 8Gi
              requests:
                cpu: 2
                memory: 8Gi

Multi-application

demo1.py
import time

from ray import serve


@serve.deployment(name="model1")
class Model1:
    def __init__(self):
        # Simulate an init method
        time.sleep(60)

    async def __call__(self, http_request):
        # Another dummy change
        data: dict = await http_request.json()
        data["model"] = "model1"
        return data


model1 = Model1.bind()  # type: ignore
demo2.py
import time

from ray import serve


@serve.deployment(name="model2")
class Model2:
    def __init__(self):
        # Simulate an init method
        time.sleep(60)

    async def __call__(self, http_request):
        # Dummy change
        data: dict = await http_request.json()
        data["model"] = "model2"
        return data


model2 = Model2.bind()  # type: ignore
Dockerfile
FROM rayproject/ray:2.6.1-py310 as common

ENV WORKING_DIR /home/ray/models

WORKDIR ${WORKING_DIR}

ADD ./model_deployments/demo1.py ${WORKING_DIR}
ADD ./model_deployments/demo2.py ${WORKING_DIR}
rayservice_config.yaml
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-sample
  annotations:
    ray.io/ft-enabled: 'true'
    ray.io/external-storage-namespace: rayservice-sample
spec:
  serviceUnhealthySecondThreshold: 900 # Config for the health check threshold for service. Default value is 60.
  deploymentUnhealthySecondThreshold: 900 # Config for the health check threshold for deployments. Default value is 60.
  serveConfigV2:
    applications:
      - name: app1
        route_prefix: "/model1"
        import_path: "demo1:model1"
        deployments:
          - name: "model1"
            num_replicas: 1
    
      - name: app2
        route_prefix: "/model2"
        import_path: "demo2:model2"
        deployments:
          - name: "model2"
            num_replicas: 1

  rayClusterConfig:
    rayVersion: 2.6.1   # should match the Ray version in the image of the containers
    enableInTreeAutoscaling: true
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # the following params are used to complete the ray start: ray start --head --block --redis-port=6379 ...
      rayStartParams:
        port: '6379' # should match container port named gcs-server
        dashboard-host: 0.0.0.0
      #pod template
      template:
        spec:
          containers:
          - name: ray-head
            image: DOCKER_IMAGE_URL
            imagePullPolicy: Always

            env:
            - name: RAY_LOG_TO_STDERR
              value: '1'
            - name: RAY_REDIS_ADDRESS
              value: redis://xxxxx.ng.0001.use1.cache.amazonaws.com:6379
            resources:
              limits:
                cpu: 2
                memory: 8Gi
              requests:
                cpu: 2
                memory: 8Gi
            ports:
            - containerPort: 6379
              name: gcs-server
            - containerPort: 8265     # Ray dashboard
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
    workerGroupSpecs:
      # the Pod replicas in this worker group
    - replicas: 1
      minReplicas: 1
      maxReplicas: 15
        # logical group name; for this example it is small-group, but it can also be a functional name
      groupName: small-group
      rayStartParams: {}
        #pod template
      template:
        spec:
          containers:
          - name: ray-worker     # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
            image: DOCKER_IMAGE_URL
            imagePullPolicy: Always
            lifecycle:
              preStop:
                exec:
                  command: [/bin/sh, -c, ray stop]
            resources:
              limits:
                cpu: 2
                memory: 8Gi
              requests:
                cpu: 2
                memory: 8Gi

Deploy the single-application config first, then try to deploy the multi-application config. You should see the error above.

Anything else

A workaround here is:

  • Deploy the multi-app config without GCS FT.
  • Reboot the Redis instance.
  • Add GCS FT back in again.

If you don't reboot the Redis instance, you run into the same error again when trying to add GCS FT back in.
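
For reference, here is a minimal sketch of the annotation change this workaround implies, based on the metadata section of the rayservice_config.yaml above (only the GCS FT annotations change; everything else stays as shown in the repro):

# First deploy the multi-app config with GCS FT disabled by dropping the annotations:
metadata:
  name: rayservice-sample
  # annotations:
  #   ray.io/ft-enabled: 'true'
  #   ray.io/external-storage-namespace: rayservice-sample

# After rebooting the Redis instance, re-add the annotations unchanged:
metadata:
  name: rayservice-sample
  annotations:
    ray.io/ft-enabled: 'true'
    ray.io/external-storage-namespace: rayservice-sample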

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@smit-kiri smit-kiri added the bug Something isn't working label Aug 7, 2023
@smit-kiri smit-kiri changed the title [Bug] Cannot move from single app to multi-app without downtime if using GCS FT [Bug] [RayService] Cannot move from single app to multi-app without downtime if using GCS FT Aug 7, 2023
@smit-kiri
Author

This might be related to setting ray.io/external-storage-namespace explicitly

@kevin85421 kevin85421 removed the gcs ft label Aug 24, 2023
@kevin85421
Member

Thanks @smit-kiri for reporting this issue! This does not seem to be related to GCS FT, as far as I can tell. I can reproduce the issue as follows:

  • Create a RayService using serveConfig (Ray Serve API V1) with this YAML.

  • Comment out serveConfig and uncomment serveConfigV2 in the YAML. Use kubectl apply to update the RayService.

  • Check KubeRay operator logs

    2023-08-24T07:23:44.201Z	ERROR	controllers.RayService	fail to update deployment	{"error": "UpdateDeployments fail: 400 Bad Request \u001b[36mray::ServeController.deploy_apps()\u001b[39m (pid=302, ip=10.244.0.6, actor_id=26b13b037565cbf4d5afcb5701000000, repr=<ray.serve.controller.ServeController object at 0x7f50291c2490>)\n  File \"/home/ray/anaconda3/lib/python3.7/concurrent/futures/_base.py\", line 428, in result\n    return self.__get_result()\n  File \"/home/ray/anaconda3/lib/python3.7/concurrent/futures/_base.py\", line 384, in __get_result\n    raise self._exception\n  File \"/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/controller.py\", line 538, in deploy_apps\n    \"You are trying to deploy a multi-application config, however \"\nray.serve.exceptions.RayServeException: You are trying to deploy a multi-application config, however a single-application config has been deployed to the current Serve instance already. Mixing single-app and multi-app is not allowed. Please either redeploy using the single-application config format `ServeApplicationSchema`, or shutdown and restart Serve to submit a multi-app config of format `ServeDeploySchema`. If you are using the REST API, you can submit a multi-app config to the the multi-app API endpoint `/api/serve/applications/`."}
    

Ray Serve seems not to allow in-place upgrades between API V1 (single-app) and API V2 (multi-app). A workaround is to update not only serveConfig / serveConfigV2 but also rayVersion, which has no effect when the Ray version is 2.0.0 or later, for example by setting it to 2.100.0. This will trigger the preparation of a new RayCluster instead of an in-place update.
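
A minimal sketch of that workaround against the multi-app rayservice_config.yaml above; the serveConfigV2 section is omitted here, and 2.100.0 is an arbitrary placeholder since the value is ignored for Ray 2.0.0 or later:

spec:
  # serveConfigV2: switch to the multi-app Serve config here, as in the YAML above
  rayClusterConfig:
    # Changing rayVersion modifies the RayCluster spec, so KubeRay prepares a new
    # RayCluster and switches traffic to it (zero-downtime upgrade) instead of
    # pushing the new Serve config to the existing cluster in place.
    rayVersion: 2.100.0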

@smit-kiri
Author

Thanks @kevin85421 !
I was able to get around it by setting a different ray.io/external-storage-namespace
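
A minimal sketch of that change, assuming the multi-app manifest above; the new namespace value "rayservice-sample-v2" is just an illustrative name:

metadata:
  name: rayservice-sample
  annotations:
    ray.io/ft-enabled: 'true'
    # A value different from the one used by the single-app deployment,
    # presumably so the new cluster does not restore the old single-app
    # Serve state from Redis.
    ray.io/external-storage-namespace: rayservice-sample-v2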

@kevin85421
Member

> Thanks @kevin85421 ! I was able to get around it by setting a different ray.io/external-storage-namespace

Cool. I am still a bit confused. Do you only update serveConfig / serveConfigV2, or do you also update other fields? In the former case, it will only update the serve configurations in-place, while the latter case will trigger a zero-downtime upgrade. In my understanding, the former case will always report the exception ray.serve.exceptions.RayServeException whenever you upgrade from API V1 to API V2. If you trigger a zero-downtime upgrade, the different ray.io/external-storage-namespace solution makes sense to me.

@kevin85421 kevin85421 self-assigned this Aug 24, 2023
@smit-kiri
Author

We triggered a zero-downtime upgrade by updating the Docker image.
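
A minimal sketch of such a change in the head group of the rayClusterConfig above (the :v2 tag is hypothetical; the same image bump would be applied to the worker group):

headGroupSpec:
  template:
    spec:
      containers:
      - name: ray-head
        # A new image reference changes the RayCluster spec, so KubeRay prepares
        # a new cluster and performs a zero-downtime upgrade instead of an
        # in-place Serve config update.
        image: DOCKER_IMAGE_URL:v2
        imagePullPolicy: Always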

@kevin85421
Member

Updated the doc: ray-project/ray@ec19d15.

@kevin85421
Member

ray-project/ray#38647 has been merged. Closing this issue.
