Description
TL;DR - This RFC proposes a new API to serve multiple models on a single Ray Serve deployment without splitting the models into separate deployments or Serve applications.
Problem Statement
Recently, the Serve team has seen growing interest in serving many independent models. In Ray 2.3 we added multi-app support, so you could run multiple, independent Serve applications on a single Ray cluster. Multi-app support lets Serve apps share resources at the cluster level, but there are some use cases better suited to sharing resources at the deployment level.
For example, a user may wish to run a Recommender deployment that serves personalized recommendations to their customers. Each end customer may have a personalized model that’s fine-tuned for their needs. For this use case, the user may have hundreds of recommendation models that should all be served by Recommender deployment replicas, rather than getting split into individual deployments for each model.
For this use case, Serve could provide an API that lets deployment replicas transparently load model weights from a set of models, route traffic to replicas that contain the requested model, and evict model weights whenever new models are needed.
In short, Serve deployments would offer model multiplexing.
Proposal
Introduce a new decorator, @serve.multiplexed, to Ray Serve. You can pass the argument max_num_models_per_replica to the decorator to control how many models can be loaded within each replica.
import logging
from typing import Any

import starlette.requests
from ray import serve

logger = logging.getLogger("ray.serve")


@serve.deployment(num_replicas=6)
class UserDeployment:
    @serve.multiplexed(max_num_models_per_replica=5)
    async def load_model(self, model_id: str) -> Any:
        # Load the model with the given ID.
        # You can use any model loading library here
        # and return the loaded model. load_from_s3 is
        # a placeholder function.
        logger.info(f"Loading model {model_id}")
        return load_from_s3(model_id)

    async def __call__(self, request: starlette.requests.Request):
        # Get the model_id from the request context.
        model_id = serve.get_multiplexed_model_id()
        # Load the model for the requested model_id.
        # If the model is already cached locally,
        # this will just be a dictionary lookup.
        model = await self.load_model(model_id)
        return model(request)
UserDeployment will have 6 replicas, each holding at most 5 loaded models at a time.
To run this code:

serve.run(UserDeployment.bind(), route_prefix="/app")
To query the endpoint, pass the model ID in a request header; Ray Serve will use it as a tag to route the traffic:

resp = requests.get(
    "http://127.0.0.1:8000/app", headers={"ray_serve_request_routing_tag": "1"}
)
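For illustration, the same endpoint can serve many models by varying the header value; the model IDs below are placeholders:

import requests

# Each request carries the ID of the model it needs. Serve routes it to a
# replica that already has that model loaded, or loads the model on a
# replica with spare capacity.
for model_id in ["1", "2", "3"]:
    resp = requests.get(
        "http://127.0.0.1:8000/app",
        headers={"ray_serve_request_routing_tag": model_id},
    )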
When there is no capacity left in the cluster to load a new model, Serve will use an LRU policy to evict an existing model from one of the replicas and load the new model there.
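For intuition only, here is a minimal sketch of the replica-local LRU behavior described above; ModelLRUCache and the loader callable are illustrative names, not part of the proposed API:

from collections import OrderedDict
from typing import Any, Callable


class ModelLRUCache:
    """Keeps at most `max_models` loaded models per replica and evicts the
    least recently used one when a new model needs to be loaded."""

    def __init__(self, max_models: int, loader: Callable[[str], Any]):
        self._max_models = max_models
        self._loader = loader  # e.g., load_from_s3
        self._models: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, model_id: str) -> Any:
        if model_id in self._models:
            # Cache hit: mark the model as most recently used.
            self._models.move_to_end(model_id)
            return self._models[model_id]
        if len(self._models) >= self._max_models:
            # Cache full: evict the least recently used model.
            self._models.popitem(last=False)
        model = self._loader(model_id)
        self._models[model_id] = model
        return model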
Autoscaling will still happen at the replica level; there is no per-model autoscaling in this proposal. Replicas will be scaled up or down based on QPS and request-queue metrics.
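For example, a multiplexed deployment could use Serve's existing replica-level autoscaling configuration in place of a fixed num_replicas; the numbers below are placeholders:

from ray import serve


@serve.deployment(
    autoscaling_config={
        "min_replicas": 2,
        "max_replicas": 10,
        "target_num_ongoing_requests_per_replica": 5,
    }
)
class UserDeployment:
    # Same body as the example above: a @serve.multiplexed load_model
    # method plus __call__.
    ...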
There could be more advanced options in the @serve.multiplexed decorator later, such as a per-replica memory limit or a per-model QPS limit. This proposal starts with the simplest one: setting the maximum number of models per replica.