[RFC] Ray Serve model multiplexing support #33253

Closed
@sihanwang41

Description

TL;DR - This RFC proposes a new API to serve multiple models on a single Ray Serve deployment without splitting the models into separate deployments or Serve applications.

Problem Statement

Recently, the Serve team has seen growing interest in serving many independent models. Ray 2.3 added multi-app support, which lets you run multiple independent Serve applications on a single Ray cluster. Multi-app support lets Serve apps share resources at the cluster level, but some use cases are better suited to sharing resources at the deployment level.

For example, a user may wish to run a Recommender deployment that serves personalized recommendations to their customers. Each end customer may have a personalized model that’s fine-tuned for their needs. For this use case, the user may have hundreds of recommendation models that should all be served by Recommender deployment replicas, rather than getting split into individual deployments for each model.

For this use case, Serve could provide an API that lets deployment replicas transparently load model weights from a set of models, route traffic to replicas that contain the requested model, and evict model weights whenever new models are needed.

In short, Serve deployments would offer model multiplexing.

Proposal

Introduce a new decorator, @serve.multiplexed, to Ray Serve. It wraps the model-loading method of a deployment, and you can pass the argument max_num_models_per_replica to control how many models can be loaded within each replica.

import logging
from typing import Any

import starlette.requests
from ray import serve

logger = logging.getLogger(__name__)


@serve.deployment(num_replicas=6)
class UserDeployment:
    @serve.multiplexed(max_num_models_per_replica=5)
    async def load_model(self, model_id: str) -> Any:
        # Load the model with the given ID. You can use any model
        # loading library here and return the loaded model.
        # load_from_s3 is a placeholder function.
        logger.info(f"Loading model {model_id}")
        return load_from_s3(model_id)

    async def __call__(self, request: starlette.requests.Request):
        # Get the model_id from the request context.
        model_id = serve.get_multiplexed_model_id()
        # Load the model for the requested model_id. If the model is
        # already cached on this replica, this is just a dictionary lookup.
        model = await self.load_model(model_id)
        return model(request)

UserDeployment will have 6 replicas, each holding at most 5 models at a time.

To run this code:

serve.run(UserDeployment.bind(), route_prefix="/app")

To query the endpoint, include the model ID as a request header. Ray Serve uses it as a routing tag to send traffic to a replica that already has that model loaded:

resp = requests.get(
    "http://127.0.0.1:8000/app", headers={"ray_serve_request_routing_tag": "1"}
)

When there is no space left in the cluster to load a new model, Serve will use an LRU policy: it evicts the least recently used model from a replica and loads the new model in its place.
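
For illustration, here is a minimal sketch of the kind of per-replica LRU cache the multiplexed loader could maintain. This is not Serve's actual implementation; LRUModelCache and load_fn are names invented for this sketch, and a replica would call cache.get(model_id) with something like the load_from_s3 placeholder above as the loader.

from collections import OrderedDict
from typing import Any, Callable


class LRUModelCache:
    """Per-replica model cache with least-recently-used eviction (sketch)."""

    def __init__(self, max_num_models: int, load_fn: Callable[[str], Any]):
        self._max_num_models = max_num_models
        self._load_fn = load_fn
        self._models: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, model_id: str) -> Any:
        if model_id in self._models:
            # Cache hit: mark the model as most recently used.
            self._models.move_to_end(model_id)
            return self._models[model_id]
        # Cache miss: evict the least recently used model if at capacity.
        if len(self._models) >= self._max_num_models:
            self._models.popitem(last=False)
        model = self._load_fn(model_id)
        self._models[model_id] = model
        return model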

Autoscaling will still happen at the replica level; this proposal does not add per-model autoscaling. Replicas will be scaled up or down based on QPS and request-queue metrics.
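
For reference, replica-level autoscaling would be configured the same way it is today, through Serve's autoscaling_config on the deployment (which replaces a fixed num_replicas). The field values below are illustrative only:

@serve.deployment(
    autoscaling_config={
        "min_replicas": 2,
        "max_replicas": 6,
        "target_num_ongoing_requests_per_replica": 5,
    }
)
class UserDeployment:
    ...  # same multiplexed load_model and __call__ as above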

There could potentially be more advanced options in the @serve.multiplexed decorator, such as a memory size limit or a QPS limit. This proposal starts with the simplest option: limiting the number of models per replica.
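
Such options might look like the following. None of these extra arguments exist today; the names are purely hypothetical and shown only to make the idea concrete:

@serve.multiplexed(
    max_num_models_per_replica=5,
    # Hypothetical future options, not part of this proposal's API:
    max_model_memory_bytes=8 * 1024**3,
    max_qps_per_model=100,
)
async def load_model(self, model_id: str) -> Any:
    return load_from_s3(model_id)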
