Consider supporting individual deployments of the model-proxy and model-operator #430

Description

@nstogner

KubeAI consists of two primary components:

1. A model proxy: the KubeAI proxy provides an OpenAI-compatible API layer. Behind this API, the proxy implements a prefix-aware load balancing strategy that optimizes KV cache utilization of the backend serving engines (i.e. vLLM). The proxy also implements request queueing (while the system scales up from zero replicas) and request retries (to seamlessly handle bad backends). A minimal routing sketch follows this list.

2. A model operator: the KubeAI model operator manages backend server instances (Pods) directly. It automates common operations such as downloading models, mounting volumes, and loading dynamic LoRA adapters. A sketch of Pod management also follows this list.
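To make the prefix-aware routing idea concrete, here is a minimal Go sketch. It is not KubeAI's actual algorithm; the backend names and the 32-byte prefix length are illustrative assumptions. The point is only that requests sharing a prompt prefix are pinned to the same replica so its KV cache stays warm.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickBackend routes a request by hashing the leading bytes of the prompt,
// so prompts that share a prefix (e.g. a common system prompt) land on the
// same backend and can reuse that replica's KV cache.
func pickBackend(prompt string, backends []string, prefixLen int) string {
	if len(backends) == 0 {
		return ""
	}
	prefix := prompt
	if len(prefix) > prefixLen {
		prefix = prefix[:prefixLen]
	}
	h := fnv.New32a()
	h.Write([]byte(prefix))
	return backends[int(h.Sum32())%len(backends)]
}

func main() {
	backends := []string{"vllm-0", "vllm-1", "vllm-2"}
	// Both requests share the same leading prefix, so they hit the same replica.
	fmt.Println(pickBackend("You are a helpful assistant. Summarize doc A.", backends, 32))
	fmt.Println(pickBackend("You are a helpful assistant. Summarize doc B.", backends, 32))
}
```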
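On the operator side, here is a rough sketch of what managing backend Pods directly could look like with plain client-go. The Pod name, labels, image, and args are illustrative assumptions, and the real operator does considerably more (volume mounts, model downloads, LoRA adapters); this is not its actual code.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// ensureBackendPod creates a serving Pod for the given model if one does not
// already exist. Pod name, labels, image, and args are illustrative only.
func ensureBackendPod(ctx context.Context, client kubernetes.Interface, namespace, model string) error {
	name := "kubeai-backend-" + model
	_, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
	if err == nil {
		return nil // already running
	}
	if !apierrors.IsNotFound(err) {
		return err
	}
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:   name,
			Labels: map[string]string{"app": "kubeai-backend", "model": model},
		},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "server",
				Image: "vllm/vllm-openai:latest", // illustrative serving engine image
				Args:  []string{"--model", model},
			}},
		},
	}
	_, err = client.CoreV1().Pods(namespace).Create(ctx, pod, metav1.CreateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := ensureBackendPod(context.Background(), client, "kubeai", "llama-3-8b"); err != nil {
		panic(err)
	}
	fmt.Println("backend Pod ensured")
}
```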

Both of these components are colocated in the same deployment to keep things simple. They integrate with each other to provide functionality like scale-from-zero and dynamic LoRA routing. This integration is done via the Kubernetes API, so it would be possible to deploy one without the other.
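As a sketch of how that Kubernetes-API-mediated integration leaves each side independently deployable, the proxy could discover operator-managed backend Pods with nothing more than a label-selector query. The label, namespace, and port below are assumptions, not KubeAI's actual conventions.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// listBackendAddrs returns the addresses of running backend Pods, discovered
// purely through the Kubernetes API rather than an in-process handoff.
func listBackendAddrs(ctx context.Context, client kubernetes.Interface, namespace string) ([]string, error) {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "app=kubeai-backend", // illustrative label
	})
	if err != nil {
		return nil, err
	}
	var addrs []string
	for _, p := range pods.Items {
		if p.Status.Phase == corev1.PodRunning && p.Status.PodIP != "" {
			addrs = append(addrs, fmt.Sprintf("http://%s:8000", p.Status.PodIP)) // assumed serving port
		}
	}
	return addrs, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	addrs, err := listBackendAddrs(context.Background(), client, "kubeai")
	if err != nil {
		panic(err)
	}
	fmt.Println("backends:", addrs)
}
```

Because the only contract between the two components is state read and written through the Kubernetes API, deploying one without the other is possible in principle, as noted above.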

This issue is here to gather feedback on whether the proxy and the operator should be independently deployable.
