KubeAI consists of two primary components:
1. A model proxy: the KubeAI proxy provides an OpenAI-compatible API layer. Behind this API, the proxy implements a prefix-aware load balancing strategy that optimizes KV cache utilization across the backend serving engines (e.g. vLLM); a routing sketch follows this list. The proxy also implements request queueing (while the system scales up from zero replicas) and request retries (to seamlessly handle bad backends).
2. A model operator: the KubeAI model operator manages backend server instances (Pods) directly. It automates common operations such as downloading models, mounting volumes, and loading dynamic LoRA adapters.
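To make the prefix-aware idea concrete, here is a minimal sketch of how such a router could work, assuming a fixed-length prompt prefix and a static backend list (the function and parameter names are illustrative, not KubeAI's actual code):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickBackend hashes the first prefixLen bytes of the prompt and maps the
// hash onto the backend list. Prompts that share that prefix land on the
// same backend, so its KV cache entries for the prefix can be reused.
func pickBackend(prompt string, backends []string, prefixLen int) string {
	if len(prompt) < prefixLen {
		prefixLen = len(prompt)
	}
	h := fnv.New32a()
	h.Write([]byte(prompt[:prefixLen]))
	return backends[h.Sum32()%uint32(len(backends))]
}

func main() {
	backends := []string{"vllm-0:8000", "vllm-1:8000", "vllm-2:8000"}
	// Both prompts share a system-prompt prefix, so both requests are
	// routed to the same vLLM replica.
	fmt.Println(pickBackend("You are a helpful assistant. Summarize: ...", backends, 32))
	fmt.Println(pickBackend("You are a helpful assistant. Translate: ...", backends, 32))
}
```

A production router would likely use consistent hashing (so that adding or removing a backend only remaps a small fraction of prefixes) and combine prefix affinity with load-aware tie-breaking; plain modulo hashing is used here only to keep the sketch short.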
Both of these components are colocated in the same deployment to keep things simple. They integrate with each other to provide functionality like scale-from-zero and dynamic LoRA routing. Because this integration happens through the Kubernetes API rather than through direct calls, it would be possible to deploy one without the other.
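As one illustration of that loose coupling, the proxy only needs the API server to discover the backends the operator created. A minimal sketch, assuming (hypothetically) that the operator labels backend Pods with `model=<name>`:

```go
package proxy

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// readyBackends returns the Pod IPs of Ready backends serving the given
// model, discovered through the Kubernetes API rather than by talking to
// the operator directly.
func readyBackends(ctx context.Context, cs kubernetes.Interface, ns, model string) ([]string, error) {
	pods, err := cs.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{
		LabelSelector: "model=" + model, // hypothetical label set by the operator
	})
	if err != nil {
		return nil, err
	}
	var addrs []string
	for _, p := range pods.Items {
		for _, c := range p.Status.Conditions {
			if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
				addrs = append(addrs, p.Status.PodIP)
			}
		}
	}
	return addrs, nil
}
```

In practice the proxy would watch Pods (or a higher-level resource) rather than list on every request, but the contract stays the same: the Kubernetes API, not a direct dependency between the two binaries.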
This issue is here to gather feedback on whether the proxy and the operator should be independently deployable.