GitHub

Prove fleet-aware inference orchestration works. A single Modelplane instance manages an inference fleet across clouds, and on-premise Kubernetes clusters. Replicas are placed by hardware capability and topology, scaled fleet-wide, and served behind a basic gateway. Architecture is real; platform team-facing experience is real; surface area is intentionally small.

Scope:

Cluster registration across AWS, GCP, Azure and BYO clusters via kubeconfig.
InferenceClass with capability matching for hardware-aware scheduling; default catalog of common cloud and SKU combinations
ModelDeployment with engine as container model (engine name, version, image, args, env, imagePullSecrets grouped under engine)
Multi-engine support: vLLM, SGLang, TGI, NIM
Three replica topologies (Single, LeaderWorker, Disaggregated) composed automatically to the lightest backend (native Deployment, llm-d LeaderWorkerSet, or Dynamo)
InferenceGateway with health-aware and weighted routing across clusters
ModelService and ModelEndpoint for routing across replicas, includes load balancing across replicas.
Fleet-wide autoscaling via KEDA on the scale subresource
DRA as the universal device binding mechanism

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

v0.1

There are no open issues in this milestone

Uh oh!

v0.1

List view

There are no open issues in this milestone