Closed Jul 3, 2026
Due by June 9, 2026
Prove fleet-aware inference orchestration works. A single Modelplane instance manages an inference fleet across clouds and on-premise Kubernetes clusters. Replicas are placed by hardware capability and topology, scaled fleet-wide, and served behind a basic gateway. The architecture is real; the platform-team-facing experience is real; the surface area is intentionally small.
Scope:
- Cluster registration across AWS, GCP, Azure, and BYO clusters via kubeconfig
- InferenceClass with capability matching for hardware-aware scheduling; default catalog of common cloud and SKU combinations
- ModelDeployment with an engine-as-container model (engine name, version, image, args, env, and imagePullSecrets grouped under engine)
- Multi-engine support: vLLM, SGLang, TGI, NIM
- Three replica topologies (Single, LeaderWorker, Disaggregated), each composed automatically onto the lightest backend (native Deployment, llm-d LeaderWorkerSet, or Dynamo)
- InferenceGateway with health-aware and weighted routing across clusters
- ModelService and ModelEndpoint for routing and load balancing across replicas
- Fleet-wide autoscaling via KEDA on the scale subresource
- DRA as the universal device binding mechanism
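To make the placement and routing items above concrete, here is a minimal Python sketch. It is not Modelplane's implementation or API. The `Cluster` type, the capability keys (`gpu_model`, `gpu_count`), and the function names are all hypothetical, and the topology-to-backend table assumes the three backends pair with the three topologies in the order they are listed in the scope. Matching treats numeric requirements as minimums and everything else as exact equality, which is one plausible reading of "capability matching".

```python
import random
from dataclasses import dataclass

# Assumption: backends pair with topologies in the order the scope lists them.
BACKEND_FOR_TOPOLOGY = {
    "Single": "Deployment",
    "LeaderWorker": "LeaderWorkerSet",
    "Disaggregated": "Dynamo",
}


@dataclass
class Cluster:
    """Hypothetical view of a registered cluster."""
    name: str
    capabilities: dict
    healthy: bool = True
    weight: int = 1


def _is_number(value) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def matches(required: dict, offered: dict) -> bool:
    """Numeric requirements are minimums; all other values must be equal."""
    for key, want in required.items():
        if key not in offered:
            return False
        have = offered[key]
        if _is_number(want) and _is_number(have):
            if have < want:
                return False
        elif have != want:
            return False
    return True


def eligible(required: dict, clusters: list) -> list:
    """Healthy clusters with positive weight that satisfy the requirements."""
    return [
        c for c in clusters
        if c.healthy and c.weight > 0 and matches(required, c.capabilities)
    ]


def pick(candidates: list, rng=random) -> Cluster:
    """Weighted random choice among the eligible clusters."""
    if not candidates:
        raise LookupError("no healthy cluster satisfies the requirements")
    weights = [c.weight for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    fleet = [
        Cluster("aws-east", {"gpu_model": "H100", "gpu_count": 8}, weight=3),
        Cluster("gcp-west", {"gpu_model": "H100", "gpu_count": 8}, weight=1),
        Cluster("onprem", {"gpu_model": "A100", "gpu_count": 4}),
    ]
    required = {"gpu_model": "H100", "gpu_count": 4}
    target = pick(eligible(required, fleet))
    print(target.name, BACKEND_FOR_TOPOLOGY["Single"])
```

Filtering on health before the weighted draw means an unhealthy cluster never receives traffic regardless of its weight, which is the simplest way to combine the two routing criteria; a real gateway would likely also account for load and locality.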
100% complete
There are no open issues in this milestone