Skip to content
Closed
Due by June 9, 2026
Closed Jul 3, 2026

Prove fleet-aware inference orchestration works. A single Modelplane instance manages an inference fleet across clouds, and on-premise Kubernetes clusters. Replicas are placed by hardware capability and topology, scaled fleet-wide, and served behind a basic gateway. Architecture is real; platform team-facing experience is real; surface area is intentionally small.

Scope:

  • Cluster registration across AWS, GCP, Azure and BYO clusters via kubeconfig.
  • InferenceClass with capability matching for hardware-aware scheduling; default catalog of common cloud and SKU combinations
  • ModelDeployment with engine as container model (engine name, version, image, args, env, imagePullSecrets grouped under engine)
  • Multi-engine support: vLLM, SGLang, TGI, NIM
  • Three replica topologies (Single, LeaderWorker, Disaggregated) composed automatically to the lightest backend (native Deployment, llm-d LeaderWorkerSet, or Dynamo)
  • InferenceGateway with health-aware and weighted routing across clusters
  • ModelService and ModelEndpoint for routing across replicas, includes load balancing across replicas.
  • Fleet-wide autoscaling via KEDA on the scale subresource
  • DRA as the universal device binding mechanism
100% complete

List view

    There are no open issues in this milestone

    Add issues to milestones to help organize your work for a particular release or project. Find and add issues with no milestones in this repo.