What problem are you facing?
The control plane's InferenceGateway uses basic path-based HTTPRoute routing across model placements. The Gateway API Inference Extension adds InferenceModel and InferencePool CRDs to Envoy Gateway with features like criticality-based priority and weighted traffic splitting between backends.
How could Modelplane help solve your problem?
The ModelDeployment function could compose InferenceModel and InferencePool resources on the control plane instead of (or alongside) plain HTTPRoutes. This would give us:
- Criticality-based priority — Critical requests preempt BestEffort under load, useful for platform teams that want to prioritize production traffic over experimentation.
- Weighted traffic splitting — Route a percentage of traffic to specific environments, enabling canary deployments across placements.
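To make the two behaviors above concrete, here is a minimal sketch. It is illustrative only and is not the extension's actual scheduling logic. The `Target` type, the `pick_target` and `admit` helpers, and the placement names are all hypothetical; the criticality labels are the ones used in this issue.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    name: str    # hypothetical placement name, e.g. "stable" or "canary"
    weight: int  # relative share of traffic


def pick_target(targets, rng):
    """Choose a target with probability proportional to its weight."""
    candidates = [t for t in targets if t.weight > 0]
    if not candidates:
        raise ValueError("at least one target needs a positive weight")
    total = sum(t.weight for t in candidates)
    point = rng.random() * total  # uniform in [0, total)
    running = 0
    for t in candidates:
        running += t.weight
        if point < running:
            return t
    return candidates[-1]  # guard against floating-point edge cases


def admit(criticality, saturated):
    """Admit everything when there is headroom; under load, only Critical."""
    return (not saturated) or criticality == "Critical"
```

With weights of 90 and 10, roughly one request in ten would reach the canary placement, and a saturated backend would keep serving `Critical` requests while shedding the rest.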
One design constraint: model-name routing (a single endpoint, where the model name in the request body selects the backend) doesn't fit cleanly in a multi-tenant platform. Two teams deploying the same model would share a model name but need separate endpoints. The current path-based approach (/<namespace>/<deployment>/v1/chat/completions) handles multi-tenancy correctly, because the namespace and deployment in the path keep each team's endpoint distinct.
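The collision can be shown with a small sketch. The helper names, namespaces, deployment name, and model name below are hypothetical; only the path layout comes from this issue.

```python
from collections import defaultdict


def route_path(namespace, deployment, suffix="/v1/chat/completions"):
    """Build the path-based route described in this issue."""
    return f"/{namespace}/{deployment}{suffix}"


def model_name_collisions(deployments):
    """Return model names that more than one deployment would claim
    if the model name alone selected the backend."""
    owners = defaultdict(list)
    for namespace, deployment, model in deployments:
        owners[model].append((namespace, deployment))
    return {model: who for model, who in owners.items() if len(who) > 1}


# Two teams deploy the same model under their own namespaces.
deployments = [
    ("team-a", "chat", "example-model"),
    ("team-b", "chat", "example-model"),
]

paths = [route_path(ns, dep) for ns, dep, _ in deployments]
collisions = model_name_collisions(deployments)
```

Here `paths` holds two distinct routes, while `collisions` reports `example-model` as claimed by both teams, which is the ambiguity that model-name routing would have to resolve.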