Explore inference-aware routing on the control plane gateway #8

Description

@negz

What problem are you facing?

The control plane's InferenceGateway currently routes requests to model placements using only basic path-based HTTPRoutes. The Gateway API Inference Extension adds InferenceModel and InferencePool CRDs to Envoy Gateway, with features such as criticality-based priority and weighted traffic splitting between backends.

How could Modelplane help solve your problem?

The ModelDeployment function could compose InferenceModel and InferencePool resources on the control plane instead of (or alongside) plain HTTPRoutes. This would give us:

  • Criticality-based priority — Critical requests preempt BestEffort under load, useful for platform teams that want to prioritize production traffic over experimentation.
  • Weighted traffic splitting — Route a percentage of traffic to specific environments, enabling canary deployments across placements.
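As a rough illustration of what composing these resources might involve, the sketch below builds InferencePool and InferenceModel manifests as plain Python dictionaries. The API group/version and field names (`poolRef`, `criticality`, `targetModels`, `targetPortNumber`) follow the extension's v1alpha2 schema as best I recall and should be checked against the installed CRDs. The helper functions, resource names, and weights are hypothetical, and the allowed `criticality` values depend on the CRD version.

```python
from typing import Any

# Assumption: the v1alpha2 API group of the Gateway API Inference Extension.
API_VERSION = "inference.networking.x-k8s.io/v1alpha2"


def inference_pool(name: str, namespace: str,
                   selector: dict[str, str], port: int) -> dict[str, Any]:
    """Build an InferencePool that selects model-server pods by label."""
    return {
        "apiVersion": API_VERSION,
        "kind": "InferencePool",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "selector": selector,
            "targetPortNumber": port,
        },
    }


def inference_model(name: str, namespace: str, pool: str, model_name: str,
                    criticality: str, targets: dict[str, int]) -> dict[str, Any]:
    """Build an InferenceModel with a priority level and weighted targets."""
    if any(weight < 0 for weight in targets.values()):
        raise ValueError("weights must be non-negative")
    return {
        "apiVersion": API_VERSION,
        "kind": "InferenceModel",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "modelName": model_name,
            # Allowed values depend on the CRD version; passed through as-is.
            "criticality": criticality,
            "poolRef": {"name": pool},
            # Each target receives traffic in proportion to its weight.
            "targetModels": [
                {"name": target, "weight": weight}
                for target, weight in targets.items()
            ],
        },
    }


# Hypothetical usage: a 90/10 canary split behind one pool.
pool = inference_pool("llama-pool", "team-a", {"app": "vllm-llama"}, 8000)
model = inference_model("llama", "team-a", "llama-pool", "llama",
                        "Critical", {"llama-stable": 90, "llama-canary": 10})
```

In a composition function these dictionaries would be emitted as desired resources next to, or in place of, the existing HTTPRoute.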

One design constraint: model-name routing (a single endpoint where the model name in the request body selects the backend) doesn't fit cleanly in a multi-tenant platform. Two teams deploying the same model would share a model name but need separate endpoints. The current path-based approach (/<namespace>/<deployment>/v1/chat/completions) handles multi-tenancy correctly because each deployment gets its own path prefix.
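To make the multi-tenancy point concrete, here is a minimal sketch of how a per-deployment path prefix keeps two teams apart even when they serve the same model. The HTTPRoute fields are from the standard Gateway API (`gateway.networking.k8s.io/v1`). The function names, gateway name, service names, and port are hypothetical and not taken from Modelplane.

```python
from typing import Any


def route_prefix(namespace: str, deployment: str) -> str:
    """Path prefix that uniquely identifies one deployment in one namespace."""
    return f"/{namespace}/{deployment}"


def http_route(namespace: str, deployment: str, gateway: str,
               service: str, port: int) -> dict[str, Any]:
    """Build an HTTPRoute that sends one prefix to one backend service."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": deployment, "namespace": namespace},
        "spec": {
            "parentRefs": [{"name": gateway}],
            "rules": [{
                "matches": [{
                    "path": {
                        "type": "PathPrefix",
                        "value": route_prefix(namespace, deployment),
                    },
                }],
                "backendRefs": [{"name": service, "port": port}],
            }],
        },
    }


# Two teams deploy the same model; the routes differ only by path prefix.
team_a = http_route("team-a", "llama", "inference-gateway", "llama-svc", 8000)
team_b = http_route("team-b", "llama", "inference-gateway", "llama-svc", 8000)
```

With model-name routing both requests would carry the same model name in the body, so the gateway would have nothing to distinguish them by. With path prefixes, /team-a/llama/v1/chat/completions and /team-b/llama/v1/chat/completions match different routes.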

Labels: Routing, enhancement