Today, ModelService routes by weight across ModelEndpoints. When replicas of one deployment land on multiple clusters, successive turns of the same chat conversation can hit a different cluster's replica each time, so every turn starts with a cold KV cache. Turns 2-N pay full prefill cost (~800ms TTFT) when they could be hitting a warm cache (~150ms).
The fleet has the right information to fix this — Modelplane sees every ModelEndpoint across every cluster. It just doesn't bias requests toward cache locality today. This issue proposes adding routing affinity at the cluster granularity: same session or same prompt → same cluster. Per-replica KV-cache-aware routing inside each cluster is already llm-d's and Dynamo's job (per #65) and stays there.
What we want to get done
Two affinity modes on ModelService:
Sticky session. A header value routes the request to the same cluster every time. Useful when clients carry conversation IDs.
Prefix-hash. No header required from the client. A body-hasher computes a stable hash over the first N bytes of the prompt; the gateway consistent-hashes on that. Useful for OpenAI-API-compatible clients that don't pass session IDs.
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: kimi-k2
spec:
  endpoints:
    - selector:
        matchLabels:
          modelplane.ai/deployment: kimi-k2
  affinity:
    type: PrefixHash             # PrefixHash | Session | None
    prefixBytes: 1024            # bytes hashed for PrefixHash
    sessionHeader: X-Session-ID  # used for Session
    fallback: NextEndpoint
Cluster granularity, not replica granularity
The fleet gateway routes to clusters for locality. Within a cluster, the per-engine router (llm-d's inference scheduler for llm-d backends, Dynamo's frontend for Dynamo, per #65) handles the KV-cache-aware replica pick. We don't duplicate that work cross-cluster, we don't need cross-cluster cache-state tracking, and we don't tokenize at the fleet level.
The fleet gateway hashes a stable input and forwards. Same prompt body → same hash → same cluster → that cluster's router serves with a warm cache. If the cluster's cache is cold on the first request, it warms up naturally as subsequent matching requests land on the same cluster.
How it composes
The composition function for ModelService emits Envoy resources behind whatever InferenceGateway is configured:
Session mode. Emit a BackendTrafficPolicy with loadBalancer.consistentHash keyed on the configured header. Nothing else needed.
PrefixHash mode. Emit the same BackendTrafficPolicy (hashing on X-Prompt-Hash), plus an EnvoyExtensionPolicy referencing a small ext_proc service that computes the hash from the request body, plus a Deployment running the ext_proc.
The ext_proc is small: read the body, take the first N bytes, write an xxhash digest of them to a header, forward. It's stateless, content-agnostic, and engine-agnostic. It can be adapted from upstream Envoy filter examples or shipped as a tiny binary alongside Modelplane.
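The hashing step itself can be sketched in a few lines. This is a minimal illustration, not the proposed ext_proc: it leaves out the Envoy gRPC plumbing entirely, the function names are hypothetical, and it uses blake2b from the standard library as a stand-in for xxhash (an assumption made only to keep the example dependency-free).

```python
import hashlib

PROMPT_HASH_HEADER = "X-Prompt-Hash"  # header name taken from the proposal


def prompt_hash(body: bytes, prefix_bytes: int = 1024) -> str:
    """Hash the first `prefix_bytes` raw bytes of the request body.

    blake2b is a stdlib stand-in for xxhash (assumption); any stable,
    fast hash would serve the same purpose here.
    """
    prefix = body[:prefix_bytes]
    return hashlib.blake2b(prefix, digest_size=8).hexdigest()


def add_hash_header(
    headers: dict[str, str], body: bytes, prefix_bytes: int = 1024
) -> dict[str, str]:
    """Return a copy of `headers` with the prompt-hash header set."""
    out = dict(headers)
    out[PROMPT_HASH_HEADER] = prompt_hash(body, prefix_bytes)
    return out


# Two bodies that share their first 1024 bytes get the same header value.
shared = b"s" * 1024
assert prompt_hash(shared + b"turn 2") == prompt_hash(shared + b"turn 3")
```

Because the hash covers raw bytes, two requests share a value only when their first N bytes are byte-identical. In this sketch a body shorter than `prefix_bytes` is hashed in full, so its hash changes as the body grows.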
What we explicitly don't build
Tokenizer-aware fleet gateway. Cluster routers tokenize properly using the model's vocab. We hash raw bytes — cluster-granularity locality is enough.
Cross-cluster KV transfer. Shipping gigabytes of KV cache over the WAN costs more time than the prefill it would save. Move requests, not state.
Replica-granularity routing from the fleet. llm-d and Dynamo do this within their clusters with real cache-state visibility. The fleet would be guessing.
A Modelplane-shipped consistent-hash library. Envoy ring_hash is the library — upstream, well-tested, already in any Envoy Gateway / Istio install.
Failure modes
Backend cluster unhealthy. Envoy outlier-detection ejects it; the hash ring rebalances; affected prefixes redistribute to other clusters and rebuild cache there. One cache miss per affected prefix.
Fleet scales (cluster added or removed). Consistent hashing with virtual nodes means only ~K/N prefixes shift (K prefixes, N clusters), i.e. roughly a 1/N fraction. Bounded disruption.
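The bounded-disruption claim can be checked with a toy ring. The sketch below is an illustrative consistent-hash ring with virtual nodes, not Envoy's ring_hash implementation; the cluster names, the vnode count, and the blake2b hash are all assumptions chosen for the example.

```python
import bisect
import hashlib


def _h(key: str) -> int:
    """Stable 64-bit hash of a string."""
    digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, clusters: list[str], vnodes: int = 200) -> None:
        # Each cluster contributes `vnodes` points spread around the ring.
        self._points = sorted(
            (_h(f"{c}#{i}"), c) for c in clusters for i in range(vnodes)
        )
        self._hashes = [p for p, _ in self._points]

    def pick(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._hashes, _h(key)) % len(self._points)
        return self._points[idx][1]


def moved_fraction(before: HashRing, after: HashRing, keys: list[str]) -> float:
    """Fraction of keys whose cluster differs between two rings."""
    return sum(before.pick(k) != after.pick(k) for k in keys) / len(keys)


# Hypothetical cluster names; keys stand in for prompt-prefix hashes.
keys = [f"prefix-{i}" for i in range(10_000)]
three = HashRing(["us-east", "us-west", "eu-central"])
four = HashRing(["us-east", "us-west", "eu-central", "ap-south"])
print(f"moved when adding a 4th cluster: {moved_fraction(three, four, keys):.1%}")
```

Going from three to four clusters should move roughly a quarter of the keys, and every key that moves lands on the new cluster; the rest keep their warm cache.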
Hot key. A heavily-shared system prompt lands on one cluster that gets hammered while peers idle. Envoy supports a hybrid policy: consistent hash for first-pick, fall through to least-loaded if the picked cluster is over capacity. Configurable via the existing BackendTrafficPolicy shape.
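The hybrid policy described for hot keys can be sketched as follows. This only illustrates the decision rule (hash-preferred cluster first, least-loaded when that cluster is over capacity); it does not model how Envoy implements or exposes it. The function names, the in-flight and capacity maps, and the modulo stand-in for a hash ring are all assumptions.

```python
import hashlib


def preferred(key: str, clusters: list[str]) -> str:
    """Hash-preferred cluster for a key (modulo stand-in for a hash ring)."""
    digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
    return clusters[int.from_bytes(digest, "big") % len(clusters)]


def pick_with_fallback(
    key: str, in_flight: dict[str, int], capacity: dict[str, int]
) -> str:
    """Prefer cache locality, but shed load when the preferred cluster is full.

    Assumes every cluster has a positive capacity.
    """
    clusters = sorted(in_flight)
    first = preferred(key, clusters)
    if in_flight[first] < capacity[first]:
        return first
    # Over capacity: give up locality and take the least-loaded cluster.
    return min(clusters, key=lambda c: in_flight[c] / capacity[c])


# Hypothetical fleet of three clusters with equal capacity.
capacity = {"a": 10, "b": 10, "c": 10}
idle = {"a": 0, "b": 0, "c": 0}
hot = pick_with_fallback("shared-system-prompt", idle, capacity)
saturated = dict(idle, **{hot: 10})
print(hot, "->", pick_with_fallback("shared-system-prompt", saturated, capacity))
```

Requests that fall through pay a cache miss on the fallback cluster, which is the cost of relieving the hot one.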
References
Companion to #68 (scheduler integration) and #70 (capacity signal). Same delegation pattern — Modelplane composes generic Envoy / Gateway-API shapes, and per-engine and per-scheduler details stay contained to the layer where they belong.