Skip to content

Add routing affinity to ModelService for fleet-level cache locality #71

Description

@dennis-upbound

ModelService routes weighted across ModelEndpoints today. With replicas of one deployment landing on multiple clusters, the same multi-turn chat conversation can end up on a different cluster's replica each turn — cold KV cache every time. Turns 2-N pay full prefill cost (~800ms TTFT) when they could be hitting a warm cache (~150ms).

The fleet has the right information to fix this — Modelplane sees every ModelEndpoint across every cluster. It just doesn't bias requests toward cache locality today. This issue proposes adding routing affinity at the cluster granularity: same session or same prompt → same cluster. Per-replica KV-cache-aware routing inside each cluster is already llm-d's and Dynamo's job (per #65) and stays there.

What we want to get done

Two affinity modes on ModelService:

Sticky session. A header value routes the request to the same cluster every time. Useful when clients carry conversation IDs.

Prefix-hash. No header required from the client. A body-hasher computes a stable hash over the first N bytes of the prompt; the gateway consistent-hashes on that. Useful for OpenAI-API-compatible clients that don't pass session IDs.

apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: kimi-k2
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: kimi-k2
  affinity:
    type: PrefixHash         # PrefixHash | Session | None
    prefixBytes: 1024        # bytes hashed for PrefixHash
    sessionHeader: X-Session-ID   # used for Session
    fallback: NextEndpoint

Cluster granularity, not replica granularity

The fleet gateway routes to clusters for locality. Within a cluster, the per-engine router (llm-d's inference scheduler for llm-d backends, Dynamo's frontend for Dynamo, per #65) handles the KV-cache-aware replica pick. We don't duplicate that work cross-cluster, we don't need cross-cluster cache-state tracking, and we don't tokenize at the fleet level.

The fleet gateway hashes a stable input and forwards. Same prompt body → same hash → same cluster → that cluster's router serves with a warm cache. If the cluster's cache is cold on the first request, it builds up naturally as subsequent matching requests reinforce.

How it composes

The composition function for ModelService emits Envoy resources behind whatever InferenceGateway is configured:

Session mode. Emit a BackendTrafficPolicy with loadBalancer.consistentHash keyed on the configured header. Nothing else needed.

PrefixHash mode. Emit the same BackendTrafficPolicy (hashing on X-Prompt-Hash), plus an EnvoyExtensionPolicy referencing a small ext_proc service that computes the hash from the request body, plus a Deployment running the ext_proc.

The ext_proc is small (read body, take first N bytes, xxhash to a header, forward). It's stateless, content-agnostic, engine-agnostic. Pull from upstream Envoy filter examples or ship a tiny binary alongside Modelplane.

What we explicitly don't build

Tokenizer-aware fleet gateway. Cluster routers tokenize properly using the model's vocab. We hash raw bytes — cluster-granularity locality is enough.

Cross-cluster KV transfer. GB of KV over WAN doesn't pay back inference time. Move requests, not state.

Replica-granularity routing from the fleet. llm-d and Dynamo do this within their clusters with real cache-state visibility. The fleet would be guessing.

A Modelplane-shipped consistent-hash library. Envoy ring_hash is the library — upstream, well-tested, already in any Envoy Gateway / Istio install.

Failure modes

Backend cluster unhealthy. Envoy outlier-detection ejects it; the hash ring rebalances; affected prefixes redistribute to other clusters and rebuild cache there. One cache miss per affected prefix.

Fleet scales (cluster added or removed). Consistent hashing with virtual nodes means only ~K/N of prefixes shift. Bounded disruption.

Hot key. A heavily-shared system prompt lands on one cluster that gets hammered while peers idle. Envoy supports a hybrid policy: consistent hash for first-pick, fall through to least-loaded if the picked cluster is over capacity. Configurable via the existing BackendTrafficPolicy shape.

References

Companion to #68 (scheduler integration) and #70 (capacity signal). Same delegation pattern — Modelplane composes generic Envoy / Gateway-API shapes, and per-engine and per-scheduler details stay contained to the layer where they belong.

Metadata

Metadata

Assignees

No one assigned

    Labels

    RoutingRouting component

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions