Add routing affinity to ModelService for fleet-level cache locality

`ModelService` currently does weighted routing across `ModelEndpoint`s. When replicas of one deployment land on multiple clusters, consecutive turns of the same chat conversation can reach a different cluster's replica each time, so every turn starts with a cold KV cache. Turns 2 through N pay the full prefill cost (~800ms TTFT) when they could be hitting a warm cache (~150ms).

The fleet has the right information to fix this — Modelplane sees every `ModelEndpoint` across every cluster. It just doesn't bias requests toward cache locality today. This issue proposes adding routing affinity at the cluster granularity: same session or same prompt → same cluster. Per-replica KV-cache-aware routing inside each cluster is already llm-d's and Dynamo's job (per #65) and stays there.

## What we want to get done

Two affinity modes on `ModelService`:

**Sticky session.** A header value routes the request to the same cluster every time. Useful when clients carry conversation IDs.

**Prefix-hash.** No header required from the client. A body-hasher computes a stable hash over the first N bytes of the prompt (N is `prefixBytes`); the gateway consistent-hashes on that. Useful for OpenAI-API-compatible clients that don't pass session IDs.

```yaml
apiVersion: modelplane.ai/v1alpha1
kind: ModelService
metadata:
  name: kimi-k2
spec:
  endpoints:
  - selector:
      matchLabels:
        modelplane.ai/deployment: kimi-k2
  affinity:
    type: PrefixHash         # PrefixHash | Session | None
    prefixBytes: 1024        # bytes hashed for PrefixHash
    sessionHeader: X-Session-ID   # used for Session
    fallback: NextEndpoint
```

## Cluster granularity, not replica granularity

The fleet gateway routes to clusters for locality. Within a cluster, the per-engine router (llm-d's inference scheduler for llm-d backends, Dynamo's frontend for Dynamo, per #65) handles the KV-cache-aware replica pick. We don't duplicate that work cross-cluster, we don't need cross-cluster cache-state tracking, and we don't tokenize at the fleet level.

The fleet gateway hashes a stable input and forwards. Same prompt body → same hash → same cluster → that cluster's router serves with a warm cache. If the cluster's cache is cold on the first request, it warms up as subsequent matching requests land on the same cluster.

## How it composes

The composition function for `ModelService` emits Envoy resources behind whatever InferenceGateway is configured:

**Session mode.** Emit a `BackendTrafficPolicy` with `loadBalancer.consistentHash` keyed on the configured header. Nothing else needed.

**PrefixHash mode.** Emit the same `BackendTrafficPolicy` (hashing on `X-Prompt-Hash`), plus an `EnvoyExtensionPolicy` referencing a small ext_proc service that computes the hash from the request body, plus a `Deployment` running the ext_proc.
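The two modes above can be sketched as a single composition function. This is a minimal illustration only: the dictionaries are abbreviated placeholders rather than the real Envoy Gateway schemas, and the `compose_affinity` function and the `prompt-hasher` name are hypothetical.

```python
def compose_affinity(affinity: dict) -> list[dict]:
    """Return the (abbreviated) resources to emit for a ModelService affinity block."""
    mode = affinity.get("type", "None")
    if mode == "None":
        return []
    if mode not in ("Session", "PrefixHash"):
        raise ValueError(f"unknown affinity type: {mode}")

    # Session hashes on the client-supplied header; PrefixHash hashes on the
    # header that the ext_proc service writes.
    header = affinity["sessionHeader"] if mode == "Session" else "X-Prompt-Hash"
    resources = [{
        "kind": "BackendTrafficPolicy",
        # Placeholder layout, not the full upstream schema.
        "spec": {"loadBalancer": {"consistentHash": {"header": header}}},
    }]
    if mode == "PrefixHash":
        # PrefixHash additionally needs the body hasher and its wiring.
        resources.append({"kind": "EnvoyExtensionPolicy",
                          "spec": {"extProc": "prompt-hasher"}})
        resources.append({"kind": "Deployment",
                          "metadata": {"name": "prompt-hasher"}})
    return resources
```

Session mode yields one resource and PrefixHash yields three, which matches the split described above.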

The ext_proc is small (read body, take first N bytes, xxhash to a header, forward). It's stateless, content-agnostic, engine-agnostic. Pull from upstream Envoy filter examples or ship a tiny binary alongside Modelplane.
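As a rough sketch of the hashing step only (not the ext_proc gRPC plumbing), the logic could look like the following. It assumes the body may arrive in chunks, uses `hashlib.blake2b` from the standard library as a stand-in for xxhash, and the function name `prompt_hash_header` is hypothetical.

```python
import hashlib
from typing import Iterable


def prompt_hash_header(chunks: Iterable[bytes], prefix_bytes: int = 1024) -> dict[str, str]:
    """Hash at most `prefix_bytes` of the request body and return the header to set."""
    # Stand-in hash: the proposal names xxhash; any stable hash works for this sketch.
    hasher = hashlib.blake2b(digest_size=8)
    remaining = prefix_bytes
    for chunk in chunks:
        if remaining <= 0:
            break  # enough bytes seen; the rest of the body is not inspected
        piece = chunk[:remaining]
        hasher.update(piece)
        remaining -= len(piece)
    return {"X-Prompt-Hash": hasher.hexdigest()}
```

Because only the first `prefix_bytes` bytes feed the hash, two requests that share a system prompt and early turns map to the same header value regardless of how the body is chunked or what follows the prefix.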

## What we explicitly don't build

**Tokenizer-aware fleet gateway.** Cluster routers tokenize properly using the model's vocab. We hash raw bytes; cluster-granularity locality is enough.

**Cross-cluster KV transfer.** Moving gigabytes of KV cache over the WAN costs more time than the prefill it would save. Move requests, not state.

**Replica-granularity routing from the fleet.** llm-d and Dynamo do this within their clusters with real cache-state visibility. The fleet would be guessing.

**A Modelplane-shipped consistent-hash library.** Envoy `ring_hash` is the library: upstream, well-tested, already in any Envoy Gateway / Istio install.

## Failure modes

**Backend cluster unhealthy.** Envoy outlier-detection ejects it; the hash ring rebalances; affected prefixes redistribute to other clusters and rebuild cache there. One cache miss per affected prefix.

**Fleet scales (cluster added or removed).** Consistent hashing with virtual nodes means only about K/N of prefixes shift, where K is the number of distinct prefixes and N is the number of clusters. Bounded disruption.
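The bounded-disruption behavior in the two failure modes above can be illustrated with a toy hash ring. This is a sketch of the general technique, not Envoy's `ring_hash` implementation; the `Ring` class, the virtual-node count, and the use of `blake2b` are assumptions made for the example.

```python
import bisect
import hashlib


def _point(label: str) -> int:
    """Map a string to a position on the ring."""
    return int.from_bytes(hashlib.blake2b(label.encode(), digest_size=8).digest(), "big")


class Ring:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, clusters: list[str], vnodes: int = 100) -> None:
        # Each cluster owns `vnodes` points scattered around the ring.
        self._points = sorted(
            (_point(f"{cluster}#{i}"), cluster)
            for cluster in clusters
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def pick(self, key: str) -> str:
        """Return the cluster owning the first point clockwise from the key."""
        idx = bisect.bisect_right(self._hashes, _point(key)) % len(self._points)
        return self._points[idx][1]
```

Comparing `Ring(["a", "b", "c"])` with `Ring(["a", "b", "c", "d"])` over a set of keys shows that every key that moves ends up on the new cluster, and removing a cluster moves only the keys that cluster owned. All other keys keep their warm cache.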

**Hot key.** A heavily-shared system prompt lands on one cluster that gets hammered while peers idle. Envoy supports a hybrid policy: consistent hash for first-pick, fall through to least-loaded if the picked cluster is over capacity. Configurable via the existing `BackendTrafficPolicy` shape.
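The hybrid idea can be sketched independently of Envoy as follows. The function name, the in-flight request counters, and the per-cluster capacity map are all hypothetical inputs chosen for illustration; how Envoy measures load and picks the overflow target may differ.

```python
def pick_with_overflow(preferred: str,
                       in_flight: dict[str, int],
                       capacity: dict[str, int]) -> str:
    """Keep the consistent-hash pick unless it is at capacity; then spill over."""
    if in_flight[preferred] < capacity[preferred]:
        return preferred  # locality wins while there is headroom

    # Spill to the least-utilized cluster that still has headroom.
    candidates = [c for c in in_flight if in_flight[c] < capacity[c]]
    if not candidates:
        return preferred  # everything is saturated; keep locality
    return min(candidates, key=lambda c: in_flight[c] / capacity[c])
```

The trade-off is that a spilled request pays a cold-cache prefill on the overflow cluster, which is usually cheaper than queueing behind a saturated one.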

## References

Companion to #68 (scheduler integration) and #70 (capacity signal). Same delegation pattern — Modelplane composes generic Envoy / Gateway-API shapes, and per-engine and per-scheduler details stay contained to the layer where they belong.

