Add KVOffloadTier for fleet-managed KV cache offload backends

Once a deployment moves past single-pod intra-replica caching, the engine needs an offload tier — LMCache, Mooncake Store, NIXL multi-tier, SGLang HiCache backends, or Dynamo KVBM. All of these have the same operational shape: a DaemonSet per node + an optional control plane (etcd, Master service, Redis) + node-local NVMe and/or RDMA. All of them are platform-team concerns, not per-deployment knobs.

We need a fleet primitive that lets the platform team declare a KV offload backend once per InferenceCluster, and lets ML teams reference it by name from their deployment. Same pattern as ModelCache (#66) for weights, but for runtime state.

## Why this is fleet-level

**The backend is a node-level DaemonSet, not a pod-level concern.** Mooncake wants etcd + Master + per-node `mooncake_store` agents with RDMA NICs and hugepages. LMCache with NVMe wants per-node disk and an optional Redis cluster. NIXL needs RDMA device plugins and side channels. No ML team should be writing the DaemonSets themselves.

**Multiple deployments share the tier.** A single Mooncake/LMCache backend serves all the deployments on a cluster — the cache is bigger and the hit rate is higher when the working set is shared. Per-deployment backends defeat the purpose.

**Backend choice is hardware-coupled, not deployment-coupled.** Mooncake assumes RDMA. LMCache+3FS assumes a high-bandwidth shared filesystem. KVBM is NVIDIA-coupled. The platform team knows what the cluster has; the deployment shouldn't care.

**Configuration is environment-specific.** etcd endpoints, Redis hostnames, RDMA NIC names — these are cluster facts. Burying them in `engine.args` makes deployments non-portable.

## Sketch

```yaml
apiVersion: modelplane.ai/v1alpha1
kind: KVOffloadTier
metadata:
  name: lmcache-h200
  namespace: modelplane-system
spec:
  inferenceClusterRef:
    name: gke-h200-prod
  backend: LMCache
  lmcache:
    cpuTierGiBPerNode: 64
    diskTier:
      storageClassName: nvme-local
      sizePerNode: 1Ti
    remote:
      redis:
        host: kv-redis.modelplane-system
        port: 6379
        secretRef:
          name: kv-redis-auth
```

```yaml
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: llama-70b
spec:
  engine:
    name: vLLM
    image: vllm/vllm-openai:v0.11
    args:
    - "--model=meta-llama/Llama-3.3-70B-Instruct"
    kvCache:
      tierRef:
        name: lmcache-h200
```

The deployment names the tier; the composition function reads the KVOffloadTier and emits the per-pod env (`LMCACHE_USE_EXPERIMENTAL`, `LMCACHE_CONFIG_FILE`, ...), the ConfigMap mount holding the connector YAML, the Secret refs, and the right `--kv-transfer-config` in `engine.args`. No user-typed wiring.
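A minimal sketch of that wiring step, written as plain Python over dicts that mirror the two manifests above. It is not the actual composition function: `wire_engine`, `CONFIG_MOUNT_PATH`, the `KV_REDIS_PASSWORD` env var, and the `password` Secret key are hypothetical names, and the connector name and role in the `--kv-transfer-config` JSON are assumed from vLLM's LMCache integration and should be checked against the pinned vLLM version.

```python
import json

# Hypothetical mount path for the connector ConfigMap; not part of the proposal.
CONFIG_MOUNT_PATH = "/etc/kv-offload/lmcache.yaml"

TIER = {
    "metadata": {"name": "lmcache-h200"},
    "spec": {
        "backend": "LMCache",
        "lmcache": {
            "remote": {"redis": {"host": "kv-redis.modelplane-system", "port": 6379,
                                 "secretRef": {"name": "kv-redis-auth"}}},
        },
    },
}

DEPLOYMENT = {
    "metadata": {"name": "llama-70b"},
    "spec": {"engine": {
        "name": "vLLM",
        "args": ["--model=meta-llama/Llama-3.3-70B-Instruct"],
        "kvCache": {"tierRef": {"name": "lmcache-h200"}},
    }},
}


def wire_engine(deployment: dict, tiers: dict) -> dict:
    """Return the env and args to inject into every engine pod."""
    engine = deployment["spec"]["engine"]
    tier = tiers[engine["kvCache"]["tierRef"]["name"]]  # KeyError if undeclared

    if tier["spec"]["backend"] != "LMCache":
        raise NotImplementedError("this sketch only covers the LMCache backend")

    # Assumed connector settings; verify against the vLLM release in use.
    kv_transfer = {"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}

    env = [
        {"name": "LMCACHE_USE_EXPERIMENTAL", "value": "True"},
        {"name": "LMCACHE_CONFIG_FILE", "value": CONFIG_MOUNT_PATH},
    ]
    redis = tier["spec"]["lmcache"].get("remote", {}).get("redis")
    if redis and "secretRef" in redis:
        # Env var name and Secret key are hypothetical.
        env.append({
            "name": "KV_REDIS_PASSWORD",
            "valueFrom": {"secretKeyRef": {"name": redis["secretRef"]["name"],
                                           "key": "password"}},
        })

    # Copy the user's args so the deployment spec itself is never mutated.
    args = list(engine.get("args", []))
    args.append("--kv-transfer-config=" + json.dumps(kv_transfer, separators=(",", ":")))
    return {"env": env, "args": args}


wiring = wire_engine(DEPLOYMENT, {"lmcache-h200": TIER})
```

The user-supplied `--model` arg passes through untouched and the injected flag is appended last, so the deployment manifest stays free of backend-specific wiring.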

## What composes

Per KVOffloadTier on the target cluster:

1. DaemonSet running the backend agent (lmcache server / `mooncake_store` / nixl agent)
2. Service for the agent's gRPC/REST port if needed
3. Optional control plane (Mooncake Master Deployment, etcd, Redis if the tier provisions it)
4. ConfigMap holding the per-engine connector YAML

Per ModelDeployment referencing it:

1. Env vars on every engine pod
2. Volume mounts for the ConfigMap and any Secret
3. The right `--kv-transfer-config` in `engine.args` (auto-injected)
4. RDMA `ResourceClaim` if the tier requires RDMA — paired with the DRA-alignment direction in #56

## Backend variants in scope for v0.3

- `LMCache` — in-process connector; per-node CPU + optional NVMe + optional Redis remote
- `Mooncake` — needs etcd + Master + per-node agents; RDMA-only
- `Nixl` — peer-to-peer transport; needs RDMA + side channels (no central store)
- `HiCache` — SGLang-specific; configures L3 backend via one of the above

Dynamo KVBM is its own composition path per #65 — Dynamo workers manage their own tier internally and don't go through KVOffloadTier.
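One way to encode the variant list so the controller can decide what to compose (for example, whether to emit an RDMA `ResourceClaim`). This is a sketch under assumptions: `BackendProfile`, `BACKENDS`, `resolve`, and the `l3_backend` parameter are hypothetical names, and the per-backend values simply restate the bullets above.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class BackendProfile:
    needs_rdma: bool
    control_plane: Tuple[str, ...]  # components the backend cannot run without
    engine: Optional[str] = None    # set when the backend is tied to one engine


BACKENDS = {
    # Redis remote and NVMe are optional for LMCache, so neither is required here.
    "LMCache": BackendProfile(needs_rdma=False, control_plane=()),
    "Mooncake": BackendProfile(needs_rdma=True, control_plane=("etcd", "Master")),
    # Peer-to-peer transport: RDMA and side channels, no central store.
    "Nixl": BackendProfile(needs_rdma=True, control_plane=()),
}


def resolve(backend: str, l3_backend: Optional[str] = None) -> BackendProfile:
    """Look up a backend; HiCache inherits the needs of the L3 backend it wraps."""
    if backend == "HiCache":
        if l3_backend not in BACKENDS:
            raise ValueError("HiCache needs an l3_backend from: " + ", ".join(BACKENDS))
        return replace(BACKENDS[l3_backend], engine="SGLang")
    return BACKENDS[backend]  # KeyError for anything out of scope, e.g. KVBM


profile = resolve("HiCache", l3_backend="Mooncake")
```

Keeping KVBM out of the table means a tier that names it fails fast instead of silently composing nothing.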

## What we explicitly don't do

**Cross-cluster KV transfer.** The bandwidth math doesn't close for dense-attention models. A 32K-token context on Llama 70B is ~10 GB of KV cache; at 100 Gbps that is ~800 ms on the wire plus RTT, versus ~300–600 ms to just re-prefill. Every frontier lab routes requests to where the cache lives instead. Fleet-level locality is request routing (#71), not state federation. Each cluster has its own tier.
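The KV size and wire time quoted above can be reproduced with a short calculation. The sketch assumes the published Llama 3 70B shape (80 layers, 8 KV heads under GQA, head dimension 128) and an fp16 cache; it ignores RTT, protocol overhead, and any KV quantization.

```python
def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    """Size of the KV cache for one sequence; the leading 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens


def transfer_seconds(num_bytes, link_gbps):
    """Time to push num_bytes over a link at full line rate, with no overhead."""
    return num_bytes * 8 / (link_gbps * 1e9)


kv = kv_cache_bytes(32 * 1024)   # 32K-token context
wire = transfer_seconds(kv, 100)  # 100 Gbps inter-cluster link

print(f"{kv / 2**30:.1f} GiB, {wire * 1000:.0f} ms on the wire")
# prints "10.0 GiB, 859 ms on the wire"
```

Even at perfect line rate, the transfer alone exceeds the upper end of the ~300–600 ms re-prefill estimate before RTT is added.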

**Replace ModelCache (#66).** Different artifact (runtime activations vs static bytes), different lifecycle (write-many continuously evicted vs write-once read-many), different storage substrate (node-local NVMe + RDMA vs RWX PVC). Keeping them as separate primitives matches the controller patterns.

**Engine-internal caching.** vLLM `--enable-prefix-caching`, SGLang radix tree, TRT-LLM block manager — all engine-internal. KVOffloadTier covers the cross-pod / cluster tier that engines reach out to.

## References

- #66 ModelCache — static weight staging, orthogonal primitive
- #71 ModelService routing affinity — cluster-locality routing, paired primitive
- #65 Drop KServe → llm-d/Dynamo — dropping KServe means we own the storage-initializer + tier wiring KServe partially absorbed
- #34 Disaggregated prefill/decode — NIXL transport is in scope here
- #56 DRA alignment — KVOffloadTier emits RDMA ResourceClaims when the backend needs them

