Add KVOffloadTier for fleet-managed KV cache offload backends #72

Description

@dennis-upbound

Once a deployment moves past single-pod intra-replica caching, the engine needs an offload tier — LMCache, Mooncake Store, NIXL multi-tier, SGLang HiCache backends, or Dynamo KVBM. All of these have the same operational shape: a DaemonSet per node + an optional control plane (etcd, Master service, Redis) + node-local NVMe and/or RDMA. All of them are platform-team concerns, not per-deployment knobs.

We need a fleet primitive that lets the platform team declare a KV offload backend once per InferenceCluster, and lets ML teams reference it by name from their deployment. Same pattern as ModelCache (#66) for weights, but for runtime state.

Why this is fleet-level

The backend is a node-level DaemonSet, not a pod-level concern. Mooncake wants etcd + Master + per-node mooncake_store agents with RDMA NICs and hugepages. LMCache with NVMe wants per-node disk and an optional Redis cluster. NIXL needs RDMA device plugins and side channels. No ML team should be writing the DaemonSets themselves.

Multiple deployments share the tier. A single Mooncake/LMCache backend serves all the deployments on a cluster — the cache is bigger and the hit rate is higher when the working set is shared. Per-deployment backends defeat the purpose.

Backend choice is hardware-coupled, not deployment-coupled. Mooncake assumes RDMA. LMCache+3FS assumes a high-bandwidth shared filesystem. KVBM is NVIDIA-coupled. The platform team knows what the cluster has; the deployment shouldn't care.

Configuration is environment-specific. etcd endpoints, Redis hostnames, RDMA NIC names — these are cluster facts. Burying them in engine.args makes deployments non-portable.

Sketch

apiVersion: modelplane.ai/v1alpha1
kind: KVOffloadTier
metadata:
  name: lmcache-h200
  namespace: modelplane-system
spec:
  inferenceClusterRef:
    name: gke-h200-prod
  backend: LMCache
  lmcache:
    cpuTierGiBPerNode: 64
    diskTier:
      storageClassName: nvme-local
      sizePerNode: 1Ti
    remote:
      redis:
        host: kv-redis.modelplane-system
        port: 6379
        secretRef:
          name: kv-redis-auth
---
apiVersion: modelplane.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: llama-70b
spec:
  engine:
    name: vLLM
    image: vllm/vllm-openai:v0.11
    args:
    - "--model=meta-llama/Llama-3.3-70B-Instruct"
    kvCache:
      tierRef:
        name: lmcache-h200

The deployment names the tier; the composition function reads the KVOffloadTier and emits the per-pod env (LMCACHE_USE_EXPERIMENTAL, LMCACHE_CONFIG_FILE, ...), the ConfigMap mount holding the connector YAML, the Secret refs, and the right --kv-transfer-config in engine.args. No user-typed wiring.
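To make the wiring concrete, here is a minimal Python sketch of what the composition step could derive for the LMCache case. It is not the actual composition function: the mount path, the ConfigMap naming scheme, the KV_REDIS_PASSWORD variable, and the Secret key are hypothetical, and the connector name in the --kv-transfer-config payload is an assumption about the vLLM/LMCache integration that should be checked against the engine version in use.

```python
import json


def compose_engine_wiring(tier: dict, deployment: dict) -> dict:
    """Derive per-pod env, volumes, mounts, and engine args from a KVOffloadTier.

    Sketch only: handles the LMCache backend and nothing else.
    """
    spec = tier["spec"]
    if spec["backend"] != "LMCache":
        raise NotImplementedError(f"backend {spec['backend']!r} is not covered by this sketch")

    tier_name = tier["metadata"]["name"]
    config_dir = "/etc/kv-offload"  # hypothetical mount path
    env = [
        {"name": "LMCACHE_USE_EXPERIMENTAL", "value": "True"},
        {"name": "LMCACHE_CONFIG_FILE", "value": f"{config_dir}/lmcache.yaml"},
    ]
    # Hypothetical naming scheme: one connector ConfigMap per tier.
    volumes = [{"name": "kv-offload-config", "configMap": {"name": f"{tier_name}-connector"}}]
    mounts = [{"name": "kv-offload-config", "mountPath": config_dir, "readOnly": True}]

    redis = spec.get("lmcache", {}).get("remote", {}).get("redis")
    if redis and "secretRef" in redis:
        env.append({
            "name": "KV_REDIS_PASSWORD",  # hypothetical variable name
            "valueFrom": {"secretKeyRef": {"name": redis["secretRef"]["name"],
                                           "key": "password"}},  # hypothetical key
        })

    # Assumed connector payload; verify against the engine's documentation.
    transfer = {"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}
    args = list(deployment["spec"]["engine"].get("args", []))
    args.append("--kv-transfer-config=" + json.dumps(transfer))
    return {"env": env, "volumes": volumes, "volumeMounts": mounts, "args": args}


tier = {
    "metadata": {"name": "lmcache-h200"},
    "spec": {
        "backend": "LMCache",
        "lmcache": {"remote": {"redis": {"host": "kv-redis.modelplane-system",
                                         "port": 6379,
                                         "secretRef": {"name": "kv-redis-auth"}}}},
    },
}
deployment = {"spec": {"engine": {"args": ["--model=meta-llama/Llama-3.3-70B-Instruct"]}}}
wiring = compose_engine_wiring(tier, deployment)
```

The user-supplied args are copied and extended rather than mutated, so the ModelDeployment spec stays exactly as the ML team wrote it.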

What composes

Per KVOffloadTier on the target cluster:

  1. DaemonSet running the backend agent (lmcache server / mooncake_store / nixl agent)
  2. Service for the agent's gRPC/REST port if needed
  3. Optional control plane (Mooncake Master Deployment, etcd, Redis if the tier provisions it)
  4. ConfigMap holding the per-engine connector YAML
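
As a rough illustration of the per-tier list above, the sketch below maps a KVOffloadTier to the set of resources a composition might emit. The resource names, the choice of StatefulSet for etcd and Redis, the decision to emit a Service only for Mooncake, and the `provision` field are all hypothetical; the issue does not specify them.

```python
def plan_tier_resources(tier: dict) -> list[str]:
    """Return "Kind/name" entries for the resources one tier would compose.

    Sketch only: kinds and names are illustrative assumptions.
    """
    name = tier["metadata"]["name"]
    spec = tier["spec"]
    backend = spec["backend"]

    # Every backend gets a node agent and a connector ConfigMap.
    plan = [f"DaemonSet/{name}-agent", f"ConfigMap/{name}-connector"]

    if backend == "Mooncake":
        # Mooncake needs etcd + Master per the issue; the kinds are assumed.
        plan += [
            f"Service/{name}-agent",
            f"Deployment/{name}-master",
            f"StatefulSet/{name}-etcd",
        ]
    elif backend == "LMCache":
        redis = spec.get("lmcache", {}).get("remote", {}).get("redis", {})
        # Hypothetical field: provision Redis only when the tier asks for it,
        # otherwise the tier just points at an existing host.
        if redis.get("provision"):
            plan.append(f"StatefulSet/{name}-redis")
    return plan
```

With the lmcache-h200 tier from the sketch, which references an existing Redis host, this yields only the DaemonSet and the ConfigMap.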

Per ModelDeployment referencing it:

  1. Env vars on every engine pod
  2. Volume mounts for the ConfigMap and any Secret
  3. The right --kv-transfer-config in engine.args (auto-injected)
  4. RDMA ResourceClaim if the tier requires RDMA, paired with the DRA-alignment direction in #56 (Align hardware capabilities design with Kubernetes Dynamic Resource Allocation)

Backend variants in scope for v0.3

  • LMCache — in-process connector; per-node CPU + optional NVMe + optional Redis remote
  • Mooncake — needs etcd + Master + per-node agents; RDMA-only
  • NIXL — peer-to-peer transport; needs RDMA + side channels (no central store)
  • HiCache — SGLang-specific; configures L3 backend via one of the above

Dynamo KVBM is its own composition path per #65 — Dynamo workers manage their own tier internally and don't go through KVOffloadTier.
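
Because the in-scope backends differ mainly in what they require from the hardware, admission for a KVOffloadTier could start as a simple capability check against the target cluster. The sketch below assumes hypothetical capability labels (such as "rdma") and a hypothetical `l3_backend` parameter for HiCache; only the hard requirements stated in the list above are encoded.

```python
from typing import Optional

# Hard requirements per backend, taken from the variants list above.
# Optional features (NVMe, Redis) are deliberately left out.
BACKEND_REQUIRES = {
    "LMCache": frozenset(),
    "Mooncake": frozenset({"rdma"}),
    "NIXL": frozenset({"rdma"}),
}


def missing_capabilities(backend: str, cluster_caps: set,
                         l3_backend: Optional[str] = None) -> set:
    """Return the capabilities the cluster lacks for the requested backend."""
    if backend == "HiCache":
        # HiCache configures its L3 via one of the other backends,
        # so it inherits that backend's requirements.
        if l3_backend is None:
            raise ValueError("HiCache needs an L3 backend")
        backend = l3_backend
    if backend not in BACKEND_REQUIRES:
        raise ValueError(f"unknown backend: {backend!r}")
    return set(BACKEND_REQUIRES[backend]) - set(cluster_caps)


print(missing_capabilities("Mooncake", {"nvme"}))          # {'rdma'}
print(missing_capabilities("HiCache", {"rdma"}, "NIXL"))   # set()
```

A non-empty result would be a reason to reject the tier at admission time rather than let the DaemonSet fail on the nodes.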

What we explicitly don't do

Cross-cluster KV transfer. Bandwidth math doesn't close for dense-attention models. A 32K-context Llama 70B request holds ~10 GB of KV cache (FP16, GQA); at 100 Gbps that's ~800 ms of transfer plus RTT, vs ~300–600 ms to just re-prefill. Every frontier lab routes requests to where the cache lives instead. Fleet-level locality is request routing (#71), not state federation. Each cluster has its own tier.
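
The ~10 GB and ~800 ms figures can be reproduced with a back-of-the-envelope calculation. The sketch below assumes the published Llama 3 70B shape (80 layers, 8 KV heads under GQA, head dimension 128), FP16 storage, and an ideal link with no protocol overhead; the re-prefill estimate is the issue's own number and is not modeled here.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """Size of the KV cache for one sequence: keys + values across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


def transfer_seconds(num_bytes: int, link_gbps: float) -> float:
    """Time to move num_bytes over an ideal link of link_gbps gigabits/second."""
    return num_bytes * 8 / (link_gbps * 1e9)


size = kv_cache_bytes(tokens=32 * 1024, layers=80, kv_heads=8, head_dim=128)
secs = transfer_seconds(size, link_gbps=100)
print(f"{size / 1e9:.1f} GB, {secs * 1000:.0f} ms")  # 10.7 GB, 859 ms
```

With exactly 32,768 tokens the result lands slightly above the rounded figures quoted above, and real links add RTT and framing overhead on top.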

Replace ModelCache (#66). Different artifact (runtime activations vs static bytes), different lifecycle (write-many continuously evicted vs write-once read-many), different storage substrate (node-local NVMe + RDMA vs RWX PVC). Keeping them as separate primitives matches the controller patterns.

Engine-internal caching. vLLM --enable-prefix-caching, SGLang radix tree, TRT-LLM block manager — all engine-internal. KVOffloadTier covers the cross-pod / cluster tier that engines reach out to.
