Once a deployment moves past single-pod intra-replica caching, the engine needs an offload tier — LMCache, Mooncake Store, NIXL multi-tier, SGLang HiCache backends, or Dynamo KVBM. All of these have the same operational shape: a DaemonSet per node + an optional control plane (etcd, Master service, Redis) + node-local NVMe and/or RDMA. All of them are platform-team concerns, not per-deployment knobs.
We need a fleet primitive that lets the platform team declare a KV offload backend once per InferenceCluster, and lets ML teams reference it by name from their deployment. Same pattern as ModelCache (#66) for weights, but for runtime state.
Why this is fleet-level
The backend is a node-level DaemonSet, not a pod-level concern. Mooncake wants etcd + Master + per-node mooncake_store agents with RDMA NICs and hugepages. LMCache with NVMe wants per-node disk and an optional Redis cluster. NIXL needs RDMA device plugins and side channels. No ML team should be writing the DaemonSets themselves.
Multiple deployments share the tier. A single Mooncake/LMCache backend serves all the deployments on a cluster — the cache is bigger and the hit rate is higher when the working set is shared. Per-deployment backends defeat the purpose.
Backend choice is hardware-coupled, not deployment-coupled. Mooncake assumes RDMA. LMCache+3FS assumes a high-bandwidth shared filesystem. KVBM is NVIDIA-coupled. The platform team knows what the cluster has; the deployment shouldn't care.
Configuration is environment-specific. etcd endpoints, Redis hostnames, RDMA NIC names — these are cluster facts. Burying them in engine.args makes deployments non-portable.
Sketch
The deployment names the tier; the composition function reads the KVOffloadTier and emits the per-pod env (LMCACHE_USE_EXPERIMENTAL, LMCACHE_CONFIG_FILE, ...), the ConfigMap mount holding the connector YAML, the Secret refs, and the right --kv-transfer-config in engine.args. No user-typed wiring.
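A minimal Python sketch of that composition step, to make the shape concrete. The KVOffloadTier fields, the mount path, the connector.yaml file name, the env value "True", and the kv_connector / kv_role values are all assumptions for illustration; only the env var names and the --kv-transfer-config flag come from the text above.

```python
import json
from dataclasses import dataclass, field


@dataclass
class KVOffloadTier:
    """Hypothetical, simplified view of a tier declared by the platform team."""
    name: str
    connector: str                      # assumed value, e.g. "LMCacheConnectorV1"
    config_map: str                     # ConfigMap holding the connector YAML
    secret_refs: list[str] = field(default_factory=list)


def compose_pod_wiring(tier: KVOffloadTier, engine_args: list[str]) -> dict:
    """Return the env, volumes, mounts and args one engine pod would receive."""
    # Refuse user-typed wiring so the tier stays the single source of truth.
    if any(a.startswith("--kv-transfer-config") for a in engine_args):
        raise ValueError("--kv-transfer-config is injected from the KVOffloadTier")

    mount_path = f"/etc/kv-offload/{tier.name}"   # assumed path convention
    env = {
        "LMCACHE_USE_EXPERIMENTAL": "True",       # assumed value
        "LMCACHE_CONFIG_FILE": f"{mount_path}/connector.yaml",
    }
    volumes = [{"name": "kv-offload-config",
                "configMap": {"name": tier.config_map}}]
    mounts = [{"name": "kv-offload-config",
               "mountPath": mount_path, "readOnly": True}]
    env_from = [{"secretRef": {"name": s}} for s in tier.secret_refs]
    # The flag takes a JSON document; the keys shown here are assumptions.
    transfer = json.dumps({"kv_connector": tier.connector, "kv_role": "kv_both"})

    return {
        "env": env,
        "envFrom": env_from,
        "volumes": volumes,
        "volumeMounts": mounts,
        "args": [*engine_args, "--kv-transfer-config", transfer],
    }
```

The deployment side only supplies the tier name and its own engine args; everything cluster-specific is read from the tier.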
What composes
Per KVOffloadTier on the target cluster:
DaemonSet running the backend agent (lmcache server / mooncake_store / nixl agent)
Service for the agent's gRPC/REST port if needed
Optional control plane (Mooncake Master Deployment, etcd, Redis if the tier provisions it)
ConfigMap holding the per-engine connector YAML
Per ModelDeployment referencing it:
Env vars on every engine pod
Volume mounts for the ConfigMap and any Secret
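The right --kv-transfer-config in engine.args (auto-injected)
ResourceClaim if the tier requires RDMA, paired with the DRA-alignment direction in #56 (Align hardware capabilities design with Kubernetes Dynamic Resource Allocation)
Backend variants in scope for v0.3
LMCache: in-process connector; per-node CPU + optional NVMe + optional Redis remote
Mooncake: needs etcd + Master + per-node agents; RDMA-only
Nixl: peer-to-peer transport; needs RDMA + side channels (no central store)
HiCache: SGLang-specific; configures L3 backend via one of the above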
Dynamo KVBM is its own composition path per #65 — Dynamo workers manage their own tier internally and don't go through KVOffloadTier.
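One way the per-tier composition could check hardware coupling before emitting anything, sketched in Python. The requirement table restates what this issue says about each backend; the function name, field names, and resource labels are hypothetical, and HiCache is left out because it delegates to one of the other backends.

```python
# Requirements as described in this issue; field names are hypothetical.
BACKENDS = {
    "LMCache":  {"needs_rdma": False, "control_plane": []},   # Redis is optional
    "Mooncake": {"needs_rdma": True,  "control_plane": ["etcd", "mooncake-master"]},
    "Nixl":     {"needs_rdma": True,  "control_plane": []},   # no central store
}


def plan_tier(backend: str, cluster_has_rdma: bool,
              provision_redis: bool = False) -> list[str]:
    """List the resources to compose for one tier, or fail on a hardware mismatch."""
    if backend not in BACKENDS:
        # Dynamo KVBM, for example, is composed on its own path.
        raise ValueError(f"{backend} is not composed through KVOffloadTier")
    spec = BACKENDS[backend]
    if spec["needs_rdma"] and not cluster_has_rdma:
        raise ValueError(f"{backend} requires RDMA, which this cluster lacks")

    slug = backend.lower()
    resources = [f"DaemonSet/{slug}-agent", f"ConfigMap/{slug}-connector"]
    resources += [f"ControlPlane/{c}" for c in spec["control_plane"]]
    if backend == "LMCache" and provision_redis:
        resources.append("ControlPlane/redis")
    return resources
```

Failing at composition time keeps a hardware mismatch from surfacing later as a crash-looping DaemonSet.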
What we explicitly don't do
Cross-cluster KV transfer. Bandwidth math doesn't close for dense-attention models. 32K-context Llama 70B = ~10 GB KV; at 100 Gbps that's 800ms+RTT vs ~300–600ms to just re-prefill. Every frontier lab routes requests to where the cache lives instead. Fleet-level locality is request routing (#71), not state federation. Each cluster has its own tier.
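The bandwidth figure above can be reproduced with a few lines of Python. The model shape is an assumption (80 layers, 8 KV heads under grouped-query attention, head dimension 128, 2-byte values), and the transfer time counts only wire time at full line rate, ignoring RTT and protocol overhead.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int) -> int:
    """KV cache size for one sequence; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def transfer_ms(num_bytes: int, link_gbps: float) -> float:
    """Wire time in milliseconds at full line rate."""
    return num_bytes * 8 / (link_gbps * 1e9) * 1e3


# Assumed Llama 70B shape with fp16 KV values and a 32K-token context.
size = kv_cache_bytes(tokens=32 * 1024, layers=80, kv_heads=8,
                      head_dim=128, bytes_per_value=2)
print(f"{size / 1e9:.1f} GB, {transfer_ms(size, 100):.0f} ms at 100 Gbps")
# prints "10.7 GB, 859 ms at 100 Gbps"
```

Under these assumptions the transfer alone already exceeds the ~300–600ms re-prefill estimate, before any RTT is added.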
Replace ModelCache (#66). Different artifact (runtime activations vs static bytes), different lifecycle (write-many continuously evicted vs write-once read-many), different storage substrate (node-local NVMe + RDMA vs RWX PVC). Keeping them as separate primitives matches the controller patterns.
Engine-internal caching. vLLM --enable-prefix-caching, SGLang radix tree, TRT-LLM block manager — all engine-internal. KVOffloadTier covers the cross-pod / cluster tier that engines reach out to.
References