Design lives in design/modelcache/ (branch dennis/modelcache-design). This issue tracks v0.1 implementation.
v0.1 scope
PVC backend, multi-node ready, no dedup. From the design doc:
- ModelCache CRD with artifact discriminator (Weights, Tokenizer, Bytes)
- Sources: huggingFace, s3, http, inline, configMap
- PVC (RWX) storage backend with one-shot prefetch Job (absorbs #61)
- replication: AllMatchingClusters (one RWX PVC per matching cluster, shared across all LWS gang pods)
- clusterSelector.matchLabels for cluster filtering
- Mount path intrinsic to the cache; deployments reference by name via caches: [{ name }]
- Scheduling gated on per-cluster cache Ready condition
- Fail-fast when a target cluster has no RWX storage class on InferenceCluster.spec.storage.storageClassName
- Pluggable storage backend pattern shared with #72 KVOffloadTier
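
As a rough illustration of how the selection, fail-fast, and scheduling-gate items above could fit together, here is a minimal Python sketch. It is not the controller implementation. The `Cluster` dataclass, the `supports_rwx` flag (standing in for however RWX capability is actually detected), and the function names are all hypothetical; only `matchLabels`, the `Ready` condition, and `storageClassName` come from this issue.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Cluster:
    """Hypothetical stand-in for an InferenceCluster."""
    name: str
    labels: Dict[str, str] = field(default_factory=dict)
    # Mirrors InferenceCluster.spec.storage.storageClassName (None when unset).
    storage_class_name: Optional[str] = None
    # Assumption: whether that storage class supports RWX is known up front.
    supports_rwx: bool = False


class NoRWXStorageClass(Exception):
    """Raised when a target cluster cannot host a shared RWX PVC."""


def matching_clusters(clusters: List[Cluster], match_labels: Dict[str, str]) -> List[Cluster]:
    # matchLabels semantics: every requested label must match exactly.
    # An empty selector matches every cluster.
    return [
        c for c in clusters
        if all(c.labels.get(k) == v for k, v in match_labels.items())
    ]


def plan_targets(clusters: List[Cluster], match_labels: Dict[str, str]) -> List[str]:
    """AllMatchingClusters: one RWX PVC per matching cluster, or fail fast."""
    targets = matching_clusters(clusters, match_labels)
    for c in targets:
        if not c.storage_class_name or not c.supports_rwx:
            raise NoRWXStorageClass(
                f"cluster {c.name!r} has no RWX storage class configured"
            )
    return [c.name for c in targets]


def schedulable(cache_names: List[str],
                ready: Dict[str, Dict[str, bool]],
                cluster_name: str) -> bool:
    """Gate scheduling: every referenced cache must be Ready on this cluster.

    `ready` maps cache name -> cluster name -> Ready condition status.
    """
    return all(ready.get(n, {}).get(cluster_name, False) for n in cache_names)
```

In this sketch the fail-fast check runs before any PVC is planned, so a single misconfigured target cluster rejects the whole plan rather than leaving a partial rollout.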
Out of scope (tracked separately)
- LoraAdapter / Engine artifact kinds → v0.2
- ContentAddressed backend (Modal-style tiered cache + lazy loading) → v0.2
- Cross-deployment / cross-tenant dedup → v0.2
- gcs / azure / oci / pvc-clone sources → v0.2
- AllMatchingNodes replication mode → v0.2
- Substrate unification #72 → v0.3
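
One way a v0.1 admission check might separate supported values from the deferred ones listed above is sketched below. The value sets are taken from this issue. The flat `(kind, source, replication)` signature, the function name, and the error wording are assumptions for illustration, not the CRD schema from the design doc.

```python
# Values accepted in v0.1, as listed under "v0.1 scope".
V01 = {
    "kind": {"Weights", "Tokenizer", "Bytes"},
    "source": {"huggingFace", "s3", "http", "inline", "configMap"},
    "replication": {"AllMatchingClusters"},
}

# Values named in this issue but deferred to v0.2.
DEFERRED = {
    "kind": {"LoraAdapter", "Engine"},
    "source": {"gcs", "azure", "oci", "pvc-clone"},
    "replication": {"AllMatchingNodes"},
}


def validate_v01(kind, source, replication):
    """Return a list of error strings; an empty list means the spec is accepted."""
    errors = []
    fields = (("kind", kind), ("source", source), ("replication", replication))
    for field_name, value in fields:
        if value in V01[field_name]:
            continue
        if value in DEFERRED[field_name]:
            # Known value, but out of scope for this release.
            errors.append(f"{field_name} {value!r} is deferred to v0.2")
        else:
            errors.append(f"unknown {field_name} {value!r}")
    return errors
```

Reporting deferred values separately from unknown ones gives users a clearer message when they try a v0.2 preview example against a v0.1 build.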
Roadmap detail in the design doc § v0.2 and § v0.3.
Examples
Nine (ModelCache + ModelDeployment) examples in design/modelcache/examples/: single-cluster basic, multi-node TensorPipeline gang, multi-cluster replication, separate tokenizer, private S3, opaque Bytes kind, plus three v0.2 previews.
References
- Design doc
- Examples
- #61 (closed): RWX PVC mechanism
- #72
- PR #75: engine.env + imagePullSecrets; ModelCache rides on those for credential-bearing sources