Pre-stage model weights with a ModelCache primitive by dennis-upbound · Pull Request #112 · modelplaneai/modelplane

dennis-upbound · 2026-06-08T17:52:39Z

Fixes #66.

Serving pods fetched model weights from HuggingFace at every boot. For frontier models that's tens to hundreds of GB pulled on every pod start, and in a multi-node LeaderWorkerSet gang each pod pulls independently — so an 8-pod gang downloads the same 600GB eight times before it can serve. There was no way to stage weights once and share them.

This adds ModelCache, a v0.1 primitive that pre-stages a HuggingFace model onto a per-cluster ReadWriteMany PVC via a one-shot hydration Job, so serving pods mount the weights instead of downloading them. A deployment opts in with spec.modelCacheRef; the cache PVC mounts at /mnt/models on every serving pod, shared across the whole gang, and the weights are downloaded once per cluster and read N times.

The compose-model-cache function fans a ModelCache out to every matched InferenceCluster: a RWX PVC sized from the source, plus a hydration Job that runs hf download into it. Hydration is re-run safe via a completion marker, so an interrupted download resumes rather than serving truncated weights. Per-cluster phase (Pending/Hydrating/Ready/Failed) and an x/y ready summary are reported on the cache's status.

modelCacheRef now threads from ModelDeployment through to each ModelReplica, and the native and llm-d backends mount the cache. The mount is engine-agnostic — it lands on the native Deployment pod and on both the leader and worker templates of an llm-d gang. --model=/mnt/models is injected only for the turnkey vLLM path; a bring-your-own engine like SGLang sets its own --model-path, so injection is skipped when the engine supplies its own command.

On GKE the modelplane-rwx storage class self-provisions: compose-gke-cluster enables the Filestore API and compose-inference-cluster composes a VPC-pinned Filestore StorageClass. On EKS the cache works against an admin-provided EFS StorageClass (modelplane-rwx-efs); auto-provisioning EFS is a separate follow-up.

Before, a deployment had to bake the model into the engine args and re-pull every boot:

spec:
  workers:
    template:
      spec:
        containers:
        - name: engine
          args: ["--model=Qwen/Qwen3-0.6B"]   # fetched from HF on every pod start

After, the cache is staged once and mounted:

spec:
  modelCacheRef:
    name: qwen
  workers:
    template:
      spec:
        containers:
        - name: engine
          args: ["--model=/mnt/models"]       # read from the mounted PVC

Scope is locked to the HuggingFace source plus a Modelplane-managed PVC, matching the merged XRD.

How it was validated so far: unit tests for every changed function (the cache function's PVC/Job/status/conditions, the deployment→replica propagation, and the backend mounts including the vLLM-inject vs SGLang-skip split). Live cluster validation is pending.

I have:

- Read and followed Modelplane's contribution process.
- ~~Run nix flake check and made sure it passes.~~ Per-function unit tests pass and ruff lint/format is clean locally; the full sandboxed nix flake check runs in CI on this PR.
- Added or updated tests covering the composition function changes.
- Signed off every commit with git commit -s.

negz

Thanks @dennis-upbound!

really like the PR description. I found it way easier to read. Did you get an agent to generate it using the new guidance in CONTRIBUTING.md?

The ModelCache XRD merged with a definition but no composition or composition function, so applying a ModelCache produced an XR that never reconciled into anything. Add the compose-model-cache function package (mirroring the compose-model-replica layout), a Pipeline composition under apis/modelcaches that references it, and the function tarball entry in crossplane-project.yaml. The function carries the full Composer skeleton: it parses the XR, guards against an unset source, and calls through a fixed pipeline of stubs that later tasks replace with real cluster matching, PVC/Job composition, and status reporting. The REMOTE_NS / PVC-naming constants and comments document the cross-function contract with the serving backends. Towards #66. Signed-off-by: Dennis Ramdass <dennis@upbound.io>

The skeleton Composer matched no clusters and composed nothing: its resolve_inputs/match_clusters/compose_cluster_resources were no-op stubs, so a ModelCache never staged its weights anywhere. Resolve the InferenceCluster required-resource set (gated on the requirement key's presence, since get_required_resources returns [] for both the unresolved and resolved-empty cases), keep the clusters that have finished provisioning (providerConfigRef set), and emit a ReadWriteMany PVC per matched cluster wrapped in a provider-kubernetes Object pointed at that cluster's ClusterProviderConfig. The PVC is named modelcache-<namespace>-<name> (truncated to 63) so caches of the same name from different Modelplane namespaces don't collide in the workload cluster's default namespace, matching the name the serving backends will compute. Its storage class comes from the cluster's per-source cache block, falling back to the source's XRD default (GKE -> modelplane-rwx, EKS -> modelplane-rwx-efs) since Pydantic doesn't apply the nested default when the cache block is omitted entirely. Resources are always emitted for a matched cluster, never gated on readiness: omitting an Object tells Crossplane to delete it, which would re-trigger hydration on every dependency flap. The hydration Job's manifest is a placeholder here and gets its real HuggingFace download in the next change. Towards #66. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
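The matching, naming, and storage-class fallback described in this commit can be sketched as follows. This is a minimal illustration, not the function's actual code: the flat `dict` shape of a cluster and the helper names `match_clusters`, `storage_class_for`, and `pvc_name` are assumptions made for the example. Only the `modelcache-<namespace>-<name>` scheme, the 63-character limit, and the two default class names come from the commit message.

```python
# Source-level defaults named in the commit message.
DEFAULT_STORAGE_CLASS = {"GKE": "modelplane-rwx", "EKS": "modelplane-rwx-efs"}


def match_clusters(clusters: list[dict]) -> list[dict]:
    """Keep only clusters that have finished provisioning (providerConfigRef set).

    The flat dict layout here is hypothetical, chosen to keep the sketch short.
    """
    return [c for c in clusters if c.get("providerConfigRef")]


def storage_class_for(cluster: dict) -> str:
    """Use the cluster's cache block if present, else the default for its source.

    The explicit fallback mirrors the note that the nested default is not
    applied when the cache block is omitted entirely.
    """
    cache = cluster.get("cache") or {}
    return cache.get("storageClassName") or DEFAULT_STORAGE_CLASS[cluster["source"]]


def pvc_name(namespace: str, name: str) -> str:
    """Namespace-qualified PVC name, truncated to the 63-character limit."""
    return f"modelcache-{namespace}-{name}"[:63]
```

Qualifying the name with the Modelplane namespace is what keeps two caches called `qwen` in different namespaces from colliding once both land in the workload cluster's default namespace.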

Serving pods re-pulled model weights from HuggingFace at boot, and the ModelCache function carried only a placeholder Job manifest, so the RWX PVC composed per cluster was never populated. This composes the real one-shot hydration Job: it pip-installs huggingface_hub and runs `hf download <repo>[ --revision X] --local-dir /mnt/artifact` into the cache PVC, wiring HF_TOKEN from the optional authSecret. Idempotency uses a completion marker (.modelplane-hydrated) touched only after a successful download under `set -e`, and the Job skips when the marker is present. Keying on the marker rather than directory emptiness makes re-runs safe: an interrupted pull leaves files but no marker, so a retry resumes (hf download is resumable) instead of falsely concluding the cache is complete and serving truncated weights. It also sidesteps the Filestore lost+found directory that broke a bare emptiness check. Uses `hf download`, not the removed huggingface-cli (dropped in huggingface-hub 1.x), which previously killed the Job at install. Towards #66. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
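A rough sketch of how such a marker-keyed hydration script could be assembled is shown below. The function name `hydration_script` and the exact shell wording are assumptions for illustration. The `hf download` invocation, the `/mnt/artifact` directory, the `.modelplane-hydrated` marker, and `set -e` are taken from the commit message. `HF_TOKEN` wiring from the optional authSecret is left out, since it would live in the Job's container env rather than in the script.

```python
import shlex
from typing import Optional

ARTIFACT_DIR = "/mnt/artifact"
MARKER = f"{ARTIFACT_DIR}/.modelplane-hydrated"


def hydration_script(repo: str, revision: Optional[str] = None) -> str:
    """Build the shell script a one-shot hydration Job might run (hypothetical helper)."""
    download = ["hf", "download", repo]
    if revision:
        download += ["--revision", revision]
    download += ["--local-dir", ARTIFACT_DIR]
    return "\n".join(
        [
            # Abort on the first failing command so the marker is never
            # written after a partial download.
            "set -e",
            # Skip when a previous run completed; leftover files without a
            # marker fall through and let the download resume.
            f'if [ -f "{MARKER}" ]; then echo "already hydrated"; exit 0; fi',
            "pip install huggingface_hub",
            shlex.join(download),
            # Reached only if every command above succeeded.
            f'touch "{MARKER}"',
        ]
    )
```

Because the marker is the last step under `set -e`, a directory that contains files but no marker is always treated as incomplete, which also makes Filestore's `lost+found` directory irrelevant to the check.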

The function composed the PVC and hydration Job per cluster but never reported whether the artifact had actually staged: status, conditions, and the composite's readiness flag were no-op stubs, so a ModelCache showed no phase and downstream waiters had nothing to gate on. Derive each cluster's phase from the remote PVC/Job status the provider echoes back under Object.status.atProvider.manifest.status — PVC Bound plus Job succeeded is Ready, PVC Bound alone is Hydrating, a failed Job condition is Failed, otherwise Pending. Write a per-cluster status with an "x/y" ready summary, set ArtifactReady (and the composite ready flag) only when every matched cluster is Ready, and mark the PVC/Job Objects ready after compose so update() doesn't reset the flag. Emit one-time transition events on first compose and on reaching all-ready to keep describe output quiet. Towards #66. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
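The phase rules in this commit can be illustrated with a small sketch. The function signatures below are assumptions: they take the remote PVC and Job `status` dicts directly rather than unwrapping them from `Object.status.atProvider.manifest.status`. Checking the failed condition first, and treating an empty cluster set as not ready, are also assumptions, since the commit message does not state either.

```python
def derive_cluster_phase(pvc_status: dict, job_status: dict) -> str:
    """Map remote PVC/Job status onto Pending/Hydrating/Ready/Failed."""
    job_failed = any(
        c.get("type") == "Failed" and c.get("status") == "True"
        for c in job_status.get("conditions", [])
    )
    if job_failed:  # assumed precedence: a failed Job wins over everything else
        return "Failed"
    if pvc_status.get("phase") != "Bound":
        return "Pending"
    if (job_status.get("succeeded") or 0) >= 1:
        return "Ready"
    return "Hydrating"  # PVC is Bound but the Job has not succeeded yet


def ready_summary(phases: list[str]) -> str:
    """Render the "x/y" ready summary across matched clusters."""
    ready = sum(1 for p in phases if p == "Ready")
    return f"{ready}/{len(phases)}"


def all_ready(phases: list[str]) -> bool:
    """ArtifactReady holds only when every matched cluster is Ready."""
    return bool(phases) and all(p == "Ready" for p in phases)
```

For two clusters where one is `Ready` and one is still `Hydrating`, `ready_summary` yields `1/2` and `all_ready` is false, which corresponds to the partial case the later review commit adds tests for.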

Phase 1 review found two gaps. The mark_ready_resources docstring claimed it must run after resource.update() because update() resets the ready flag, but update() only writes the protobuf .resource field and never touches the sibling .ready field — the real ordering reason is that the desired entries must be composed first. The test suite also left two derive_cluster_phase/derive_conditions branches uncovered: a cluster whose hydration Job failed, and partial readiness across two clusters where one is Ready and one still Hydrating. Reword the comment to state the actual ordering reason and add tests for the Failed phase (ready 0/1, XR not ready) and the Partial case (ready 1/2, ArtifactReady False/Partial, XR not ready). Towards #66. Signed-off-by: Dennis Ramdass <dennis@upbound.io>

compose_replicas built each replica's SpecModel with only clusterName and workers, dropping the deployment's modelCacheRef. Replicas therefore never learned which cache to mount, so the backend had no way to know a ModelCache should back the workload. Thread modelCacheRef through: when the deployment sets spec.modelCacheRef, the composed replica's spec carries mrv1alpha1.ModelCacheRef(name=...). The ref is only emitted when set, so deployments without a cache compose unchanged. Towards #66. Signed-off-by: Dennis Ramdass <dennis@upbound.io>

With KServe dropped, nothing mounts a ModelCache PVC into the serving pods or points the engine at it, and vLLM falls back to serving facebook/opt-125m when no model path is given. The serving backends (native, llm-d) need a single, consistent way to wire a referenced cache into the engine. Add cache_pvc_name(), cache_mounts(), and apply_cache_args() to backends.base, plus the CACHE_MOUNT_PATH, _CACHE_VOLUME, and PVC_NAME_PREFIX constants. cache_pvc_name derives the workload PVC name as f"modelcache-{namespace}-{name}"[:63], identical to compose-model-cache's _pvc_name(), so serving pods mount the claim the cache actually created. cache_mounts returns the RWX PVC volume and a read-write mount at /mnt/models (engines write tokenizer/compile/lock artifacts, so a readOnly mount would hard-fail them). apply_cache_args injects --model=/mnt/models only for the turnkey vLLM path: it is skipped when no cache is referenced, when the engine brings its own command (a non-vLLM engine like SGLang owns its args and uses --model-path), or when the user already set --model. These helpers are additive; Tasks 7 and 8 wire them into the backends. Towards #66. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
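The mount and argument-injection rules above can be sketched roughly as follows. The real helpers take a replica object; the simplified parameters here (`claim_name`, `has_cache`, `has_command`) are assumptions to keep the example self-contained, and the volume name `model-cache` is taken from the native backend commit message.

```python
from typing import Optional

CACHE_MOUNT_PATH = "/mnt/models"
CACHE_VOLUME = "model-cache"


def cache_mounts(claim_name: Optional[str]) -> tuple[list[dict], list[dict]]:
    """Return (volumes, volumeMounts) for the cache PVC, or empty lists without a cache."""
    if not claim_name:
        return [], []
    volumes = [{"name": CACHE_VOLUME, "persistentVolumeClaim": {"claimName": claim_name}}]
    # No readOnly flag: engines write tokenizer/compile/lock artifacts next to the weights.
    mounts = [{"name": CACHE_VOLUME, "mountPath": CACHE_MOUNT_PATH}]
    return volumes, mounts


def apply_cache_args(args: list[str], *, has_cache: bool, has_command: bool) -> list[str]:
    """Inject --model=/mnt/models only on the turnkey vLLM path."""
    if not has_cache or has_command:
        # No cache referenced, or a bring-your-own engine that owns its args.
        return list(args)
    if any(a == "--model" or a.startswith("--model=") for a in args):
        # The user already chose a model; leave it alone.
        return list(args)
    return [*args, f"--model={CACHE_MOUNT_PATH}"]
```

Returning empty lists when no cache is referenced is what lets a backend splice the result in unconditionally while leaving uncached deployments unchanged.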

The native single-pod backend ignored a replica's modelCacheRef: the engine pod had no volume for the cache PVC and no mount, so a deployment asking to serve from a warmed cache would instead fetch weights from their source at startup (or fail to find them). Wire the shared cache helpers into NativeBackend.build: cache_mounts adds the model-cache volume and its /mnt/models mount when a cache is referenced (empty lists otherwise, so the no-cache path is byte-for-byte unchanged), and apply_cache_args fills in --model=/mnt/models only when the engine hasn't set it and has no command of its own — leaving a single-pod SGLang's --model-path intact. Towards #66. Signed-off-by: Dennis Ramdass <dennis@upbound.io>

The native backend mounts a referenced ModelCache, but the multi-node llm-d backend did not, so a gang-scheduled replica would start its LeaderWorkerSet pods with no /mnt/models and fall back to fetching weights from source on every node — defeating the cache and risking divergent shards across the gang. Thread base.cache_mounts(replica) through the inner container() and pod_spec() builders so the cache volume and /mnt/models mount land on both the leader and worker templates; every node of the gang loads its shard from the shared RWX PVC. For the turnkey vLLM bootstrap, also run base.apply_cache_args over the leader command's args so --model defaults to the mount when absent. Leave the bring-your-own command path (SGLang etc.) untouched: it sets --model-path itself, and apply_cache_args no-ops when the engine has a command, so injecting --model would only corrupt a verbatim user command. Towards #66. Signed-off-by: Dennis Ramdass <dennis@upbound.io>

Fresh GCP projects have file.googleapis.com disabled, so the Filestore CSI addon cannot provision the RWX volumes that ModelCache relies on: cache PVCs sit Pending and provisioning fails with SERVICE_DISABLED. Compose a ProjectService alongside the GKE networking that enables file.googleapis.com for the cluster's project, with disableOnDestroy false so tearing down a cluster does not disable the API for other workloads in the project. Track it in mark_readiness alongside the other managed resources. Towards #66. Signed-off-by: Dennis Ramdass <dennis@upbound.io>

A ModelCache PVC on GKE needs a ReadWriteMany StorageClass. The GKE Filestore CSI driver provisions RWX volumes, but it defaults to the `default` VPC, so a PVC on a Modelplane-provisioned cluster hangs Pending. The StorageClass must pin parameters.network to the cluster's own VPC. That network name can't be derived from the XR: compose-gke-cluster composes the VPC Network without a fixed name, so Crossplane gives it a provider-assigned suffix and the real GCP network is <name>-<suffix>. Pinning to the bare XR name fails with "network '<name>' does not exist" (verified live on GKE), which would defeat the StorageClass — it exists precisely to keep PVCs off the default VPC. compose-gke-cluster now reads the composed Network's external-name once observed and reports it on GKECluster.status.network.name. compose-inference-cluster reads that and composes the modelplane-rwx Filestore StorageClass pinned to it, gated on the name being known, so the class is only created once the real network is resolvable and always pins to it. The StorageClass has no Ready condition, so the provider-kubernetes Object uses the SuccessfulCreate readiness policy. Towards #66. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
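A minimal sketch of the gating described here: the StorageClass manifest is produced only once the real network name has been observed. The function name `filestore_storage_class` is hypothetical, the provisioner string is the GKE Filestore CSI driver's name as I understand it, and any further parameters the real class sets are omitted.

```python
from typing import Optional


def filestore_storage_class(network_name: Optional[str]) -> Optional[dict]:
    """Build the modelplane-rwx StorageClass pinned to the cluster's VPC.

    Returns None until the provider-assigned network name is known, so the
    class is never created pinned to a name that does not exist in GCP.
    """
    if not network_name:
        return None
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": "modelplane-rwx"},
        # Assumed provisioner name for the GKE Filestore CSI driver.
        "provisioner": "filestore.csi.storage.gke.io",
        # Pin to the observed network (<name>-<suffix>), not the bare XR name.
        "parameters": {"network": network_name},
    }
```

In the composition the returned manifest would be wrapped in a provider-kubernetes Object using the SuccessfulCreate readiness policy, since a StorageClass has no Ready condition to wait on.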

The examples directory had only the multi-node kimi-k2 cache example, whose header still warned that compose-model-cache had not shipped and that applying it would fail with "no composition found for kind ModelCache." The composition ships now, and there was no small, runnable cached example for the common single-pod case. The docs described the cache PVC but not that it is hydrated once by a Job, mounted read-write at /mnt/models across an LWS gang, or what the admin must provision per cloud for the RWX StorageClass. Add examples/cache/qwen-cached.yaml: a public Qwen3-0.6B ModelCache plus a single-pod ModelDeployment that sets --model=/mnt/models explicitly. Drop the stale caveat from kimi-k2.yaml. Expand the concepts.md ModelCache subsection to state the once-hydrated Job, the read-write shared mount, that the engine reads weights locally, and that an uncached deployment fetches at boot and must supply credentials (HF_TOKEN via engine.env). Document the per-cloud storage prerequisites: GKE auto-provisions modelplane-rwx Filestore, while EKS is bring-your-own (aws-efs-csi-driver add-on, EFS file system and mount targets, and a modelplane-rwx-efs StorageClass). Point getting-started.md at examples/cache/. Towards #66. Signed-off-by: Dennis Ramdass <dennis@upbound.io>

The new compose-model-cache function is a `functions/*` workspace member, so the workspace lock must include it. Scaffolding the function left uv.lock stale, which fails the offline `uv lock --locked` check (it can't resolve the workspace without network and reports the members unsatisfiable). Towards #66. Signed-off-by: Dennis Ramdass <dennis@upbound.io>

compose_gke deserialized the observed GKECluster into an opaque dict to read its network name, recomputed the ClusterProviderConfig name with child_name instead of reading the one it composes, and had two adjacent `if gke_ready and kubeconfig` blocks. Read both through their generated models, source the ProviderConfig name from the observed resource so it survives a naming change, and merge the duplicate blocks. The kubeconfig-secret local is renamed to stop reading as the ProviderConfig. Towards #66. Signed-off-by: Dennis Ramdass <dennis@upbound.io>

compose_replicas built the replica spec from a kwargs dict, which gave up the type checking of constructing SpecModel directly. Build the typed ModelReplica and set spec.modelCacheRef only when the ModelDeployment has one. Towards #66. Signed-off-by: Dennis Ramdass <dennis@upbound.io>

ModelCache inferred its source from which sub-object was set (spec.source.huggingFace). A required discriminator is clearer, simplifies the function, and can't be added later, so spec.source is now a required enum (HuggingFace) with sibling config objects (spec.huggingFace), matching the InferenceCluster.spec.cluster.source pattern. A CEL rule requires the matching object, which retires the function's runtime no-source guard.

Alongside, tighten the composition:

- Name the cache PVC and Job with resource.child_name (deterministic hash plus DNS-safe truncation) rather than a hand-rolled slice; the serving side (backends/base.cache_pvc_name) derives the same name.
- Give the PVC and Job Objects a DeriveFromCelQuery readiness so each derives its Ready condition from the wrapped resource instead of the function re-parsing status.
- Report a failed hydration on the ArtifactReady condition (reason Failed) rather than Hydrating.
- Read observed Objects through the generated Pydantic model.
- Show spec.clusterSelector in an example and document that omitting it stages the cache on every matched cluster.

Tests move to the table-of-Cases request/response golden pattern the other functions use.

Towards #66. Signed-off-by: Dennis Ramdass <dennis@upbound.io>
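To illustrate what the discriminator buys, the check the CEL rule performs at admission can be mirrored in plain Python. This is only an analogy: the real enforcement is a CEL expression on the XRD, and the `validate_source` helper, the `CONFIG_FIELD` table, and the `repo` field in the usage note are hypothetical.

```python
# Maps each allowed spec.source value to the sibling object it requires.
CONFIG_FIELD = {"HuggingFace": "huggingFace"}


def validate_source(spec: dict) -> None:
    """Raise ValueError unless spec.source is known and its sibling object is set."""
    source = spec.get("source")
    if source not in CONFIG_FIELD:
        raise ValueError(f"spec.source must be one of {sorted(CONFIG_FIELD)}, got {source!r}")
    field = CONFIG_FIELD[source]
    if not isinstance(spec.get(field), dict):
        raise ValueError(f"spec.{field} is required when spec.source is {source}")
```

A spec such as `{"source": "HuggingFace", "huggingFace": {"repo": "Qwen/Qwen3-0.6B"}}` passes, while one that sets `source` without the matching object is rejected before the function ever sees it, which is why the runtime no-source guard could be removed.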

compose-model-cache was scaffolded before the RUF lint rules landed, so its CLI kept a `# noqa:FBT001` that the other functions' entrypoints have since dropped (the directive never suppressed anything, and RUF100 now flags it). Towards #66. Signed-off-by: Dennis Ramdass <dennis@upbound.io>

dennis-upbound · 2026-06-09T14:59:24Z

> Thanks @dennis-upbound!
>
> really like the PR description. I found it way easier to read. Did you get an agent to generate it using the new guidance in CONTRIBUTING.md?

Yep! Thanks for setting up the guidance. I still had to clean up a little LLM fluff, but overall not bad.

The design doc still showed `spec.source` as an object holding `huggingFace`. The implemented API makes `source` a required enum discriminator with a sibling `huggingFace` object, enforced by a CEL rule. Update the example and the source description to match. Towards #66. Signed-off-by: Dennis Ramdass <dennis@upbound.io>

dennis-upbound marked this pull request as ready for review June 8, 2026 18:13

dennis-upbound requested a review from negz June 8, 2026 19:00

dennis-upbound force-pushed the dennis/modelcache-v01 branch 2 times, most recently from 296d41d to 1512599 Compare June 8, 2026 21:40

negz reviewed Jun 9, 2026

dennis-upbound added 17 commits June 9, 2026 07:54

dennis-upbound force-pushed the dennis/modelcache-v01 branch from c97312d to cdea5e1 Compare June 9, 2026 14:56

dennis-upbound requested a review from negz June 9, 2026 15:03

negz approved these changes Jun 9, 2026

negz merged commit f66a99f into main Jun 9, 2026
3 checks passed

negz deleted the dennis/modelcache-v01 branch June 16, 2026 16:57

negz merged 18 commits into main from dennis/modelcache-v01
