Pre-stage model weights with a ModelCache primitive#112

Merged
negz merged 18 commits into main from dennis/modelcache-v01 on Jun 9, 2026

@dennis-upbound dennis-upbound commented Jun 8, 2026

Fixes #66.

Serving pods fetched model weights from HuggingFace at every boot. For frontier models that's tens to hundreds of GB pulled on every pod start, and in a multi-node LeaderWorkerSet gang each pod pulls independently — so an 8-pod gang downloads the same 600GB eight times before it can serve. There was no way to stage weights once and share them.

This adds ModelCache, a v0.1 primitive that pre-stages a HuggingFace model onto a per-cluster ReadWriteMany PVC via a one-shot hydration Job, so serving pods mount the weights instead of downloading them. A deployment opts in with spec.modelCacheRef; the cache PVC mounts at /mnt/models on every serving pod, shared across the whole gang, and the weights are downloaded once per cluster and read N times.

The compose-model-cache function fans a ModelCache out to every matched InferenceCluster: a RWX PVC sized from the source, plus a hydration Job that runs hf download into it. Hydration is safe to re-run because it keys on a completion marker, so an interrupted download resumes rather than serving truncated weights. Per-cluster phase (Pending/Hydrating/Ready/Failed) and an x/y ready summary are reported on the cache's status.

modelCacheRef now threads from ModelDeployment through to each ModelReplica, and the native and llm-d backends mount the cache. The mount is engine-agnostic — it lands on the native Deployment pod and on both the leader and worker templates of an llm-d gang. --model=/mnt/models is injected only for the turnkey vLLM path; a bring-your-own engine like SGLang sets its own --model-path, so injection is skipped when the engine supplies its own command.

On GKE the modelplane-rwx storage class self-provisions: compose-gke-cluster enables the Filestore API and compose-inference-cluster composes a VPC-pinned Filestore StorageClass. On EKS the cache works against an admin-provided EFS StorageClass (modelplane-rwx-efs); auto-provisioning EFS is a separate follow-up.

Before, a deployment had to bake the model into the engine args and re-pull every boot:

spec:
  workers:
    template:
      spec:
        containers:
        - name: engine
          args: ["--model=Qwen/Qwen3-0.6B"]   # fetched from HF on every pod start

After, the cache is staged once and mounted:

spec:
  modelCacheRef:
    name: qwen
  workers:
    template:
      spec:
        containers:
        - name: engine
          args: ["--model=/mnt/models"]       # read from the mounted PVC

Scope is locked to the HuggingFace source plus a Modelplane-managed PVC, matching the merged XRD.

How it was validated so far: unit tests for every changed function (the cache function's PVC/Job/status/conditions, the deployment→replica propagation, and the backend mounts including the vLLM-inject vs SGLang-skip split). Live cluster validation is pending.

I have:

  • Read and followed Modelplane's contribution process.
  • Run nix flake check and made sure it passes. Per-function unit tests pass and ruff lint/format is clean locally; the full sandboxed nix flake check runs in CI on this PR.
  • Added or updated tests covering the composition function changes.
  • Signed off every commit with git commit -s.

@dennis-upbound dennis-upbound marked this pull request as ready for review June 8, 2026 18:13
@dennis-upbound dennis-upbound requested a review from negz June 8, 2026 19:00
@dennis-upbound dennis-upbound force-pushed the dennis/modelcache-v01 branch 2 times, most recently from 296d41d to 1512599 Compare June 8, 2026 21:40

@negz negz left a comment

Thanks @dennis-upbound!

I really like the PR description. I found it way easier to read. Did you get an agent to generate it using the new guidance in CONTRIBUTING.md?

Comment thread docs/concepts.md
Comment thread examples/cache/qwen-cached.yaml Outdated
Comment thread functions/compose-model-cache/function/fn.py Outdated
Comment thread functions/compose-model-cache/function/fn.py
Comment thread functions/compose-inference-cluster/function/fn.py Outdated
Comment thread functions/compose-model-cache/function/fn.py Outdated
Comment thread functions/compose-model-cache/function/fn.py Outdated
Comment thread functions/compose-model-cache/function/fn.py Outdated
Comment thread functions/compose-model-cache/function/fn.py
Comment thread functions/compose-model-cache/tests/test_fn.py
The ModelCache XRD merged with a definition but no composition or
composition function, so applying a ModelCache produced an XR that
never reconciled into anything.

Add the compose-model-cache function package (mirroring the
compose-model-replica layout), a Pipeline composition under
apis/modelcaches that references it, and the function tarball entry in
crossplane-project.yaml. The function carries the full Composer
skeleton: it parses the XR, guards against an unset source, and calls
through a fixed pipeline of stubs that later tasks replace with real
cluster matching, PVC/Job composition, and status reporting. The
REMOTE_NS / PVC-naming constants and comments document the cross-
function contract with the serving backends.

Towards #66.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
The skeleton Composer matched no clusters and composed nothing: its
resolve_inputs/match_clusters/compose_cluster_resources were no-op stubs,
so a ModelCache never staged its weights anywhere.

Resolve the InferenceCluster required-resource set (gated on the
requirement key's presence, since get_required_resources returns [] for
both the unresolved and resolved-empty cases), keep the clusters that have
finished provisioning (providerConfigRef set), and emit a ReadWriteMany
PVC per matched cluster wrapped in a provider-kubernetes Object pointed at
that cluster's ClusterProviderConfig. The PVC is named
modelcache-<namespace>-<name> (truncated to 63 characters) so caches of the same name
from different Modelplane namespaces don't collide in the workload
cluster's default namespace, matching the name the serving backends will
compute. Its storage class comes from the cluster's per-source cache block,
falling back to the source's XRD default (GKE -> modelplane-rwx,
EKS -> modelplane-rwx-efs) since Pydantic doesn't apply the nested default
when the cache block is omitted entirely.
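
The naming and fallback rules above can be sketched roughly as follows. This is a minimal illustration, not the function's actual code: the helper names, the `storageClassName` key, and the shape of the cache block are assumptions. Note that a later commit in this PR replaces the plain slice with resource.child_name.

```python
from typing import Optional

PVC_NAME_PREFIX = "modelcache"
MAX_NAME_LEN = 63  # Kubernetes DNS label length limit

# Per-source defaults as stated in the commit message.
DEFAULT_STORAGE_CLASS = {"GKE": "modelplane-rwx", "EKS": "modelplane-rwx-efs"}


def pvc_name(namespace: str, name: str) -> str:
    """Namespace-qualified PVC name, truncated to 63 characters."""
    return f"{PVC_NAME_PREFIX}-{namespace}-{name}"[:MAX_NAME_LEN]


def storage_class(source: str, cache_block: Optional[dict]) -> str:
    """Prefer the cluster's cache block; otherwise use the per-source default.

    `storageClassName` is a hypothetical key used only for this sketch.
    """
    configured = (cache_block or {}).get("storageClassName")
    return configured or DEFAULT_STORAGE_CLASS[source]
```

Qualifying the name with the Modelplane namespace is what keeps two caches called `qwen` from different namespaces apart once both land in the workload cluster's default namespace.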

Resources are always emitted for a matched cluster, never gated on
readiness: omitting an Object tells Crossplane to delete it, which would
re-trigger hydration on every dependency flap. The hydration Job's manifest
is a placeholder here and gets its real HuggingFace download in the next
change.

Towards #66.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
Serving pods re-pulled model weights from HuggingFace at boot, and the
ModelCache function carried only a placeholder Job manifest, so the RWX
PVC composed per cluster was never populated. This composes the real
one-shot hydration Job: it pip-installs huggingface_hub and runs
`hf download <repo>[ --revision X] --local-dir /mnt/artifact` into the
cache PVC, wiring HF_TOKEN from the optional authSecret.

Idempotency uses a completion marker (.modelplane-hydrated) touched only
after a successful download under `set -e`, and the Job skips when the
marker is present. Keying on the marker rather than directory emptiness
makes re-runs safe: an interrupted pull leaves files but no marker, so a
retry resumes (hf download is resumable) instead of falsely concluding
the cache is complete and serving truncated weights. It also sidesteps
the Filestore lost+found directory that broke a bare emptiness check.
Uses `hf download`, not the removed huggingface-cli (dropped in
huggingface-hub 1.x), which previously killed the Job at install.
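
The marker logic is easier to see in isolation. The sketch below restates it in Python for illustration only; the real Job is a shell script run under `set -e`, and `hydrate` and its `download` callback are hypothetical names.

```python
from pathlib import Path
from typing import Callable

MARKER = ".modelplane-hydrated"


def hydrate(cache_dir: Path, download: Callable[[Path], None]) -> bool:
    """Download into cache_dir unless the completion marker already exists.

    Returns True if a download ran, False if hydration was skipped.
    """
    marker = cache_dir / MARKER
    if marker.exists():
        return False  # a previous run finished; nothing to do
    # If download raises, the marker is never written, so the next run retries.
    download(cache_dir)
    marker.touch()  # written only after a successful download
    return True
```

Because the decision depends only on the marker, stray entries such as Filestore's lost+found directory or partially downloaded files do not make the cache look complete.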

Towards #66.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
The function composed the PVC and hydration Job per cluster but never
reported whether the artifact had actually staged: status, conditions,
and the composite's readiness flag were no-op stubs, so a ModelCache
showed no phase and downstream waiters had nothing to gate on.

Derive each cluster's phase from the remote PVC/Job status the provider
echoes back under Object.status.atProvider.manifest.status — PVC Bound
plus Job succeeded is Ready, PVC Bound alone is Hydrating, a failed Job
condition is Failed, otherwise Pending. Write a per-cluster status with
an "x/y" ready summary, set ArtifactReady (and the composite ready flag)
only when every matched cluster is Ready, and mark the PVC/Job Objects
ready after compose so update() doesn't reset the flag. Emit one-time
transition events on first compose and on reaching all-ready to keep
describe output quiet.
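
As a rough sketch of the phase rules described above: the function names and the plain-dict inputs are assumptions, and checking the Failed condition first is a guess at precedence that the commit message does not spell out.

```python
from typing import Any, Mapping, Sequence


def derive_phase(pvc_status: Mapping[str, Any], job_status: Mapping[str, Any]) -> str:
    """Map the echoed PVC and Job status onto a cache phase."""
    conditions = job_status.get("conditions") or []
    # Assumption: a failed Job wins over every other signal.
    if any(c.get("type") == "Failed" and c.get("status") == "True" for c in conditions):
        return "Failed"
    bound = pvc_status.get("phase") == "Bound"
    if bound and (job_status.get("succeeded") or 0) > 0:
        return "Ready"
    if bound:
        return "Hydrating"
    return "Pending"


def ready_summary(phases: Sequence[str]) -> str:
    """Render the "x/y" summary of Ready clusters over matched clusters."""
    return f"{sum(p == 'Ready' for p in phases)}/{len(phases)}"


def all_ready(phases: Sequence[str]) -> bool:
    """True only when every matched cluster is Ready.

    Assumption: zero matched clusters counts as not ready.
    """
    return bool(phases) and all(p == "Ready" for p in phases)
```

For example, one Ready cluster and one Hydrating cluster yield a summary of "1/2" and leave the composite not ready.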

Towards #66.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
Phase 1 review found two gaps. The mark_ready_resources docstring
claimed it must run after resource.update() because update() resets
the ready flag, but update() only writes the protobuf .resource field
and never touches the sibling .ready field — the real ordering reason
is that the desired entries must be composed first. The test suite also
left two derive_cluster_phase/derive_conditions branches uncovered: a
cluster whose hydration Job failed, and partial readiness across two
clusters where one is Ready and one still Hydrating.

Reword the comment to state the actual ordering reason and add tests
for the Failed phase (ready 0/1, XR not ready) and the Partial case
(ready 1/2, ArtifactReady False/Partial, XR not ready).

Towards #66.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
compose_replicas built each replica's SpecModel with only clusterName
and workers, dropping the deployment's modelCacheRef. Replicas therefore
never learned which cache to mount, so the backend had no way to know a
ModelCache should back the workload.

Thread modelCacheRef through: when the deployment sets spec.modelCacheRef,
the composed replica's spec carries mrv1alpha1.ModelCacheRef(name=...).
The ref is only emitted when set, so deployments without a cache compose
unchanged.
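
A minimal sketch of the propagation, using plain dataclasses as stand-ins for the generated models. `ReplicaSpec` and `replica_spec` are hypothetical names; the real code constructs SpecModel and mrv1alpha1.ModelCacheRef.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelCacheRef:
    name: str


@dataclass
class ReplicaSpec:
    cluster_name: str
    workers: dict = field(default_factory=dict)
    model_cache_ref: Optional[ModelCacheRef] = None


def replica_spec(cluster_name: str, workers: dict, cache_ref_name: Optional[str]) -> ReplicaSpec:
    """Build a replica spec, carrying the cache ref only when the deployment sets one."""
    spec = ReplicaSpec(cluster_name=cluster_name, workers=workers)
    if cache_ref_name:
        spec.model_cache_ref = ModelCacheRef(name=cache_ref_name)
    return spec
```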

Towards #66.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
With KServe dropped, nothing mounts a ModelCache PVC into the serving
pods or points the engine at it, and vLLM falls back to serving
facebook/opt-125m when no model path is given. The serving backends
(native, llm-d) need a single, consistent way to wire a referenced
cache into the engine.

Add cache_pvc_name(), cache_mounts(), and apply_cache_args() to
backends.base, plus the CACHE_MOUNT_PATH, _CACHE_VOLUME, and
PVC_NAME_PREFIX constants. cache_pvc_name derives the workload PVC name
as f"modelcache-{namespace}-{name}"[:63], identical to
compose-model-cache's _pvc_name(), so serving pods mount the claim the
cache actually created. cache_mounts returns the RWX PVC volume and a
read-write mount at /mnt/models (engines write tokenizer/compile/lock
artifacts, so a readOnly mount would hard-fail them). apply_cache_args
injects --model=/mnt/models only for the turnkey vLLM path: it is
skipped when no cache is referenced, when the engine brings its own
command (a non-vLLM engine like SGLang owns its args and uses
--model-path), or when the user already set --model.
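
The helper behaviour described above might look roughly like this. The signatures are simplified assumptions (the real helpers take a replica), and the volume name `model-cache` is borrowed from the native backend commit message.

```python
from typing import List, Optional, Sequence, Tuple

CACHE_MOUNT_PATH = "/mnt/models"
CACHE_VOLUME = "model-cache"


def cache_mounts(pvc_name: Optional[str]) -> Tuple[List[dict], List[dict]]:
    """Return (volumes, volumeMounts) for the cache PVC, or empty lists without a cache."""
    if not pvc_name:
        return [], []
    volumes = [{"name": CACHE_VOLUME, "persistentVolumeClaim": {"claimName": pvc_name}}]
    # No readOnly flag: engines write tokenizer, compile, and lock artifacts.
    mounts = [{"name": CACHE_VOLUME, "mountPath": CACHE_MOUNT_PATH}]
    return volumes, mounts


def apply_cache_args(
    args: Sequence[str], has_cache: bool, command: Optional[Sequence[str]] = None
) -> List[str]:
    """Inject --model=/mnt/models only for the turnkey vLLM path."""
    if not has_cache or command:
        return list(args)  # no cache, or the engine owns its own command line
    if any(a == "--model" or a.startswith("--model=") for a in args):
        return list(args)  # the user already chose a model
    return [f"--model={CACHE_MOUNT_PATH}", *args]
```

Matching on `--model` exactly, or on the `--model=` prefix, keeps an SGLang-style `--model-path` argument from being mistaken for a user-supplied `--model`.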

These helpers are additive; Tasks 7 and 8 wire them into the backends.

Towards #66.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
The native single-pod backend ignored a replica's modelCacheRef: the
engine pod had no volume for the cache PVC and no mount, so a deployment
asking to serve from a warmed cache would instead fetch weights from
their source at startup (or fail to find them).

Wire the shared cache helpers into NativeBackend.build: cache_mounts
adds the model-cache volume and its /mnt/models mount when a cache is
referenced (empty lists otherwise, so the no-cache path is byte-for-byte
unchanged), and apply_cache_args fills in --model=/mnt/models only when
the engine hasn't set it and has no command of its own — leaving a
single-pod SGLang's --model-path intact.

Towards #66.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
The native backend mounts a referenced ModelCache, but the multi-node
llm-d backend did not, so a gang-scheduled replica would start its
LeaderWorkerSet pods with no /mnt/models and fall back to fetching
weights from source on every node — defeating the cache and risking
divergent shards across the gang.

Thread base.cache_mounts(replica) through the inner container() and
pod_spec() builders so the cache volume and /mnt/models mount land on
both the leader and worker templates; every node of the gang loads its
shard from the shared RWX PVC. For the turnkey vLLM bootstrap, also run
base.apply_cache_args over the leader command's args so --model defaults
to the mount when absent. Leave the bring-your-own command path (SGLang
etc.) untouched: it sets --model-path itself, and apply_cache_args
no-ops when the engine has a command, so injecting --model would only
corrupt a verbatim user command.

Towards #66.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
Fresh GCP projects have file.googleapis.com disabled, so the Filestore
CSI addon cannot provision the RWX volumes that ModelCache relies on:
cache PVCs sit Pending and provisioning fails with SERVICE_DISABLED.

Compose a ProjectService alongside the GKE networking that enables
file.googleapis.com for the cluster's project, with disableOnDestroy
false so tearing down a cluster does not disable the API for other
workloads in the project. Track it in mark_readiness alongside the
other managed resources.

Towards #66.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
A ModelCache PVC on GKE needs a ReadWriteMany StorageClass. The GKE Filestore
CSI driver provisions RWX volumes, but it defaults to the `default` VPC, so a
PVC on a Modelplane-provisioned cluster hangs Pending. The StorageClass must
pin parameters.network to the cluster's own VPC.

That network name can't be derived from the XR: compose-gke-cluster composes
the VPC Network without a fixed name, so Crossplane gives it a provider-assigned
suffix and the real GCP network is <name>-<suffix>. Pinning to the bare XR name
fails with "network '<name>' does not exist" (verified live on GKE), which would
defeat the StorageClass — it exists precisely to keep PVCs off the default VPC.

compose-gke-cluster now reads the composed Network's external-name once observed
and reports it on GKECluster.status.network.name. compose-inference-cluster reads
that and composes the modelplane-rwx Filestore StorageClass pinned to it, gated
on the name being known, so the class is only created once the real network is
resolvable and always pins to it. The StorageClass has no Ready condition, so the
provider-kubernetes Object uses the SuccessfulCreate readiness policy.
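
The gating can be sketched as below. This is illustrative only: the function name is hypothetical, the provisioner string is assumed to be the GKE Filestore CSI driver's, and any other parameters the real StorageClass sets are omitted.

```python
from typing import Optional


def filestore_storage_class(network_name: Optional[str]) -> Optional[dict]:
    """Build the modelplane-rwx StorageClass, or None until the VPC name is known."""
    if not network_name:
        # The Network's external-name has not been observed yet, so skip the
        # class rather than pin it to a name that may not exist.
        return None
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": "modelplane-rwx"},
        "provisioner": "filestore.csi.storage.gke.io",  # assumed driver name
        "parameters": {"network": network_name},
    }
```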

Towards #66.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
The examples directory had only the multi-node kimi-k2 cache example,
whose header still warned that compose-model-cache had not shipped and
that applying it would fail with "no composition found for kind
ModelCache." The composition ships now, and there was no small,
runnable cached example for the common single-pod case. The docs
described the cache PVC but not that it is hydrated once by a Job,
mounted read-write at /mnt/models across an LWS gang, or what the
admin must provision per cloud for the RWX StorageClass.

Add examples/cache/qwen-cached.yaml: a public Qwen3-0.6B ModelCache
plus a single-pod ModelDeployment that sets --model=/mnt/models
explicitly. Drop the stale caveat from kimi-k2.yaml. Expand the
concepts.md ModelCache subsection to state the once-hydrated Job, the
read-write shared mount, that the engine reads weights locally, and
that an uncached deployment fetches at boot and must supply
credentials (HF_TOKEN via engine.env). Document the per-cloud storage
prerequisites: GKE auto-provisions modelplane-rwx Filestore, while EKS
is bring-your-own (aws-efs-csi-driver add-on, EFS file system and
mount targets, and a modelplane-rwx-efs StorageClass). Point
getting-started.md at examples/cache/.

Towards #66.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
The new compose-model-cache function is a `functions/*` workspace member, so
the workspace lock must include it. Scaffolding the function left uv.lock
stale, which fails the offline `uv lock --locked` check (it can't resolve the
workspace without network and reports the members unsatisfiable).

Towards #66.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
compose_gke deserialized the observed GKECluster into an opaque dict to read
its network name, recomputed the ClusterProviderConfig name with child_name
instead of reading the one it composes, and had two adjacent
`if gke_ready and kubeconfig` blocks.

Read both through their generated models, source the ProviderConfig name from
the observed resource so it survives a naming change, and merge the duplicate
blocks. The kubeconfig-secret local is renamed to stop reading as the
ProviderConfig.

Towards #66.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
compose_replicas built the replica spec from a kwargs dict, which gave up the
type checking of constructing SpecModel directly. Build the typed ModelReplica
and set spec.modelCacheRef only when the ModelDeployment has one.

Towards #66.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
ModelCache inferred its source from which sub-object was set
(spec.source.huggingFace). A required discriminator is clearer, simplifies the
function, and can't be added later, so spec.source is now a required enum
(HuggingFace) with sibling config objects (spec.huggingFace) — matching the
InferenceCluster.spec.cluster.source pattern. A CEL rule requires the matching
object, which retires the function's runtime no-source guard.
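
For illustration, the invariant the CEL rule enforces can be restated in Python. The actual rule lives in the XRD and is not shown in this PR text; `validate_source` and its error strings are hypothetical.

```python
from typing import List

# Each source enum value and the sibling config object it requires.
REQUIRED_CONFIG = {"HuggingFace": "huggingFace"}


def validate_source(spec: dict) -> List[str]:
    """Return validation errors for a ModelCache spec; empty when valid."""
    source = spec.get("source")
    if source not in REQUIRED_CONFIG:
        return [f"spec.source must be one of {sorted(REQUIRED_CONFIG)}"]
    config_key = REQUIRED_CONFIG[source]
    if config_key not in spec:
        return [f"spec.{config_key} is required when spec.source is {source}"]
    return []
```

With the check enforced at admission time, the function can assume a well-formed source and no longer needs its own guard.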

Alongside, tighten the composition:

- Name the cache PVC and Job with resource.child_name (deterministic hash plus
  DNS-safe truncation) rather than a hand-rolled slice; the serving side
  (backends/base.cache_pvc_name) derives the same name.
- Give the PVC and Job Objects a DeriveFromCelQuery readiness so each derives
  its Ready condition from the wrapped resource instead of the function
  re-parsing status.
- Report a failed hydration on the ArtifactReady condition (reason Failed)
  rather than Hydrating.
- Read observed Objects through the generated Pydantic model.
- Show spec.clusterSelector in an example and document that omitting it stages
  the cache on every matched cluster.

Tests move to the table-of-Cases request/response golden pattern the other
functions use.

Towards #66.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
compose-model-cache was scaffolded before the RUF lint rules landed, so its
CLI kept a `# noqa:FBT001` that the other functions' entrypoints have since
dropped (the directive never suppressed anything, and RUF100 now flags it).

Towards #66.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
@dennis-upbound dennis-upbound force-pushed the dennis/modelcache-v01 branch from c97312d to cdea5e1 Compare June 9, 2026 14:56
@dennis-upbound

> Thanks @dennis-upbound!
>
> really like the PR description. I found it way easier to read. Did you get an agent to generate it using the new guidance in CONTRIBUTING.md?

Yep! Thanks for setting up the guidance. I still had to clean up a little LLM fluff, but overall not bad.

@dennis-upbound dennis-upbound requested a review from negz June 9, 2026 15:03
The design doc still showed `spec.source` as an object holding `huggingFace`.
The implemented API makes `source` a required enum discriminator with a sibling
`huggingFace` object, enforced by a CEL rule. Update the example and the source
description to match.

Towards #66.

Signed-off-by: Dennis Ramdass <dennis@upbound.io>
@negz negz merged commit f66a99f into main Jun 9, 2026
3 checks passed
@negz negz deleted the dennis/modelcache-v01 branch June 16, 2026 16:57
Development

Successfully merging this pull request may close these issues.

ModelCache v0.1 — PVC backend, multi-node

2 participants