Problem
Multi-node inference with LeaderWorkerSet requires all pods (leader + workers) to access the same model files. Today, Modelplane sets model.uri: hf://... on the LLMInferenceService, which triggers a per-pod init container download to emptyDir. This works for single-node but breaks for multi-node: every pod downloads its own copy (600GB x N pods), and KServe explicitly requires a ReadWriteMany PVC for multi-node.
ref: https://kserve.github.io/website/docs/model-serving/generative-inference/multi-node
Proposal
When multi-node is detected (the model's VRAM requirement exceeds what a single node provides), the composition should:
- Compose a ReadWriteMany PVC on the remote cluster using a storage class declared on the InferenceEnvironment
- Compose a one-shot download Job that pulls the model from HuggingFace to the PVC
- Set model.uri: pvc://<name>/<path> instead of hf://... on the LLMInferenceService
- All pods (leader + workers) mount the shared PVC
For single-node models, the current init container + emptyDir approach stays as-is.
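The branching above can be sketched roughly as follows. This is a minimal illustration, not Modelplane's actual composition code: the `Placement` class, its field names, and the `model_uri` helper are all hypothetical, and the VRAM comparison is a simplified stand-in for the existing scheduling math.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    """Hypothetical inputs; field names are illustrative only."""
    model_vram_gib: int  # VRAM the model needs in total
    node_vram_gib: int   # VRAM a single node provides


def is_multi_node(p: Placement) -> bool:
    # Multi-node when the model does not fit in one node's VRAM.
    return p.model_vram_gib > p.node_vram_gib


def model_uri(p: Placement, hf_repo: str, pvc_name: str, path: str) -> str:
    if is_multi_node(p):
        # Multi-node: every pod reads the model from the shared PVC.
        return f"pvc://{pvc_name}/{path}"
    # Single-node: keep the per-pod init container download to emptyDir.
    return f"hf://{hf_repo}"
```

Keeping the decision in one place means the PVC and the download Job only need to be composed when `is_multi_node` is true.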
Where storage is configured
The InferenceEnvironment declares available shared storage. This is a platform team concern: they know which RWX storage class their cluster supports (GCP Filestore, AWS EFS, Azure Files, etc.).
# InferenceEnvironment
spec:
  cluster:
    source: GKE
    gke:
      project: my-project
      region: us-central1
      nodePools: [...]
  storage:
    storageClassName: filestore-rwx
When the scheduler detects multi-node is needed and the target environment has no storage.storageClassName, it rejects with a clear condition: "multi-node model requires shared storage on the inference environment."
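A rough sketch of that check, assuming the environment spec is available as a plain dict with `storage` directly under `spec`. The function name and the return convention (a message string on rejection, `None` otherwise) are hypothetical; only the condition text comes from the proposal.

```python
from typing import Optional

MISSING_STORAGE_MESSAGE = (
    "multi-node model requires shared storage on the inference environment."
)


def storage_rejection(multi_node: bool, environment_spec: dict) -> Optional[str]:
    """Return a rejection message, or None when the environment is usable."""
    if not multi_node:
        # Single-node models never need shared storage.
        return None
    # Tolerate a missing or empty storage block.
    storage = environment_spec.get("storage") or {}
    if storage.get("storageClassName"):
        return None
    return MISSING_STORAGE_MESSAGE
```

Treating an empty `storageClassName` the same as a missing one avoids composing a PVC that would fall back to the cluster's default storage class, which may not support RWX.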
What gets composed for multi-node
- PVC: ReadWriteMany, sized to the model (e.g. resources.vram as a rough proxy, or a new resources.disk field)
- Download Job: uses the kserve/storage-initializer image, runs the hf:// download to the PVC mount, runs once to completion
- LLMInferenceService: model.uri set to pvc:// instead of hf://, all pods mount the shared PVC
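For illustration, the PVC and download Job could be built as plain manifests along these lines. This is a sketch under assumptions: the helper names, the `/mnt/models` mount path, the `backoffLimit` value, and the argument order passed to the storage initializer (source URI, then destination) are not taken from the proposal and would need checking against the image actually used. The image tag is deliberately left out.

```python
def shared_model_pvc(name: str, storage_class: str, size_gib: int) -> dict:
    """ReadWriteMany claim that all leader and worker pods can mount."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteMany"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": f"{size_gib}Gi"}},
        },
    }


def download_job(name: str, pvc_name: str, hf_uri: str,
                 mount_path: str = "/mnt/models") -> dict:
    """One-shot Job that pulls the model onto the shared PVC."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 2,  # assumed retry budget
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "download",
                        "image": "kserve/storage-initializer",
                        # Assumed argument order: source URI, destination path.
                        "args": [hf_uri, mount_path],
                        "volumeMounts": [
                            {"name": "model", "mountPath": mount_path},
                        ],
                    }],
                    "volumes": [{
                        "name": "model",
                        "persistentVolumeClaim": {"claimName": pvc_name},
                    }],
                },
            },
        },
    }
```

Because the Job is an ordinary pod spec rather than an injected init container, its memory requests and limits can be set independently of the serving pods.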
What stays the same
- Single-node models: init container + emptyDir, no PVC, no Job
- The scheduling function: already detects multi-node via VRAM math
- The ClusterModel spec: no changes, storage is infrastructure
Context
Hit this while deploying Kimi K2.6 (1T MoE, 16x A100 across 4 nodes). The multi-node composition (parallelism, worker spec) works but model download fails because each pod tries to download 600GB independently. Also ran into KServe storage initializer OOMing at 4Gi/8Gi/16Gi for large model downloads — the Job-based approach avoids init container memory limits entirely.