Shared storage for multi-node inference

## Problem

Multi-node inference with LeaderWorkerSet requires all pods (leader + workers) to access the same model files. Today, Modelplane sets `model.uri: hf://...` on the LLMInferenceService, which triggers a per-pod init container download to emptyDir. This works for single-node but breaks for multi-node: each pod would download the model independently (600 GB × N pods), and KServe explicitly requires a ReadWriteMany PVC for multi-node.

ref: https://kserve.github.io/website/docs/model-serving/generative-inference/multi-node

## Proposal

When multi-node is detected (the model's VRAM requirement exceeds what a single node provides), the composition should:

1. Compose a ReadWriteMany PVC on the remote cluster using a storage class declared on the InferenceEnvironment
2. Compose a one-shot download Job that pulls the model from HuggingFace to the PVC
3. Set `model.uri: pvc://<name>/<path>` instead of `hf://...` on the LLMInferenceService
4. Mount the shared PVC in all pods (leader + workers)

For single-node models, the current init container + emptyDir approach stays as-is.
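As a rough illustration of step 3, the two `model.uri` forms might look like this on the LLMInferenceService. This is a sketch, not the composed output: the `apiVersion`, the resource names, and the PVC name `model-weights` with path `model` are assumptions for illustration, and the `hf://...` URI is left elided as above.

```yaml
# Single-node (current behavior): per-pod init container downloads to emptyDir
apiVersion: serving.kserve.io/v1alpha1   # assumed API version
kind: LLMInferenceService
metadata:
  name: example-single-node              # hypothetical name
spec:
  model:
    uri: hf://...
---
# Multi-node (proposed): every pod reads from the shared RWX PVC
apiVersion: serving.kserve.io/v1alpha1   # assumed API version
kind: LLMInferenceService
metadata:
  name: example-multi-node               # hypothetical name
spec:
  model:
    uri: pvc://model-weights/model       # pvc://<name>/<path>; both parts are hypothetical
```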

### Where storage is configured

The InferenceEnvironment declares available shared storage. This is a platform team concern: they know which RWX storage class their cluster supports (GCP Filestore, AWS EFS, Azure Files, etc.).

```yaml
# InferenceEnvironment
spec:
  cluster:
    source: GKE
    gke:
      project: my-project
      region: us-central1
      nodePools: [...]
    storage:
      storageClassName: filestore-rwx
```

When the scheduler detects multi-node is needed and the target environment has no `storage.storageClassName`, it rejects with a clear condition: "multi-node model requires shared storage on the inference environment."
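For illustration, the rejection could surface as a standard Kubernetes-style condition on whichever resource the scheduler reports status on. The condition `type` and `reason` below are hypothetical names, not existing Modelplane fields; only the message text comes from this proposal.

```yaml
# Hypothetical status block; type and reason are assumed names
status:
  conditions:
    - type: Scheduled                    # assumed condition type
      status: "False"
      reason: SharedStorageRequired      # assumed reason
      message: multi-node model requires shared storage on the inference environment
```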

### What gets composed for multi-node

1. **PVC** — ReadWriteMany, sized to the model (e.g. `resources.vram` as a rough proxy, or a new `resources.disk` field)
2. **Download Job** — uses `kserve/storage-initializer` image, runs `hf://` download to the PVC mount, runs once to completion
3. **LLMInferenceService** — `model.uri` set to `pvc://` instead of `hf://`, all pods mount the shared PVC
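A minimal sketch of the first two composed resources, assuming the `filestore-rwx` storage class from the example above. The resource names, PVC size, memory limit, and the `hf-token` Secret are hypothetical; the storage initializer is assumed to take a source URI and a destination path as its two arguments, and the `hf://...` URI is left elided.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights                    # hypothetical name
spec:
  accessModes: [ReadWriteMany]
  storageClassName: filestore-rwx        # from the InferenceEnvironment's storage.storageClassName
  resources:
    requests:
      storage: 1Ti                       # illustrative; should cover the model plus headroom
---
apiVersion: batch/v1
kind: Job
metadata:
  name: model-weights-download           # hypothetical name
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: download
          image: kserve/storage-initializer      # pin a tag matching the installed KServe version
          args: ["hf://...", "/mnt/models/model"]  # <source uri> <destination path>
          env:
            - name: HF_TOKEN                     # only needed for gated models
              valueFrom:
                secretKeyRef:
                  name: hf-token                 # hypothetical Secret
                  key: token
          resources:
            limits:
              memory: 32Gi                       # illustrative, not measured; set independently of serving pods
          volumeMounts:
            - name: weights
              mountPath: /mnt/models
      volumes:
        - name: weights
          persistentVolumeClaim:
            claimName: model-weights
```

With this layout, the LLMInferenceService would reference the weights as `pvc://model-weights/model`, since the Job writes to the `model` directory at the root of the volume.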

### What stays the same

- Single-node models: init container + emptyDir, no PVC, no Job
- The scheduling function: already detects multi-node via VRAM math
- The ClusterModel spec: no changes, storage is infrastructure

## Context

Hit this while deploying Kimi K2.6 (1T MoE, 16x A100 across 4 nodes). The multi-node composition (parallelism, worker spec) works, but the model download fails because each pod tries to download 600 GB independently. Also ran into the KServe storage initializer OOMing at 4Gi/8Gi/16Gi memory limits on large model downloads; the Job-based approach avoids init container memory limits entirely.
