Shared storage for multi-node inference #61

@dennis-upbound

Problem

Multi-node inference with LeaderWorkerSet requires all pods (leader + workers) to access the same model files. Today, Modelplane sets model.uri: hf://... on the LLMInferenceService, which triggers a per-pod init container download to emptyDir. This works for single-node but breaks for multi-node: each pod would download the model independently (600GB x N pods), and KServe explicitly requires a ReadWriteMany PVC for multi-node.

ref: https://kserve.github.io/website/docs/model-serving/generative-inference/multi-node

Proposal

When multi-node is detected (the model's VRAM requirement exceeds what a single node provides), the composition should:

  1. Compose a ReadWriteMany PVC on the remote cluster using a storage class declared on the InferenceEnvironment
  2. Compose a one-shot download Job that pulls the model from HuggingFace to the PVC
  3. Set model.uri: pvc://<name>/<path> instead of hf://... on the LLMInferenceService
  4. Mount the shared PVC on all pods (leader + workers)

For single-node models, the current init container + emptyDir approach stays as-is.
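
As a rough illustration of step 3, the only change visible on the LLMInferenceService would be the model URI. This is a fragment, not a full manifest; the placeholders are the ones used above and are left unfilled.

```yaml
# Single-node (current behaviour): each pod downloads into its own emptyDir
spec:
  model:
    uri: "hf://..."
---
# Multi-node (proposed): leader and workers read from the shared PVC
spec:
  model:
    uri: "pvc://<name>/<path>"
```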

Where storage is configured

The InferenceEnvironment declares available shared storage. This is a platform team concern: they know which RWX storage class their cluster supports (GCP Filestore, AWS EFS, Azure Files, etc.).

# InferenceEnvironment
spec:
  cluster:
    source: GKE
    gke:
      project: my-project
      region: us-central1
      nodePools: [...]
    storage:
      storageClassName: filestore-rwx

When the scheduler detects multi-node is needed and the target environment has no storage.storageClassName, it rejects with a clear condition: "multi-node model requires shared storage on the inference environment."
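
A sketch of how that rejection might surface, assuming the standard Kubernetes condition shape. The condition type and reason below are hypothetical names chosen for illustration; only the message comes from this proposal.

```yaml
status:
  conditions:
    - type: Scheduled                   # hypothetical condition type
      status: "False"
      reason: SharedStorageRequired     # hypothetical reason
      message: multi-node model requires shared storage on the inference environment
```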

What gets composed for multi-node

  1. PVC: ReadWriteMany, sized to the model (e.g. resources.vram as a rough proxy, or a new resources.disk field)
  2. Download Job: uses kserve/storage-initializer image, runs hf:// download to the PVC mount, runs once to completion
  3. LLMInferenceService: model.uri set to pvc:// instead of hf://, all pods mount the shared PVC
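
A minimal sketch of the first two composed resources, to make the shape concrete. The resource names, the 1Ti size, and the backoffLimit are hypothetical. Passing the source URI and destination path as positional args mirrors how KServe invokes the storage initializer as an init container, but treat that as an assumption to verify against the image actually used.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache                   # hypothetical name
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: filestore-rwx     # from storage.storageClassName on the InferenceEnvironment
  resources:
    requests:
      storage: 1Ti                    # illustrative; would be derived from the model size
---
apiVersion: batch/v1
kind: Job
metadata:
  name: model-download                # hypothetical name
spec:
  backoffLimit: 2                     # illustrative
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: download
          image: kserve/storage-initializer
          args:
            - "hf://..."              # source URI, left elided as in the issue
            - /mnt/models             # destination: the PVC mount below
          volumeMounts:
            - name: model
              mountPath: /mnt/models
      volumes:
        - name: model
          persistentVolumeClaim:
            claimName: model-cache
```

Because the download runs in its own Job, its memory requests and limits can be set independently of the inference pods.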

What stays the same

  • Single-node models: init container + emptyDir, no PVC, no Job
  • The scheduling function: already detects multi-node via VRAM math
  • The ClusterModel spec: no changes, storage is infrastructure

Context

Hit this while deploying Kimi K2.6 (1T MoE, 16x A100 across 4 nodes). The multi-node composition (parallelism, worker spec) works but model download fails because each pod tries to download 600GB independently. Also ran into KServe storage initializer OOMing at 4Gi/8Gi/16Gi for large model downloads — the Job-based approach avoids init container memory limits entirely.
