Replica topology and capability-based scheduling #52

Description

@bassam

What problem are you facing?

Modelplane can't deploy multi-node models. Kimi K2, DeepSeek V3, Llama 4 Behemoth, and similar frontier models require deployment shapes the current schema can't express — multiple nodes coordinating via high-bandwidth interconnect, structured parallelism strategy, hardware capability requirements (FP8 support, NVLink, IB-400g+). The deploy function divides total VRAM by per-GPU memory to compute GPU count. For Kimi K2 this gives a number but not a deployable configuration. There's no way to tell the scheduler that 16 GPUs need to be 2 nodes of 8 with TP=8 + PP=2, not 1 node of 16 (which doesn't exist) or 16 nodes of 1 (wrong sharding).
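
To make the gap concrete, here is a minimal sketch of why a bare GPU count underdetermines the deployment. The helper name `candidate_shapes` is hypothetical and not part of Modelplane; it simply lists every way to split a GPU count into equally sized nodes.

```python
def candidate_shapes(gpu_count: int) -> list[tuple[int, int]]:
    """Return every (nodes, gpus_per_node) pair whose product is gpu_count."""
    return [
        (nodes, gpu_count // nodes)
        for nodes in range(1, gpu_count + 1)
        if gpu_count % nodes == 0
    ]


# A count of 16 is compatible with five shapes. Only (2, 8) is the
# 2-nodes-of-8 layout described above; the count alone cannot say so.
shapes = candidate_shapes(16)
print(shapes)
```

The scheduler needs the shape as an explicit input, which is what the rest of this proposal adds.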

How could Modelplane help solve your problem?

Frontier models share a deployment shape. Kimi K2 (1T/32B MoE) needs 16 GPUs across 2 H200 nodes connected by IB-Quantum-2, running TP=8 + PP=2 with expert parallelism enabled, FP8 quantization, 141Gi per GPU. DeepSeek V3 has similar requirements. Llama 405B at full precision spans multiple nodes. Multi-node deployment with structured parallelism and hardware-specific requirements is the production pattern for serious models, not an edge case.

KServe LLMInferenceService already supports this. The parallelism block expresses tensor and pipeline parallelism structurally. The workerNodeSize field tells KServe to compose a multi-pod deployment across nodes via LeaderWorkerSet. Each pod gets its share of GPUs; KServe handles the Ray coordination. We just don't expose any of this through the Modelplane API today.

The API needs three things: a way to express the shape of a replica, a way to declare what hardware capabilities pools have, and a matching algorithm that puts them together correctly.

I think the natural place for shape is a replicaTopology block on the serving profile. The block describes one independent serving instance: what hardware it needs, how it shards across nodes, and what placement constraints apply. The current resources.vram field would be removed, since the topology block subsumes it.

apiVersion: modelplane.ai/v1alpha1
kind: ClusterModel
metadata:
  name: kimi-k2-instruct
spec:
  model:
    name: moonshotai/Kimi-K2-Instruct
  source: HuggingFace
  huggingFace:
    repo: moonshotai/Kimi-K2-Instruct

  environmentSelector:
    matchLabels:
      modelplane.ai/tier: production

  serving:
  - name: vllm-h200-multinode
    replicaTopology:
      nodes: 2
      gpusPerNode: 8
      parallelism:
        tensor: 8
        pipeline: 2
        expert: enabled
      requires:
        memoryPerGpu: "141Gi"
        interconnect: nvlink
        multiNodeBandwidth: ib-400g-or-better
        precisionSupport: ["fp8"]
        nvlinkDomainSize: 8
      placement:
        podsAcrossNodes: required
        replicasAcrossNodes: preferred
    engine:
      name: vLLM
      image: vllm/vllm-openai:v0.8.0
      args:
      - "--tensor-parallel-size=8"
      - "--pipeline-parallel-size=2"
      - "--enable-expert-parallel"
      - "--quantization=fp8"
      - "--max-model-len=65536"
      - "--gpu-memory-utilization=0.90"
      - "--enable-prefix-caching"
      - "--max-num-seqs=256"
      - "--distributed-executor-backend=ray"

The replicaTopology block has clear sub-fields. nodes and gpusPerNode define the deployment shape — KServe's workerNodeSize comes from nodes, and the per-pod GPU resource limit comes from gpusPerNode. parallelism populates KServe's structured parallelism block; this duplicates information that also appears in engine args, but the duplication is unavoidable because KServe needs structured fields for LeaderWorkerSet composition while vLLM needs the args. The serving profile author keeps them consistent.
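
Because the duplication is left to the author, a lint step could catch drift between the two. The following is a sketch only: `check_profile_consistency` is a hypothetical helper, and the rule that tensor × pipeline must equal nodes × gpusPerNode is an assumption that holds for the Kimi K2 example but would not hold if data parallelism were added.

```python
def check_profile_consistency(profile: dict) -> list[str]:
    """Return a list of problems; an empty list means the profile is consistent."""
    topo = profile["replicaTopology"]
    par = topo["parallelism"]
    problems = []

    # Assumption: TP x PP accounts for every GPU in the replica.
    total_gpus = topo["nodes"] * topo["gpusPerNode"]
    if par["tensor"] * par["pipeline"] != total_gpus:
        problems.append(
            f"tensor*pipeline={par['tensor'] * par['pipeline']} "
            f"but replica has {total_gpus} GPUs"
        )

    # Each structured field must agree with the matching vLLM flag.
    expected = {
        "--tensor-parallel-size": str(par["tensor"]),
        "--pipeline-parallel-size": str(par["pipeline"]),
    }
    given = dict(a.split("=", 1) for a in profile["engine"]["args"] if "=" in a)
    for flag, want in expected.items():
        if given.get(flag) != want:
            problems.append(f"{flag} is {given.get(flag)!r}, expected {want!r}")
    return problems


kimi = {
    "replicaTopology": {
        "nodes": 2,
        "gpusPerNode": 8,
        "parallelism": {"tensor": 8, "pipeline": 2},
    },
    "engine": {
        "args": [
            "--tensor-parallel-size=8",
            "--pipeline-parallel-size=2",
            "--enable-expert-parallel",
        ]
    },
}
print(check_profile_consistency(kimi))
```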

requires is a map of capability constraints. The deploy function matches each one against the corresponding pool capability. Numeric constraints use ≥ comparison (memoryPerGpu: 141Gi means at least 141Gi). Enum constraints use ordered comparison (multiNodeBandwidth: ib-400g-or-better is satisfied by a pool whose inter-node bandwidth is ib-400g or anything higher in the ordering). List constraints use superset comparison (precisionSupport: [fp8] means the pool's supported formats must include FP8). These comparisons are simple enough to verify by inspection; v0.1 doesn't try to handle multi-dimensional capability comparison or "or-better" beyond a single ordered enum.
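
A minimal sketch of the three comparison rules, under stated assumptions: the `BANDWIDTH_ORDER` list is a hypothetical ordering (the issue does not define the enum), `requires.multiNodeBandwidth` is assumed to be checked against the pool's `interNodeBandwidth`, and `interconnect` is compared by plain equality because no ordering is given for it.

```python
# Hypothetical ordering for the bandwidth enum, weakest first.
BANDWIDTH_ORDER = ["ib-100g", "ib-200g", "ib-400g", "ib-800g"]

_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(q: str) -> int:
    """Convert a binary-suffixed quantity such as '141Gi' to bytes."""
    return int(q[:-2]) * _UNITS[q[-2:]]


def satisfies(requires: dict, caps: dict) -> bool:
    """True if the pool capabilities meet every constraint in requires."""
    for key, want in requires.items():
        if key == "memoryPerGpu":  # numeric: at least
            ok = key in caps and parse_quantity(caps[key]) >= parse_quantity(want)
        elif key == "nvlinkDomainSize":  # numeric: at least
            ok = caps.get(key, 0) >= want
        elif key == "multiNodeBandwidth":  # ordered enum
            have = caps.get("interNodeBandwidth")
            floor = want.removesuffix("-or-better")
            ok = (have in BANDWIDTH_ORDER
                  and BANDWIDTH_ORDER.index(have) >= BANDWIDTH_ORDER.index(floor))
        elif key == "precisionSupport":  # list: pool must be a superset
            ok = set(want) <= set(caps.get(key, []))
        else:  # assumption: anything else is plain equality
            ok = caps.get(key) == want
        if not ok:
            return False
    return True


requires = {
    "memoryPerGpu": "141Gi",
    "interconnect": "nvlink",
    "multiNodeBandwidth": "ib-400g-or-better",
    "precisionSupport": ["fp8"],
    "nvlinkDomainSize": 8,
}
h100_pool = {"memoryPerGpu": "80Gi", "interconnect": "nvlink",
             "precisionSupport": ["fp16", "bf16", "fp8"], "nvlinkDomainSize": 8}
h200_pool = {"memoryPerGpu": "141Gi", "interconnect": "nvlink",
             "precisionSupport": ["fp16", "bf16", "fp8"],
             "interNodeBandwidth": "ib-400g", "nvlinkDomainSize": 8}
print(satisfies(requires, h100_pool), satisfies(requires, h200_pool))
```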

placement controls pod and replica spread. podsAcrossNodes: required is a correctness constraint for multi-node replicas: the parallelism layout doesn't work if both pods of a replica land on the same node. replicasAcrossNodes: preferred is a high-availability hint: multiple replicas should land on different nodes if capacity allows, but can pack together if needed.
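
One way the two placement modes could translate into Kubernetes pod anti-affinity is sketched below. The function name `anti_affinity` and the weight of 100 for the preferred case are assumptions; the `app` label selector mirrors the composed manifest later in this issue.

```python
def anti_affinity(mode: str, app_label: str) -> dict:
    """Build a podAntiAffinity stanza for mode 'required' or 'preferred'."""
    term = {
        "labelSelector": {"matchLabels": {"app": app_label}},
        "topologyKey": "kubernetes.io/hostname",
    }
    if mode == "required":
        # Hard rule: the scheduler refuses to co-locate matching pods.
        rule = {"requiredDuringSchedulingIgnoredDuringExecution": [term]}
    elif mode == "preferred":
        # Soft rule: spread when possible, pack when capacity is short.
        rule = {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {"weight": 100, "podAffinityTerm": term}
            ]
        }
    else:
        raise ValueError(f"unknown placement mode: {mode!r}")
    return {"podAntiAffinity": rule}


print(anti_affinity("required", "kimi-k2-instruct"))
```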

For the InferenceEnvironment side, node pools need a capabilities block declaring what their hardware supports. The platform engineer provisioning pools knows this; today's schema doesn't have a place to put it.

apiVersion: modelplane.ai/v1alpha1
kind: InferenceEnvironment
metadata:
  name: prod-coreweave-us-east
  labels:
    modelplane.ai/region: us-east
    modelplane.ai/tier: production
spec:
  cluster:
    source: Existing
    existing:
      secretRef:
        name: cw-cluster-kubeconfig
      provider: coreweave
      instanceType: HGX-H200-IB

  nodePools:
  - name: medium-models
    acceleratorType: nvidia-h100-80gb
    acceleratorCount: 8
    nodeCount: 0
    maxNodeCount: 8
    labels:
      modelplane.ai/pool: medium
    capabilities:
      memoryPerGpu: "80Gi"
      interconnect: nvlink
      precisionSupport: ["fp16", "bf16", "fp8"]
      multiNodeCapable: false
      nvlinkDomainSize: 8

  - name: frontier-multinode
    acceleratorType: nvidia-h200-141gb
    acceleratorCount: 8
    nodeCount: 0
    maxNodeCount: 4
    labels:
      modelplane.ai/pool: frontier
    capabilities:
      memoryPerGpu: "141Gi"
      interconnect: nvlink
      precisionSupport: ["fp16", "bf16", "fp8"]
      multiNodeCapable: true
      interNodeBandwidth: ib-400g
      nvlinkDomainSize: 8

The capabilities block is the source of truth for a pool's hardware: memory per GPU, interconnect type, supported precisions, and multi-node networking. The multiNodeCapable field is a discriminator: pools where it is false or unset fail matching for multi-node profiles (nodes > 1) even if they have the right GPU type, because their networking isn't configured for it.
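
The multiNodeCapable check is implicit, since it does not appear in a profile's requires. A sketch of that gate, assuming it is triggered by nodes > 1 and that a missing field is treated like false (the function name is hypothetical):

```python
def passes_multinode_gate(nodes: int, capabilities: dict) -> bool:
    """Single-node replicas always pass; multi-node replicas need the flag."""
    if nodes <= 1:
        return True
    # Assumption: an unset multiNodeCapable counts as false.
    return capabilities.get("multiNodeCapable", False) is True


print(passes_multinode_gate(2, {"multiNodeCapable": False}))
print(passes_multinode_gate(2, {"multiNodeCapable": True}))
```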

provider and instanceType on the cluster are infrastructure context. They don't participate in matching at v0.1, but they enable richer matching later (e.g., recipe-based validation that says "this configuration is validated on Coreweave HGX-H200-IB"). Operators authoring their own InferenceEnvironments can leave these blank.

Profile matching gains capability filtering. The deploy function walks profiles in order and, for each profile, walks pools in order. The first pool whose capabilities satisfy the profile's replicaTopology.requires, and that has capacity for the requested number of nodes with gpusPerNode GPUs each (nodes × gpusPerNode GPUs in total), wins.

For each environment matching ClusterModel.environmentSelector:
  For each profile in ClusterModel.serving:
    For each pool in environment.nodePools:
      If pool.capabilities satisfies profile.replicaTopology.requires:
        If pool has capacity for profile.replicaTopology.nodes nodes
           with profile.replicaTopology.gpusPerNode GPUs each:
          Match found.
          Compose ModelPlacement with this profile and pool.
          Generate LLMInferenceService:
            replicas: from ModelDeployment scaling
            workerNodeSize: profile.replicaTopology.nodes
            parallelism: profile.replicaTopology.parallelism
            template.nodeSelector: pool.labels
            template.affinity: from profile.replicaTopology.placement
            container.args: profile.engine.args
            container.resources.limits[nvidia.com/gpu]: profile.replicaTopology.gpusPerNode

The matching algorithm is deterministic given declaration order. Operators control behavior by ordering profiles (priority) and pools (preference). When two pools in the same environment match the same profile, pool declaration order decides; v0.1 doesn't try to be clever about cost-aware or capability-conservative selection. v0.2+ can add this.

For Kimi K2 specifically, the matching cascade looks like: filter environments by tier: production (the production environment matches), walk profiles (only vllm-h200-multinode), walk pools (medium-models fails on multiNodeCapable=false; frontier-multinode satisfies all requirements), match found, compose LLMInferenceService with workerNodeSize: 2, parallelism: {tensor: 8, pipeline: 2}, anti-affinity required across nodes, GPU limit 8 per pod, nodeSelector targeting the frontier pool's labels.

The composed LLMInferenceService:

apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: kimi-k2-instruct
  namespace: ml-team-research
spec:
  model:
    uri: hf://moonshotai/Kimi-K2-Instruct
  replicas: 1
  workerNodeSize: 2
  parallelism:
    tensor: 8
    pipeline: 2
  template:
    spec:
      nodeSelector:
        modelplane.ai/pool: frontier
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: kimi-k2-instruct
            topologyKey: kubernetes.io/hostname
      containers:
      - name: server
        image: vllm/vllm-openai:v0.8.0
        args:
        - --model=moonshotai/Kimi-K2-Instruct
        - --tensor-parallel-size=8
        - --pipeline-parallel-size=2
        - --enable-expert-parallel
        - --quantization=fp8
        - --max-model-len=65536
        - --gpu-memory-utilization=0.90
        - --enable-prefix-caching
        - --max-num-seqs=256
        - --distributed-executor-backend=ray
        resources:
          limits:
            nvidia.com/gpu: 8
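
The field mapping that produces a manifest like the one above can be sketched as a plain function. This is a hypothetical helper, not Modelplane code: field names such as `workerNodeSize` are taken from this issue rather than checked against the KServe schema, and the affinity stanza and the `--model` argument are omitted to keep the sketch short.

```python
def compose_llm_inference_service(
    name: str, namespace: str, model_uri: str,
    profile: dict, pool: dict, replicas: int = 1,
) -> dict:
    """Map a matched profile and pool onto an LLMInferenceService dict."""
    topo = profile["replicaTopology"]
    par = topo["parallelism"]
    engine = profile["engine"]
    return {
        "apiVersion": "serving.kserve.io/v1alpha1",
        "kind": "LLMInferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "model": {"uri": model_uri},
            "replicas": replicas,
            "workerNodeSize": topo["nodes"],  # one pod per node
            "parallelism": {"tensor": par["tensor"], "pipeline": par["pipeline"]},
            "template": {"spec": {
                "nodeSelector": dict(pool["labels"]),  # pin to the matched pool
                "containers": [{
                    "name": "server",
                    "image": engine["image"],
                    "args": list(engine["args"]),
                    "resources": {
                        "limits": {"nvidia.com/gpu": topo["gpusPerNode"]}
                    },
                }],
            }},
        },
    }


profile = {
    "replicaTopology": {"nodes": 2, "gpusPerNode": 8,
                        "parallelism": {"tensor": 8, "pipeline": 2}},
    "engine": {"image": "vllm/vllm-openai:v0.8.0",
               "args": ["--tensor-parallel-size=8", "--pipeline-parallel-size=2"]},
}
pool = {"name": "frontier-multinode", "labels": {"modelplane.ai/pool": "frontier"}}
svc = compose_llm_inference_service(
    "kimi-k2-instruct", "ml-team-research",
    "hf://moonshotai/Kimi-K2-Instruct", profile, pool,
)
print(svc["spec"]["workerNodeSize"])
```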

KServe creates a LeaderWorkerSet with 2 pods per replica. The cluster autoscaler provisions H200 nodes from the frontier pool. Pods schedule one per node (anti-affinity required). vLLM starts in each pod, Ray sets up communication over IB, and the model loads with TP=8 within each node and PP=2 across the two. The LLMInferenceService then becomes ready.

The same API handles single-node deployments. A Gemma 3 27B profile sets nodes: 1, gpusPerNode: 1, requires.memoryPerGpu: 40Gi, and requires.precisionSupport: [fp8]. It matches the medium-models pool (H100, supports FP8, 80Gi per GPU). The composed LLMInferenceService has workerNodeSize: 1 (no LeaderWorkerSet, just a regular Deployment), a GPU limit of 1, and a nodeSelector for the medium pool. Single-node deployments are the special case where the multi-node machinery isn't needed; the schema accommodates both uniformly.

This proposal sketches one shape but the design space is meaningfully bigger than what I've laid out, and getting it wrong is expensive — schema decisions made here will be hard to change once operators have ClusterModels in production. This needs a full design that accounts for the different topology patterns and the tradeoffs between them.
