Replica topology and capability-based scheduling #52

### What problem are you facing?

Modelplane can't deploy multi-node models. Kimi K2, DeepSeek V3, Llama 4 Behemoth, and similar frontier models require deployment shapes the current schema can't express: multiple nodes coordinating over a high-bandwidth interconnect, a structured parallelism strategy, and hardware capability requirements (FP8 support, NVLink, IB-400g+). The deploy function divides total VRAM by per-GPU memory to compute a GPU count. For Kimi K2 this gives a number but not a deployable configuration. There's no way to tell the scheduler that 16 GPUs must be 2 nodes of 8 with TP=8 + PP=2, not 1 node of 16 (which doesn't exist) or 16 nodes of 1 (wrong sharding).
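To make the gap concrete, here is a minimal sketch of the VRAM division described above, followed by an enumeration of every node shape a bare GPU count leaves open. This is not the deploy function's actual code. The function names are hypothetical, and the 2256Gi total is an invented figure chosen only so that the division yields 16.

```python
import math

def gpu_count(total_vram_gi: float, per_gpu_vram_gi: float) -> int:
    """GPU count from a plain VRAM division (illustrative only)."""
    return math.ceil(total_vram_gi / per_gpu_vram_gi)

def candidate_shapes(gpus: int) -> list[tuple[int, int]]:
    """Every (nodes, gpus_per_node) pair whose product equals the GPU count."""
    return [(n, gpus // n) for n in range(1, gpus + 1) if gpus % n == 0]

# Hypothetical total VRAM, picked so that 141Gi GPUs give a count of 16.
count = gpu_count(2256, 141)
shapes = candidate_shapes(count)
print(count, shapes)
```

The division alone cannot say which of the listed shapes is the deployable one; that information has to come from the profile.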

### How could Modelplane help solve your problem?

Frontier models share a deployment shape. Kimi K2 (1T/32B MoE) needs 16 GPUs across 2 H200 nodes connected by IB-Quantum-2, running TP=8 + PP=2 with expert parallelism enabled, FP8 quantization, 141Gi per GPU. DeepSeek V3 has similar requirements. Llama 405B at full precision spans multiple nodes. Multi-node deployment with structured parallelism and hardware-specific requirements is the production pattern for serious models, not an edge case.

KServe `LLMInferenceService` already supports this. The `parallelism` block expresses tensor and pipeline parallelism structurally. The `workerNodeSize` field tells KServe to compose a multi-pod deployment across nodes via LeaderWorkerSet. Each pod gets its share of GPUs; KServe handles the Ray coordination. We just don't expose any of this through the Modelplane API today.

The API needs three things: a way to express the shape of a replica, a way to declare what hardware capabilities pools have, and a matching algorithm that puts them together correctly.

I think the natural place for shape is a `replicaTopology` block on the serving profile. The block describes one independent serving instance — what hardware it needs, how it shards across nodes, what placement constraints apply. The current `resources.vram` field is removed (subsumed by the topology block).

```yaml
apiVersion: modelplane.ai/v1alpha1
kind: ClusterModel
metadata:
  name: kimi-k2-instruct
spec:
  model:
    name: moonshotai/Kimi-K2-Instruct
  source: HuggingFace
  huggingFace:
    repo: moonshotai/Kimi-K2-Instruct

  environmentSelector:
    matchLabels:
      modelplane.ai/tier: production

  serving:
  - name: vllm-h200-multinode
    replicaTopology:
      nodes: 2
      gpusPerNode: 8
      parallelism:
        tensor: 8
        pipeline: 2
        expert: enabled
      requires:
        memoryPerGpu: "141Gi"
        interconnect: nvlink
        multiNodeBandwidth: ib-400g-or-better
        precisionSupport: ["fp8"]
        nvlinkDomainSize: 8
      placement:
        podsAcrossNodes: required
        replicasAcrossNodes: preferred
    engine:
      name: vLLM
      image: vllm/vllm-openai:v0.8.0
      args:
      - "--tensor-parallel-size=8"
      - "--pipeline-parallel-size=2"
      - "--enable-expert-parallel"
      - "--quantization=fp8"
      - "--max-model-len=65536"
      - "--gpu-memory-utilization=0.90"
      - "--enable-prefix-caching"
      - "--max-num-seqs=256"
      - "--distributed-executor-backend=ray"
```

The `replicaTopology` block has clear sub-fields. `nodes` and `gpusPerNode` define the deployment shape — KServe's `workerNodeSize` comes from `nodes`, and the per-pod GPU resource limit comes from `gpusPerNode`. `parallelism` populates KServe's structured parallelism block; this duplicates information that also appears in engine args, but the duplication is unavoidable because KServe needs structured fields for LeaderWorkerSet composition while vLLM needs the args. The serving profile author keeps them consistent.
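Since the author has to keep the structured fields and the engine args in sync by hand, a validation step could catch drift early. The sketch below is one possible check, not part of the proposal: the function name is hypothetical, and it assumes that `tensor × pipeline` should equal `nodes × gpusPerNode`, which holds for the Kimi K2 example but may not hold for every engine or parallelism scheme.

```python
def check_profile_consistency(topology: dict, engine_args: list[str]) -> list[str]:
    """Return human-readable inconsistencies; an empty list means consistent."""
    problems = []
    par = topology["parallelism"]
    # Parse "--flag=value" args into {"flag": "value"}; bare flags are skipped.
    flags = dict(a.lstrip("-").split("=", 1) for a in engine_args if "=" in a)
    for field, flag in (("tensor", "tensor-parallel-size"),
                        ("pipeline", "pipeline-parallel-size")):
        if flag in flags and int(flags[flag]) != par[field]:
            problems.append(
                f"--{flag}={flags[flag]} but parallelism.{field}={par[field]}")
    # Assumption: every GPU in the replica belongs to exactly one TP x PP slot.
    total_gpus = topology["nodes"] * topology["gpusPerNode"]
    if par["tensor"] * par["pipeline"] != total_gpus:
        problems.append(
            f"tensor x pipeline = {par['tensor'] * par['pipeline']} "
            f"but nodes x gpusPerNode = {total_gpus}")
    return problems

topology = {"nodes": 2, "gpusPerNode": 8,
            "parallelism": {"tensor": 8, "pipeline": 2}}
good_args = ["--tensor-parallel-size=8", "--pipeline-parallel-size=2",
             "--enable-expert-parallel"]
bad_args = ["--tensor-parallel-size=4", "--pipeline-parallel-size=2"]
print(check_profile_consistency(topology, good_args))
print(check_profile_consistency(topology, bad_args))
```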

`requires` is a map of capability constraints. The deploy function matches these against pool capabilities. Numeric constraints use ≥ comparison (`memoryPerGpu: 141Gi` means at least 141Gi). Enum constraints use ordered comparison (`multiNodeBandwidth: ib-400g-or-better`). List constraints use superset comparison (`precisionSupport: [fp8]` means the pool's supported formats must include FP8). These comparisons are simple enough to verify by inspection; v0.1 doesn't try to handle multi-dimensional capability comparison or "or-better" beyond a single ordered enum.
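The three comparison kinds could be sketched as follows. The function names are hypothetical, and `BANDWIDTH_ORDER` is an assumed ordering: the proposal only names `ib-400g` and does not enumerate the other values. Quantity parsing is also simplified to whole `Gi` values.

```python
import re

# Assumed enum ordering, lowest to highest; not defined by the proposal.
BANDWIDTH_ORDER = ["ib-100g", "ib-200g", "ib-400g", "ib-800g"]

def parse_gi(quantity: str) -> int:
    """Parse a quantity such as "141Gi" into an integer number of GiB."""
    match = re.fullmatch(r"(\d+)Gi", quantity)
    if match is None:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(match.group(1))

def satisfies_numeric(required: str, offered: str) -> bool:
    """Numeric constraint: the pool must offer at least the required amount."""
    return parse_gi(offered) >= parse_gi(required)

def satisfies_enum(required: str, offered: str, order: list[str]) -> bool:
    """Ordered enum: "x-or-better" is met by x or anything later in the order."""
    floor = required.removesuffix("-or-better")
    return order.index(offered) >= order.index(floor)

def satisfies_list(required: list[str], offered: list[str]) -> bool:
    """List constraint: the pool's list must be a superset of the requirement."""
    return set(required) <= set(offered)

print(satisfies_numeric("141Gi", "80Gi"))
print(satisfies_enum("ib-400g-or-better", "ib-400g", BANDWIDTH_ORDER))
print(satisfies_list(["fp8"], ["fp16", "bf16", "fp8"]))
```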

`placement` controls pod and replica spread. `podsAcrossNodes: required` is correctness for multi-node replicas — the parallelism doesn't work if both pods land on the same node. `replicasAcrossNodes: preferred` is HA — multiple replicas should land on different nodes if capacity allows, but can pack if needed.
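One way the `required` and `preferred` values might translate into Kubernetes pod anti-affinity is sketched below. The helper name and the default weight are assumptions; the `requiredDuringSchedulingIgnoredDuringExecution` and `preferredDuringSchedulingIgnoredDuringExecution` fields are the standard Kubernetes ones, and the label selector a real implementation would use for each rule is left open here.

```python
def anti_affinity(mode: str, match_labels: dict[str, str], weight: int = 100) -> dict:
    """Build a podAntiAffinity stanza that spreads matching pods across nodes."""
    term = {
        "labelSelector": {"matchLabels": match_labels},
        "topologyKey": "kubernetes.io/hostname",
    }
    if mode == "required":
        # Hard rule: the scheduler refuses to co-locate matching pods.
        return {"podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [term]}}
    if mode == "preferred":
        # Soft rule: spread when capacity allows, pack otherwise.
        return {"podAntiAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {"weight": weight, "podAffinityTerm": term}]}}
    raise ValueError(f"unknown placement mode: {mode!r}")

hard = anti_affinity("required", {"app": "kimi-k2-instruct"})
soft = anti_affinity("preferred", {"app": "kimi-k2-instruct"})
print(hard)
print(soft)
```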

For the InferenceEnvironment side, node pools need a `capabilities` block declaring what their hardware supports. The platform engineer provisioning pools knows this; today's schema doesn't have a place to put it.

```yaml
apiVersion: modelplane.ai/v1alpha1
kind: InferenceEnvironment
metadata:
  name: prod-coreweave-us-east
  labels:
    modelplane.ai/region: us-east
    modelplane.ai/tier: production
spec:
  cluster:
    source: Existing
    existing:
      secretRef:
        name: cw-cluster-kubeconfig
      provider: coreweave
      instanceType: HGX-H200-IB

  nodePools:
  - name: medium-models
    acceleratorType: nvidia-h100-80gb
    acceleratorCount: 8
    nodeCount: 0
    maxNodeCount: 8
    labels:
      modelplane.ai/pool: medium
    capabilities:
      memoryPerGpu: "80Gi"
      interconnect: nvlink
      precisionSupport: ["fp16", "bf16", "fp8"]
      multiNodeCapable: false
      nvlinkDomainSize: 8

  - name: frontier-multinode
    acceleratorType: nvidia-h200-141gb
    acceleratorCount: 8
    nodeCount: 0
    maxNodeCount: 4
    labels:
      modelplane.ai/pool: frontier
    capabilities:
      memoryPerGpu: "141Gi"
      interconnect: nvlink
      precisionSupport: ["fp16", "bf16", "fp8"]
      multiNodeCapable: true
      multiNodeBandwidth: ib-400g
      nvlinkDomainSize: 8
```

The capabilities block is the hardware source of truth: memory per GPU, interconnect type, supported precisions, and multi-node networking. The `multiNodeCapable` field is a discriminator. Pools where it is false or unset fail multi-node profile matching even if they have the right GPU type, because their networking isn't configured for it.

`provider` and `instanceType` on the cluster are infrastructure context. They don't participate in matching at v0.1, but they enable richer matching later (e.g., recipe-based validation that says "this configuration is validated on CoreWeave HGX-H200-IB"). Operators authoring their own InferenceEnvironments can leave these blank.

Profile matching gains capability filtering. The deploy function walks profiles in order and, for each profile, walks pools in order. The first pool whose `capabilities` satisfy the profile's `replicaTopology.requires` and that has capacity for `nodes` nodes of `gpusPerNode` GPUs each wins.

```
For each environment matching ClusterModel.environmentSelector:
  For each profile in ClusterModel.serving:
    For each pool in environment.nodePools:
      If pool.capabilities satisfies profile.replicaTopology.requires:
        If pool has capacity for profile.replicaTopology.nodes nodes
           with profile.replicaTopology.gpusPerNode GPUs each:
          Match found.
          Compose ModelPlacement with this profile and pool.
          Generate LLMInferenceService:
            replicas: from ModelDeployment scaling
            workerNodeSize: profile.replicaTopology.nodes
            parallelism: profile.replicaTopology.parallelism
            template.nodeSelector: pool.labels
            template.affinity: from profile.replicaTopology.placement
            container.args: profile.engine.args
            container.resources.limits[nvidia.com/gpu]: profile.replicaTopology.gpusPerNode
```

The matching algorithm is deterministic given declaration order. Operators control behavior by ordering profiles (priority) and pools (preference). When two pools in the same environment match the same profile, the one declared first wins; v0.1 doesn't try to be clever about cost-aware or capability-conservative selection. v0.2+ can add this.
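The first-match walk for a single environment could look roughly like this. It is a simplified sketch under stated assumptions: `satisfies` covers only `memoryPerGpu` and `precisionSupport`, `has_capacity` is a static upper bound that ignores what is already allocated, and all three function names are hypothetical.

```python
from typing import Optional

def satisfies(capabilities: dict, requires: dict) -> bool:
    """Simplified capability check: memory (>=) and precision (superset) only."""
    need_gi = int(requires.get("memoryPerGpu", "0Gi").removesuffix("Gi"))
    have_gi = int(capabilities.get("memoryPerGpu", "0Gi").removesuffix("Gi"))
    need_fmt = set(requires.get("precisionSupport", []))
    have_fmt = set(capabilities.get("precisionSupport", []))
    return have_gi >= need_gi and need_fmt <= have_fmt

def has_capacity(pool: dict, nodes: int, gpus_per_node: int) -> bool:
    """Static bound only; a real check would subtract current allocations."""
    return pool["maxNodeCount"] >= nodes and pool["acceleratorCount"] >= gpus_per_node

def first_match(profiles: list[dict], pools: list[dict]) -> Optional[tuple[str, str]]:
    """Walk profiles, then pools, in declaration order; return the first fit."""
    for profile in profiles:
        topo = profile["replicaTopology"]
        for pool in pools:
            caps = pool["capabilities"]
            # Multi-node profiles only match pools that declare multiNodeCapable.
            if topo["nodes"] > 1 and not caps.get("multiNodeCapable", False):
                continue
            if satisfies(caps, topo["requires"]) and has_capacity(
                    pool, topo["nodes"], topo["gpusPerNode"]):
                return profile["name"], pool["name"]
    return None

pools = [
    {"name": "medium-models", "acceleratorCount": 8, "maxNodeCount": 8,
     "capabilities": {"memoryPerGpu": "80Gi", "multiNodeCapable": False,
                      "precisionSupport": ["fp16", "bf16", "fp8"]}},
    {"name": "frontier-multinode", "acceleratorCount": 8, "maxNodeCount": 4,
     "capabilities": {"memoryPerGpu": "141Gi", "multiNodeCapable": True,
                      "precisionSupport": ["fp16", "bf16", "fp8"]}},
]
kimi = [{"name": "vllm-h200-multinode",
         "replicaTopology": {"nodes": 2, "gpusPerNode": 8,
                             "requires": {"memoryPerGpu": "141Gi",
                                          "precisionSupport": ["fp8"]}}}]
print(first_match(kimi, pools))
```

Because the walk returns on the first fit, reordering `pools` or `profiles` is the only way to change the outcome, which is the operator control the proposal describes.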

For Kimi K2 specifically, the matching cascade looks like: filter environments by `tier: production` (the production environment matches), walk profiles (only `vllm-h200-multinode`), walk pools (medium-models fails on multiNodeCapable=false; frontier-multinode satisfies all requirements), match found, compose LLMInferenceService with `workerNodeSize: 2`, `parallelism: {tensor: 8, pipeline: 2}`, anti-affinity required across nodes, GPU limit 8 per pod, nodeSelector targeting the frontier pool's labels.

The composed LLMInferenceService:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: kimi-k2-instruct
  namespace: ml-team-research
spec:
  model:
    uri: hf://moonshotai/Kimi-K2-Instruct
  replicas: 1
  workerNodeSize: 2
  parallelism:
    tensor: 8
    pipeline: 2
  template:
    spec:
      nodeSelector:
        modelplane.ai/pool: frontier
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: kimi-k2-instruct
            topologyKey: kubernetes.io/hostname
      containers:
      - name: server
        image: vllm/vllm-openai:v0.8.0
        args:
        - --model=moonshotai/Kimi-K2-Instruct
        - --tensor-parallel-size=8
        - --pipeline-parallel-size=2
        - --enable-expert-parallel
        - --quantization=fp8
        - --max-model-len=65536
        - --gpu-memory-utilization=0.90
        - --enable-prefix-caching
        - --max-num-seqs=256
        - --distributed-executor-backend=ray
        resources:
          limits:
            nvidia.com/gpu: 8
```
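A rough sketch of how the deploy function might assemble the manifest above from a matched profile and pool. The function name and argument layout are hypothetical, field names follow this proposal rather than a confirmed KServe schema, and the affinity stanza is omitted for brevity.

```python
def compose_llm_inference_service(name: str, namespace: str, repo: str,
                                  profile: dict, pool: dict,
                                  replicas: int = 1) -> dict:
    """Build an LLMInferenceService manifest as a plain dict."""
    topo = profile["replicaTopology"]
    par = topo["parallelism"]
    container = {
        "name": "server",
        "image": profile["engine"]["image"],
        # The model flag is prepended; profile args are passed through as-is.
        "args": [f"--model={repo}", *profile["engine"]["args"]],
        "resources": {"limits": {"nvidia.com/gpu": topo["gpusPerNode"]}},
    }
    return {
        "apiVersion": "serving.kserve.io/v1alpha1",
        "kind": "LLMInferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "model": {"uri": f"hf://{repo}"},
            "replicas": replicas,
            "workerNodeSize": topo["nodes"],
            "parallelism": {"tensor": par["tensor"], "pipeline": par["pipeline"]},
            "template": {"spec": {
                "nodeSelector": dict(pool["labels"]),
                "containers": [container],
            }},
        },
    }

profile = {
    "replicaTopology": {"nodes": 2, "gpusPerNode": 8,
                        "parallelism": {"tensor": 8, "pipeline": 2}},
    "engine": {"image": "vllm/vllm-openai:v0.8.0",
               "args": ["--tensor-parallel-size=8", "--pipeline-parallel-size=2"]},
}
pool = {"labels": {"modelplane.ai/pool": "frontier"}}
manifest = compose_llm_inference_service(
    "kimi-k2-instruct", "ml-team-research", "moonshotai/Kimi-K2-Instruct",
    profile, pool)
print(manifest["spec"]["workerNodeSize"])
```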

KServe creates a LeaderWorkerSet with 2 worker pods per replica. The cluster autoscaler provisions H200 nodes from the frontier pool. Pods schedule one per node (anti-affinity required). vLLM starts in each pod, Ray sets up communication over IB, model loads with TP=8 within each node and PP=2 across the two. LLMInferenceService becomes ready.

The same API handles single-node deployments. A Gemma 3 27B profile sets `nodes: 1`, `gpusPerNode: 1`, `requires.memoryPerGpu: 40Gi`, and `requires.precisionSupport: [fp8]`. It matches the medium-models pool (H100, supports FP8, 80Gi per GPU). The deploy function composes an LLMInferenceService with `workerNodeSize: 1` (no LeaderWorkerSet, just a regular Deployment), a GPU limit of 1, and a nodeSelector for the medium pool. Single-node deployments are the special case where multi-node machinery isn't needed; the schema accommodates both uniformly.

This proposal sketches one shape but the design space is meaningfully bigger than what I've laid out, and getting it wrong is expensive — schema decisions made here will be hard to change once operators have ClusterModels in production. This needs a full design that accounts for the different topology patterns and the tradeoffs between them.

