What problem are you facing?
Modelplane can't deploy multi-node models. Kimi K2, DeepSeek V3, Llama 4 Behemoth, and similar frontier models require deployment shapes the current schema can't express — multiple nodes coordinating via high-bandwidth interconnect, structured parallelism strategy, hardware capability requirements (FP8 support, NVLink, IB-400g+). The deploy function divides total VRAM by per-GPU memory to compute GPU count. For Kimi K2 this gives a number but not a deployable configuration. There's no way to tell the scheduler that 16 GPUs need to be 2 nodes of 8 with TP=8 + PP=2, not 1 node of 16 (which doesn't exist) or 16 nodes of 1 (wrong sharding).
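To illustrate why a bare GPU count is not a deployable configuration, here is a minimal sketch that lists every way a count can be split into nodes and GPUs per node. The function name `candidate_shapes` is hypothetical and is not part of Modelplane.

```python
def candidate_shapes(gpu_count: int) -> list[tuple[int, int]]:
    """Return every (nodes, gpus_per_node) pair whose product is gpu_count."""
    return [
        (nodes, gpu_count // nodes)
        for nodes in range(1, gpu_count + 1)
        if gpu_count % nodes == 0
    ]


# A count of 16 admits five shapes; only (2, 8) is the one Kimi K2 needs.
print(candidate_shapes(16))
# [(1, 16), (2, 8), (4, 4), (8, 2), (16, 1)]
```

Nothing in the count itself picks between these shapes, which is the information the schema currently has no place for.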
How could Modelplane help solve your problem?
Frontier models share a deployment shape. Kimi K2 (1T/32B MoE) needs 16 GPUs across 2 H200 nodes connected by IB-Quantum-2, running TP=8 + PP=2 with expert parallelism enabled, FP8 quantization, 141Gi per GPU. DeepSeek V3 has similar requirements. Llama 405B at full precision spans multiple nodes. Multi-node deployment with structured parallelism and hardware-specific requirements is the production pattern for serious models, not an edge case.
KServe LLMInferenceService already supports this. The parallelism block expresses tensor and pipeline parallelism structurally. The workerNodeSize field tells KServe to compose a multi-pod deployment across nodes via LeaderWorkerSet. Each pod gets its share of GPUs; KServe handles the Ray coordination. We just don't expose any of this through the Modelplane API today.
The API needs three things: a way to express the shape of a replica, a way to declare what hardware capabilities pools have, and a matching algorithm that puts them together correctly.
I think the natural place for shape is a replicaTopology block on the serving profile. The block describes one independent serving instance — what hardware it needs, how it shards across nodes, what placement constraints apply. The current resources.vram field is removed (subsumed by the topology block).
apiVersion: modelplane.ai/v1alpha1
kind: ClusterModel
metadata:
  name: kimi-k2-instruct
spec:
  model:
    name: moonshotai/Kimi-K2-Instruct
    source: HuggingFace
    huggingFace:
      repo: moonshotai/Kimi-K2-Instruct
  environmentSelector:
    matchLabels:
      modelplane.ai/tier: production
  serving:
    - name: vllm-h200-multinode
      replicaTopology:
        nodes: 2
        gpusPerNode: 8
        parallelism:
          tensor: 8
          pipeline: 2
          expert: enabled
        requires:
          memoryPerGpu: "141Gi"
          interconnect: nvlink
          multiNodeBandwidth: ib-400g-or-better
          precisionSupport: ["fp8"]
          nvlinkDomainSize: 8
        placement:
          podsAcrossNodes: required
          replicasAcrossNodes: preferred
      engine:
        name: vLLM
        image: vllm/vllm-openai:v0.8.0
        args:
          - "--tensor-parallel-size=8"
          - "--pipeline-parallel-size=2"
          - "--enable-expert-parallel"
          - "--quantization=fp8"
          - "--max-model-len=65536"
          - "--gpu-memory-utilization=0.90"
          - "--enable-prefix-caching"
          - "--max-num-seqs=256"
          - "--distributed-executor-backend=ray"
The replicaTopology block has five sub-fields: nodes, gpusPerNode, parallelism, requires, and placement. nodes and gpusPerNode define the deployment shape: KServe's workerNodeSize comes from nodes, and the per-pod GPU resource limit comes from gpusPerNode. parallelism populates KServe's structured parallelism block. This duplicates information that also appears in the engine args, but the duplication is unavoidable because KServe needs structured fields for LeaderWorkerSet composition while vLLM needs the args. The serving profile author is responsible for keeping the two consistent.
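Since the profile author has to keep the parallelism block and the engine args in sync by hand, a validation step could catch drift early. The following is only a sketch under stated assumptions: the function name `parallelism_mismatches` is hypothetical, missing flags are treated as 1, and the check that tensor × pipeline equals nodes × gpusPerNode is an assumption that holds for the Kimi K2 example (8 × 2 = 2 × 8) but may not hold for every strategy.

```python
def parallelism_mismatches(topology: dict, engine_args: list[str]) -> list[str]:
    """Compare replicaTopology.parallelism with the vLLM flags in engine args."""
    # Parse "--flag=value" and bare "--flag" entries into a dict.
    flags = {}
    for arg in engine_args:
        name, _, value = arg.partition("=")
        flags[name] = value

    parallelism = topology["parallelism"]
    problems = []
    for field, flag in (
        ("tensor", "--tensor-parallel-size"),
        ("pipeline", "--pipeline-parallel-size"),
    ):
        declared = parallelism.get(field, 1)
        passed = int(flags.get(flag) or 1)  # assumption: an absent flag means 1
        if declared != passed:
            problems.append(f"{field}={declared} but {flag}={passed}")

    expert_declared = parallelism.get("expert") == "enabled"
    expert_passed = "--enable-expert-parallel" in flags
    if expert_declared != expert_passed:
        problems.append("expert parallelism differs between topology and args")

    # Assumption: one GPU per (tensor, pipeline) rank, so TP x PP fills the replica.
    world = parallelism.get("tensor", 1) * parallelism.get("pipeline", 1)
    gpus = topology["nodes"] * topology["gpusPerNode"]
    if world != gpus:
        problems.append(f"tensor x pipeline = {world} but replica has {gpus} GPUs")
    return problems
```

An empty list means the two views agree. Whether a mismatch should reject the ClusterModel or only warn is left open here.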
requires is a map of capability constraints. The deploy function matches these against pool capabilities. Numeric constraints use ≥ comparison (memoryPerGpu: 141Gi means at least 141Gi). Enum constraints use ordered comparison (multiNodeBandwidth: ib-400g-or-better means ib-400g or anything ranked above it). List constraints use superset comparison (precisionSupport: [fp8] means the pool must list FP8 among its supported formats). These comparisons are simple enough to verify by inspection; v0.1 doesn't try to handle multi-dimensional capability comparison or "or-better" beyond a single ordered enum.
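The three comparison rules can be sketched as a single predicate. This is an illustration under assumptions, not the deploy function's actual code: the names `satisfies`, `parse_quantity`, and `BANDWIDTH_ORDER` are hypothetical, the bandwidth ranking is invented for the example, and the profile key multiNodeBandwidth is assumed to be compared against the pool key interNodeBandwidth because those are the names the two manifests use. Any other key (such as interconnect) falls back to exact equality.

```python
# Hypothetical ranking, weakest to strongest; the real enum is not defined yet.
BANDWIDTH_ORDER = ["ib-100g", "ib-200g", "ib-400g", "ib-800g"]
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(quantity: str) -> int:
    """Convert a binary-suffixed quantity such as '141Gi' to bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def satisfies(requires: dict, capabilities: dict) -> bool:
    """Return True if a pool's capabilities meet every constraint in requires."""
    for key, wanted in requires.items():
        if key == "memoryPerGpu":  # numeric: pool >= requirement
            have = capabilities.get(key)
            if have is None or parse_quantity(have) < parse_quantity(wanted):
                return False
        elif key == "nvlinkDomainSize":  # numeric: pool >= requirement
            if capabilities.get(key, 0) < wanted:
                return False
        elif key == "multiNodeBandwidth":  # ordered enum
            have = capabilities.get("interNodeBandwidth")
            floor = wanted.removesuffix("-or-better")
            if have not in BANDWIDTH_ORDER or floor not in BANDWIDTH_ORDER:
                return False
            if BANDWIDTH_ORDER.index(have) < BANDWIDTH_ORDER.index(floor):
                return False
        elif key == "precisionSupport":  # list: pool must be a superset
            if not set(wanted) <= set(capabilities.get(key, [])):
                return False
        elif capabilities.get(key) != wanted:  # anything else: exact match
            return False
    return True
```

A pool that omits a capability fails any constraint on it, which keeps the rule easy to check by inspection.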
placement controls pod and replica spread. podsAcrossNodes: required is a correctness constraint for multi-node replicas: the parallelism doesn't work if both pods land on the same node. replicasAcrossNodes: preferred is an availability (HA) preference: multiple replicas should land on different nodes if capacity allows, but can pack onto shared nodes if needed.
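One plausible translation of the required and preferred modes is onto the two Kubernetes pod anti-affinity fields, keyed on kubernetes.io/hostname. The sketch below assumes that mapping; the function name `spread_affinity`, the label selector passed in, and the default weight of 100 are all hypothetical choices.

```python
def spread_affinity(mode: str, match_labels: dict, weight: int = 100) -> dict:
    """Build a pod anti-affinity block that spreads matching pods across nodes."""
    term = {
        "labelSelector": {"matchLabels": match_labels},
        "topologyKey": "kubernetes.io/hostname",
    }
    if mode == "required":
        # Hard rule: the scheduler refuses to co-locate matching pods.
        return {
            "podAntiAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": [term]
            }
        }
    if mode == "preferred":
        # Soft rule: the scheduler tries to spread but may pack if it has to.
        return {
            "podAntiAffinity": {
                "preferredDuringSchedulingIgnoredDuringExecution": [
                    {"weight": weight, "podAffinityTerm": term}
                ]
            }
        }
    raise ValueError(f"unknown placement mode: {mode!r}")
```

In practice podsAcrossNodes and replicasAcrossNodes would need different label selectors (pods within one replica versus pods across replicas); choosing those labels is one of the open design questions.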
For the InferenceEnvironment side, node pools need a capabilities block declaring what their hardware supports. The platform engineer provisioning pools knows this; today's schema doesn't have a place to put it.
apiVersion: modelplane.ai/v1alpha1
kind: InferenceEnvironment
metadata:
  name: prod-coreweave-us-east
  labels:
    modelplane.ai/region: us-east
    modelplane.ai/tier: production
spec:
  cluster:
    source: Existing
    existing:
      secretRef:
        name: cw-cluster-kubeconfig
    provider: coreweave
    instanceType: HGX-H200-IB
  nodePools:
    - name: medium-models
      acceleratorType: nvidia-h100-80gb
      acceleratorCount: 8
      nodeCount: 0
      maxNodeCount: 8
      labels:
        modelplane.ai/pool: medium
      capabilities:
        memoryPerGpu: "80Gi"
        interconnect: nvlink
        precisionSupport: ["fp16", "bf16", "fp8"]
        multiNodeCapable: false
        nvlinkDomainSize: 8
    - name: frontier-multinode
      acceleratorType: nvidia-h200-141gb
      acceleratorCount: 8
      nodeCount: 0
      maxNodeCount: 4
      labels:
        modelplane.ai/pool: frontier
      capabilities:
        memoryPerGpu: "141Gi"
        interconnect: nvlink
        precisionSupport: ["fp16", "bf16", "fp8"]
        multiNodeCapable: true
        interNodeBandwidth: ib-400g
        nvlinkDomainSize: 8
The capabilities block is the hardware truth-teller: per-GPU memory, interconnect type, supported precisions, and multi-node networking. The multiNodeCapable field is a discriminator: pools where it is false or unset fail multi-node profile matching even if they have the right GPU type, because their networking isn't configured for it.
provider and instanceType on the cluster are infrastructure context. They don't participate in matching in v0.1, but they enable richer matching later (e.g., recipe-based validation that says "this configuration is validated on CoreWeave HGX-H200-IB"). Operators authoring their own InferenceEnvironments can leave these blank.
Profile matching gains capability filtering. The deploy function walks profiles in order and, for each profile, walks pools in order. The first pool whose capabilities satisfy the profile's replicaTopology.requires and that has capacity for the replica (nodes nodes with gpusPerNode GPUs each, so nodes × gpusPerNode GPUs in total) wins.
For each environment matching ClusterModel.environmentSelector:
  For each profile in ClusterModel.serving:
    For each pool in environment.nodePools:
      If pool.capabilities satisfies profile.replicaTopology.requires:
        If pool has capacity for profile.replicaTopology.nodes nodes
           with profile.replicaTopology.gpusPerNode GPUs each:
          Match found.
          Compose ModelPlacement with this profile and pool.
          Generate LLMInferenceService:
            replicas: from ModelDeployment scaling
            workerNodeSize: profile.replicaTopology.nodes
            parallelism: profile.replicaTopology.parallelism
            template.nodeSelector: pool.labels
            template.affinity: from profile.replicaTopology.placement
            container.args: profile.engine.args
            container.resources.limits[nvidia.com/gpu]: profile.replicaTopology.gpusPerNode
The matching algorithm is deterministic given declaration order. Operators control behavior by ordering profiles (priority) and pools (preference). When two pools in the same environment match the same profile, pool declaration order breaks the tie; v0.1 doesn't try to be clever about cost-aware or capability-conservative selection. v0.2+ can add this.
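The first-match walk can be sketched in a few lines. Several things here are assumptions for illustration only: the function name `find_match`, the flattened dict shapes (labels and nodePools at the top level of each environment), the caller-supplied `satisfies` predicate, the rule that a profile with nodes > 1 needs multiNodeCapable: true, and a capacity check that looks only at acceleratorCount and maxNodeCount without accounting for nodes already in use.

```python
from typing import Callable, Optional


def find_match(
    environments: list[dict],
    model: dict,
    satisfies: Callable[[dict, dict], bool],
) -> Optional[tuple[str, str, str]]:
    """Return (environment, profile, pool) names for the first match, else None."""
    selector = model["environmentSelector"]["matchLabels"]
    for env in environments:
        labels = env.get("labels", {})
        if any(labels.get(key) != value for key, value in selector.items()):
            continue  # environment does not match the selector
        for profile in model["serving"]:  # declaration order = priority
            topo = profile["replicaTopology"]
            for pool in env["nodePools"]:  # declaration order = preference
                caps = pool.get("capabilities", {})
                # Assumption: multi-node profiles need a multi-node-capable pool.
                if topo["nodes"] > 1 and not caps.get("multiNodeCapable", False):
                    continue
                if not satisfies(topo["requires"], caps):
                    continue
                # Assumption: capacity is judged from the pool's declared limits.
                if pool["acceleratorCount"] < topo["gpusPerNode"]:
                    continue
                if pool["maxNodeCount"] < topo["nodes"]:
                    continue
                return env["name"], profile["name"], pool["name"]
    return None
```

Because the function returns on the first hit, reordering profiles or pools in the manifests is the only lever that changes the outcome, which matches the determinism described above.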
For Kimi K2 specifically, the matching cascade runs as follows. Filtering environments by tier: production selects the production environment. The only profile to walk is vllm-h200-multinode. Walking pools, medium-models fails on multiNodeCapable=false (it would also fail memoryPerGpu, at 80Gi against a required 141Gi), and frontier-multinode satisfies all requirements, so the match is found. The deploy function then composes an LLMInferenceService with workerNodeSize: 2, parallelism: {tensor: 8, pipeline: 2}, required anti-affinity across nodes, a GPU limit of 8 per pod, and a nodeSelector targeting the frontier pool's labels.
The composed LLMInferenceService:
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: kimi-k2-instruct
  namespace: ml-team-research
spec:
  model:
    uri: hf://moonshotai/Kimi-K2-Instruct
  replicas: 1
  workerNodeSize: 2
  parallelism:
    tensor: 8
    pipeline: 2
  template:
    spec:
      nodeSelector:
        modelplane.ai/pool: frontier
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: kimi-k2-instruct
              topologyKey: kubernetes.io/hostname
      containers:
        - name: server
          image: vllm/vllm-openai:v0.8.0
          args:
            - --model=moonshotai/Kimi-K2-Instruct
            - --tensor-parallel-size=8
            - --pipeline-parallel-size=2
            - --enable-expert-parallel
            - --quantization=fp8
            - --max-model-len=65536
            - --gpu-memory-utilization=0.90
            - --enable-prefix-caching
            - --max-num-seqs=256
            - --distributed-executor-backend=ray
          resources:
            limits:
              nvidia.com/gpu: 8
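A resource like the one above could be assembled from the matched profile and pool roughly as follows. This is a sketch under assumptions, not the deploy function itself: the name `compose_llm_inference_service` and its parameters are hypothetical, only the tensor and pipeline keys are copied into the parallelism block (mirroring the manifest above), the --model argument is derived from the Hugging Face repo, and the affinity block is left out for brevity.

```python
def compose_llm_inference_service(
    name: str,
    namespace: str,
    model_repo: str,
    replicas: int,
    profile: dict,
    pool: dict,
) -> dict:
    """Map a matched serving profile and node pool onto an LLMInferenceService."""
    topo = profile["replicaTopology"]
    engine = profile["engine"]
    # Copy only the keys that appear in the composed manifest's parallelism block.
    parallelism = {
        key: topo["parallelism"][key]
        for key in ("tensor", "pipeline")
        if key in topo["parallelism"]
    }
    return {
        "apiVersion": "serving.kserve.io/v1alpha1",
        "kind": "LLMInferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "model": {"uri": f"hf://{model_repo}"},
            "replicas": replicas,  # from ModelDeployment scaling
            "workerNodeSize": topo["nodes"],
            "parallelism": parallelism,
            "template": {
                "spec": {
                    "nodeSelector": dict(pool["labels"]),
                    "containers": [
                        {
                            "name": "server",
                            "image": engine["image"],
                            "args": [f"--model={model_repo}", *engine["args"]],
                            "resources": {
                                "limits": {"nvidia.com/gpu": topo["gpusPerNode"]}
                            },
                        }
                    ],
                }
            },
        },
    }
```

Keeping the mapping a pure function of profile and pool makes the generated manifest easy to diff against a hand-written one when validating a new profile.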
KServe creates a LeaderWorkerSet with 2 pods per replica. The cluster autoscaler provisions H200 nodes from the frontier pool. Pods schedule one per node (anti-affinity required). vLLM starts in each pod, Ray sets up communication over IB, and the model loads with TP=8 within each node and PP=2 across the two. The LLMInferenceService then becomes ready.
The same API handles single-node deployments. A Gemma 3 27B profile sets nodes: 1, gpusPerNode: 1, requires.memoryPerGpu: 40Gi, requires.precisionSupport: [fp8]. It matches the medium-models pool (H100, supports FP8, 80Gi per GPU). The deploy function composes an LLMInferenceService with workerNodeSize: 1 (no LeaderWorkerSet, just a regular Deployment), a GPU limit of 1, and a nodeSelector for the medium pool. Single-node deployments are the special case where multi-node machinery isn't needed; the schema accommodates both uniformly.
This proposal sketches one shape but the design space is meaningfully bigger than what I've laid out, and getting it wrong is expensive — schema decisions made here will be hard to change once operators have ClusterModels in production. This needs a full design that accounts for the different topology patterns and the tradeoffs between them.