Topology, capabilities, and the design problem of supporting advanced models #55

Description

@bassam

What problem are you facing?

Modelplane today can deploy a dense model that fits on a single GPU. The moment we try to support frontier-class models — Kimi K2 (1T MoE), DeepSeek V3, Llama 4 Behemoth, Mixtral variants at scale — the schema breaks down in ways that aren't fixable by adding fields incrementally.

The deploy function divides total VRAM by per-GPU memory to compute a GPU count. For Kimi K2 this yields a number (16 GPUs) but not a deployable configuration. There's no way to express that those 16 GPUs must be 2 nodes of 8 with TP=8 and PP=2, that the pods must land on different physical nodes for the parallelism to work, that the inter-node fabric has to support at least 400Gbps to keep latency reasonable, or that the model needs FP8 hardware support (a Hopper or newer Tensor Core). None of these requirements are expressible today.
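
To make the gap concrete, here is a minimal Python sketch of that division. The function names, the 2200 GiB total, and the 141 GiB per-GPU figure are illustrative assumptions, chosen only so the arithmetic lands on the 16 GPUs mentioned above; they are not taken from the Modelplane code or from measured Kimi K2 requirements.

```python
import math


def naive_gpu_count(total_vram_gib: float, per_gpu_gib: float) -> int:
    """Hypothetical stand-in for the current calculation: a single division."""
    return math.ceil(total_vram_gib / per_gpu_gib)


def candidate_shapes(gpu_count: int, max_gpus_per_node: int = 8) -> list[tuple[int, int]]:
    """List every (nodes, gpus_per_node) split that adds up to gpu_count."""
    return [
        (gpu_count // per_node, per_node)
        for per_node in range(1, max_gpus_per_node + 1)
        if gpu_count % per_node == 0
    ]


count = naive_gpu_count(total_vram_gib=2200, per_gpu_gib=141)  # assumed inputs
shapes = candidate_shapes(count)
print(count, shapes)
```

The scalar is compatible with several node layouts before parallelism, precision, or fabric are even considered, which is why the count alone is not deployable.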

Same problem in a different shape for smaller models. Gemma 3 27B has at least three valid deployment configurations — FP8 on a single H100, BF16 on 2x H100 with TP=2, INT4 on a single 24GB consumer GPU. Today's schema can express only one configuration per ClusterModel. There's no mechanism for capability-based selection ("prefer FP8 if Hopper is available, BF16 if not, INT4 for development").
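
As a rough sketch of what capability-based selection could look like, the Python below walks an ordered list of variants and returns the first one a pool can host. The `Variant` and `Pool` types, their field names, the memory floors, and the three example pools are all hypothetical; nothing here reflects an existing Modelplane API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    precision: str        # precision the pool's GPUs must support
    gpus: int             # GPUs needed per replica
    min_gpu_mem_gib: int  # assumed per-GPU memory floor


@dataclass(frozen=True)
class Pool:
    name: str
    precisions: frozenset
    gpus_per_node: int
    gpu_mem_gib: int


# Ordered by preference, mirroring the three Gemma 3 27B configurations above.
GEMMA_VARIANTS = [
    Variant("fp8-1xh100", "fp8", gpus=1, min_gpu_mem_gib=80),
    Variant("bf16-2xh100-tp2", "bf16", gpus=2, min_gpu_mem_gib=80),
    Variant("int4-dev", "int4", gpus=1, min_gpu_mem_gib=24),
]


def select_variant(variants, pool):
    """Return the first variant, in preference order, that the pool satisfies."""
    for v in variants:
        if (v.precision in pool.precisions
                and v.gpus <= pool.gpus_per_node
                and v.min_gpu_mem_gib <= pool.gpu_mem_gib):
            return v
    return None


hopper = Pool("hopper-pool", frozenset({"fp8", "bf16", "int4"}), 8, 80)
pre_hopper = Pool("pre-hopper-pool", frozenset({"bf16", "int4"}), 8, 80)
dev = Pool("dev-workstation", frozenset({"bf16", "int4"}), 1, 24)

for pool in (hopper, pre_hopper, dev):
    chosen = select_variant(GEMMA_VARIANTS, pool)
    print(pool.name, "->", chosen.name if chosen else None)
```

Even this toy version needs several variants attached to one model, which a one-configuration-per-ClusterModel schema cannot hold.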

We need to extend Modelplane to handle these cases. The question is how.

How could Modelplane help solve your problem?

This isn't a single design proposal — it's a design problem with several plausible shapes, each with real tradeoffs. The community needs to work through the alternatives before settling on an approach.

The deployment knowledge that needs to be captured includes:

Replica shape. Number of nodes per replica, GPUs per node, parallelism strategy (tensor, pipeline, expert), placement constraints (pods across nodes for multi-node correctness, replicas across nodes for HA).

Hardware capability requirements. Memory per GPU, interconnect type (NVLink, PCIe), inter-node bandwidth (IB-400g, RoCE, etc.), precision support (FP8 specifically, with vendor-specific format details), NVLink domain size for rack-scale configurations.

Engine configuration. Image, version requirements, args. Different engines (vLLM, SGLang, NVIDIA Dynamo, TensorRT-LLM) have different argument shapes. Different engine versions support different features.

Variant selection. Multiple valid configurations per model. FP8 vs BF16. Single-node vs multi-node. Quantized vs full precision. Capability-based fallback when the preferred configuration isn't available.

Infrastructure provider context. Where the cluster runs matters. Coreweave HGX-H200-IB has different operational characteristics than AWS p5 with EFA, which differs from GCP A3 with GPUDirect-TCPX, which differs from a DGX with NVLink-Switch. Some configurations work on some providers and not others.

There are several architectural shapes that could accommodate this. Each has different tradeoffs.

Option 1: Native schema with structured topology and capabilities

Add replicaTopology to serving profiles and capabilities to node pools. The deploy function matches profile requirements against pool capabilities directly. Operators author full configurations in YAML.

serving:
- name: vllm-h200-multinode
  replicaTopology:
    nodes: 2
    gpusPerNode: 8
    parallelism: { tensor: 8, pipeline: 2, expert: enabled }
    requires:
      memoryPerGpu: 141Gi
      interconnect: nvlink
      multiNodeBandwidth: ib-400g-or-better
      precisionSupport: [fp8]
  engine:
    name: vLLM
    image: vllm/vllm-openai:v0.8.0
    args: [...]
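
One thing a structured topology makes possible is checking a profile for internal consistency before anything is scheduled. The Python sketch below is an assumption-laden illustration: the `ReplicaTopology` class and its `problems` method are hypothetical, it assumes every GPU in a replica holds exactly one tensor-by-pipeline rank (no data parallelism inside a replica, expert parallelism ignored), and it treats a tensor-parallel group that spans nodes as something to flag rather than as a hard rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaTopology:
    nodes: int
    gpus_per_node: int
    tensor: int
    pipeline: int

    def problems(self) -> list[str]:
        """Return human-readable inconsistencies; an empty list means none found."""
        found = []
        total_gpus = self.nodes * self.gpus_per_node
        ranks = self.tensor * self.pipeline
        if ranks != total_gpus:
            found.append(
                f"tensor x pipeline = {ranks}, but the replica has {total_gpus} GPUs"
            )
        if self.tensor > self.gpus_per_node:
            found.append(
                f"tensor={self.tensor} exceeds gpusPerNode={self.gpus_per_node}, "
                "so a tensor-parallel group would span nodes"
            )
        return found


kimi = ReplicaTopology(nodes=2, gpus_per_node=8, tensor=8, pipeline=2)
print(kimi.problems())
```

The profile above passes both checks: 8 x 2 ranks fill 2 x 8 GPUs, and each tensor-parallel group fits inside one node.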

Tradeoffs:

Pro: Modelplane owns the abstraction. No external dependencies. Custom and fine-tuned models work the same as catalog models. Schema evolution is under our control.

Con: The capability vocabulary has to be designed by us. NVLink, IB-400g, FP8 Tensor Core support — all of this is NVIDIA-shaped. AMD, Intel, future hardware require schema redesign. Capability comparison semantics (is interconnect: nvlink exact-match or "at least"?) need careful design. Operators authoring profiles need deep knowledge of model deployment patterns. Validation evidence isn't in the schema — operators claim hardware works without proof. The maintenance burden of keeping the capability vocabulary current (new precision formats, new fabrics, new accelerators) falls entirely on Modelplane.

Option 2: Consume external recipes (vLLM recipes, HuggingFace metadata)

Reference recipes from external sources rather than designing our own deployment knowledge schema.

vLLM recipes (recipes.vllm.ai, github.com/vllm-project/recipes) is community-maintained, Apache 2.0, structured YAML at models/<hf_org>/<hf_repo>.yaml, with about 90 models covered. Active development with daily PRs. Includes hardware compatibility, parallelism strategies, validated configurations across NVIDIA H100/H200/B200 and AMD MI300X/MI355X. Has a JSON API at recipes.vllm.ai/models.json for programmatic consumption.

HuggingFace model cards have YAML frontmatter (language, license, base_model, pipeline_tag) plus structured config.json with architecture details. Safetensors headers expose parameter counts and dtypes programmatically. Not deployment-focused — no parallelism strategy, no hardware requirements — but enough to derive baseline VRAM estimates.
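
A baseline estimate of that kind is simple arithmetic: parameter count times bytes per parameter. The Python sketch below is a weights-only lower bound; it ignores KV cache, activations, and runtime overhead, and the nominal 27e9 parameter count for a "27B" model is an assumption rather than a value read from any safetensors header.

```python
# Bytes per parameter for a few common storage dtypes.
BYTES_PER_PARAM = {"bf16": 2.0, "fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(param_count: float, dtype: str) -> float:
    """Weights-only footprint in decimal GB (a lower bound, not a sizing rule)."""
    return param_count * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("bf16", "fp8", "int4"):
    print(dtype, weight_memory_gb(27e9, dtype))
```

For the assumed 27e9 parameters this gives 54 GB in BF16, 27 GB in FP8, and 13.5 GB in INT4, which is a floor for sizing but says nothing about parallelism or hardware requirements.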

A ClusterModel could reference a vLLM recipe directly:

spec:
  recipe:
    source: vllm-recipes
    path: moonshotai/Kimi-K2-Instruct
    strategy: tp-multi-node

Tradeoffs:

Pro: We don't reinvent. Existing community catalog with hundreds of contributors. Recipes get updated when models, engines, or hardware change. AMD support is real (multiple recipes have AMD configurations). Infrastructure validation already happening in some recipes. Faster path to working catalog content.

Con: Engine convergence isn't actually happening — vLLM is dominant but SGLang, NVIDIA Dynamo, and TensorRT-LLM all have different recipe ecosystems. Tying Modelplane to vLLM recipes ties us to vLLM specifically. Recipes don't capture infrastructure provider differences (Coreweave vs AWS vs DGX) — that's outside their scope. Recipes don't capture federation behavior (anti-affinity, prefix-cache routing, KV cache strategy) — that has to live somewhere else. We become consumers of an external project's schema decisions; if they pivot, we have a coupling problem. Recipe schema is still evolving.

Option 3: Native schema with optional recipe overlay (design-time tooling)

The base API is fully expressive (Option 1). Recipes are a separate, optional layer — pre-authored ClusterModel and InferenceEnvironment YAML files in a Git repo that operators copy and customize.

modelplane-catalog/
  models/
    moonshotai/kimi-k2-instruct/
      cluster-model.yaml
      environment-examples/
        coreweave-hgx-h200.yaml
        nvidia-dgx-h200.yaml
      README.md

The Modelplane controller never sees a Recipe resource. There's no recipe CRD, no resolution logic. Operators apply YAML; the controller reconciles. Recipes are convenience, not foundation. Multiple catalog sources can coexist (curated catalogs, vLLM's catalog re-published in Modelplane format, infrastructure-provider catalogs, private internal catalogs).

Tradeoffs:

Pro: Architectural cleanliness. No coupling to any external project. Engine independence — vLLM, SGLang, Dynamo all just go in the engine block. Custom models first-class — fine-tuned models work the same as catalog models. Updates are operator-controlled (no surprise breakage from upstream changes). Multiple curators can publish catalogs independently.

Con: Base API has to be fully expressive (Option 1's costs apply). Federation behavior gets specified in every ClusterModel rather than inherited from a recipe, which adds authoring burden when operators write ClusterModels directly. Validation evidence still has to live somewhere, probably in catalog README files as prose rather than in schema fields. Catalog maintenance is a real engineering and infrastructure cost; someone has to do it. The initial catalog has to be authored rather than consumed from an existing one.

Option 4: Named architecture tokens

Rather than enumerating capability fields, define named architecture tokens that abstract the operational reality. A profile says it needs medium-dense-fp8 or frontier-moe-multinode; a pool says it supports medium-dense-fp8 and small-dense-bf16. The matching is set membership.

serving:
- name: production
  architecture: frontier-moe-multinode
  engine:
    name: vLLM
    args: [...]

# elsewhere
nodePools:
- name: frontier-multinode
  supportedArchitectures: [frontier-moe-multinode, large-dense-fp8]

The architecture catalog (community-maintained) defines what each token means in terms of actual hardware capability.
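
With tokens, matching reduces to a membership test. A minimal Python sketch, using the `supportedArchitectures` field from the YAML above; the second pool and the function name are hypothetical.

```python
def pools_supporting(architecture: str, node_pools: list[dict]) -> list[str]:
    """Return names of pools whose supportedArchitectures contains the token."""
    return [
        pool["name"]
        for pool in node_pools
        if architecture in pool["supportedArchitectures"]
    ]


NODE_POOLS = [
    {"name": "frontier-multinode",
     "supportedArchitectures": ["frontier-moe-multinode", "large-dense-fp8"]},
    {"name": "general",  # hypothetical second pool
     "supportedArchitectures": ["medium-dense-fp8", "small-dense-bf16"]},
]

print(pools_supporting("frontier-moe-multinode", NODE_POOLS))
print(pools_supporting("medium-dense-fp8", NODE_POOLS))
```

The matcher stays trivial; all the difficulty moves into deciding what each token promises.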

Tradeoffs:

Pro: Vendor differences are hidden inside the architecture catalog rather than in every pool and profile. AMD pools and NVIDIA pools can both declare supportedArchitectures: [medium-dense-fp8] with the catalog handling the underlying hardware comparison. Capability vocabulary doesn't accumulate vendor-specific values. Adding new hardware is a catalog update, not a schema change.

Con: Token vocabulary becomes a chokepoint — someone has to maintain the architecture catalog. Operational reality leaks somewhere — the architecture definition itself has to encode hardware specifics. Custom hardware or unusual configurations don't fit named architectures cleanly. Token granularity is hard to get right (too few and they don't differentiate; too many and they become enumeration of every variant).

Some specific considerations the design needs to address

Beyond the architectural shape, several specific concerns affect the design:

Vendor scope. The capability vocabulary is NVIDIA-shaped today. AMD's Infinity Fabric isn't NVLink. AMD's FP8 format isn't NVIDIA's. Adding AMD by extending enums produces a schema that pretends to be vendor-neutral while leaking NVIDIA assumptions. Should the schema scope explicitly to NVIDIA initially with multi-vendor as a later redesign? Or design the abstraction to handle multi-vendor from the start?

Capability comparison semantics. Numeric ≥ comparison is unambiguous. Enum comparison is harder. Multi-dimensional comparison (interconnect AND bandwidth AND latency) doesn't reduce to a single ordering. Choices range from exact-match enums to documented ordinal scales to full constraint satisfaction.
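
The lack of a single ordering is easy to show. In the Python sketch below, the ordinal ranking of interconnects (pcie below nvlink) and the `Fabric` type are hypothetical; the point is only that an "at least as good on every dimension" check can leave two pools incomparable.

```python
from dataclasses import dataclass

# Hypothetical ordinal scale for one enum dimension; the ordering is a design choice.
INTERCONNECT_RANK = {"pcie": 0, "nvlink": 1}


@dataclass(frozen=True)
class Fabric:
    interconnect: str
    inter_node_gbps: int


def at_least(offered: Fabric, required: Fabric) -> bool:
    """True only if `offered` meets or exceeds `required` on every dimension."""
    return (
        INTERCONNECT_RANK[offered.interconnect] >= INTERCONNECT_RANK[required.interconnect]
        and offered.inter_node_gbps >= required.inter_node_gbps
    )


pool_a = Fabric("nvlink", 400)
pool_b = Fabric("pcie", 800)
print(at_least(pool_a, pool_b), at_least(pool_b, pool_a))  # False False
```

Neither pool dominates the other, so "pick the best pool" is undefined without an explicit tie-breaking policy, even though "does this pool satisfy this requirement" remains well defined.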

Profile selection failure modes. When no profile matches, should the deploy function fail visibly or fall back to a less-preferred configuration? Silent fallback from FP8 to BF16, or from multi-node to single-node, would be a surprising performance cliff. Visible failure forces operators to address gaps explicitly.
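
One possible middle ground is to fail by default and make fallback an explicit opt-in. The Python sketch below illustrates that policy; the exception, the function, the `allow_fallback` flag, and the flat set-of-tokens capability model are all hypothetical simplifications.

```python
class NoMatchingProfile(Exception):
    """Raised instead of silently downgrading to a less-preferred profile."""


def select_profile(profiles, capabilities, allow_fallback=False):
    """Pick a profile by preference order, refusing silent fallback by default.

    profiles: list of {"name": str, "requires": set of tokens}, most preferred first.
    capabilities: set of capability tokens the target pool offers.
    """
    rejected = {}
    for index, profile in enumerate(profiles):
        missing = sorted(profile["requires"] - capabilities)
        if missing:
            rejected[profile["name"]] = missing
            continue
        if index > 0 and not allow_fallback:
            raise NoMatchingProfile(
                f"{profile['name']} would work, but preferred profiles were "
                f"rejected (missing: {rejected}); set allow_fallback=True to accept it"
            )
        return profile["name"]
    raise NoMatchingProfile(f"no profile matches; missing capabilities: {rejected}")


PROFILES = [
    {"name": "fp8", "requires": {"fp8"}},
    {"name": "bf16", "requires": {"bf16"}},
]

print(select_profile(PROFILES, {"fp8", "bf16"}))
print(select_profile(PROFILES, {"bf16"}, allow_fallback=True))
```

Calling `select_profile(PROFILES, {"bf16"})` without the flag raises, and the message names the capability that was missing, so the gap is visible rather than discovered later as a performance cliff.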

Infrastructure provider context. Two H200 environments aren't equivalent if one has IB-400g networking and one doesn't. How explicit should the schema be about provider context (Coreweave vs AWS vs DGX)? Does this go in InferenceEnvironment as fields, in a recipe catalog as documented validation, or both?

Federation behavior. Anti-affinity rules, prefix-cache-aware routing, KV cache transfer for disaggregated serving, fleet-wide cache coordination — these are real federation concerns that affect performance significantly. They don't belong in vLLM recipes (not engine concerns). They might belong as native schema fields, as recipe overlays, or as separate resources.

Engine plurality. vLLM is dominant but the world is multi-engine. The schema should accommodate engine choice without privileging vLLM specifically. Engine version compatibility, args shape, image conventions — all of these vary. How much of this is in the base schema vs in engine-specific extensions vs in recipes?

What this issue is for

This issue captures the design problem and the tradeoff space. It's not a proposal for a specific schema. The next step is a design document that:

  • Picks one of the architectural shapes (or proposes a hybrid) with clear reasoning about why
  • Works through the specific schema implications including the questions above
  • Considers concrete scenarios — Kimi K2, Gemma 3 27B with multiple variants, MIG-partitioned environments, disaggregated prefill/decode (separate issue but related), MoE with expert parallelism, rack-scale NVL72 — and validates the schema handles them
  • References vLLM recipes and HuggingFace metadata as inputs to consider, even if we don't directly consume them

Pointers worth reading before writing the design:

Discussion welcome on any of the options above, or on alternatives not covered here.

Related issues: #53, #52, #34

Labels: Scheduling, enhancement
