What problem are you facing?
Today's InferenceEnvironment node pool spec mixes hardware capabilities and provisioning details in ways that make matching brittle and don't align with where Kubernetes is heading for accelerator management.
We've been discussing how to express hardware capabilities for matching against ClusterModel requirements (see #55). The core question is what vocabulary and structure to use. Designing this from scratch with our own enums (interconnect: nvlink, precisionSupport: [fp8], etc.) means inventing a parallel taxonomy that we have to maintain ourselves, that diverges from vendor terminology, and that doesn't extend cleanly to multi-vendor scenarios.
There's likely a better answer available: Dynamic Resource Allocation (DRA) reached stable in Kubernetes 1.34 and is now the cornerstone of the Kubernetes AI Conformance program. Major drivers exist for NVIDIA (gpu.nvidia.com), AMD (gpu.amd.com), Google TPUs, and networking (DRANET). Vendor attribute schemas are stable and standardized within each vendor. Cross-vendor topology attributes (resource.kubernetes.io/pcieRoot, NUMA, etc.) are being standardized through KEP-4381, KEP-5316, and KEP-5491 by WG-Device-Management.
This design sketch explores adopting DRA conventions directly rather than inventing parallel ones.
There's also a second concern. Node pool spec serves two purposes today, and they're conflated in a way that makes neither clean:
- Matching — ClusterModel serving profiles match against pool capabilities. The deploy function decides which pool can host a deployment by comparing requirements to capabilities.
- Provisioning — for Modelplane-provisioned environments, the node pool spec needs enough information to actually create the pool in the underlying CSP (GKE, EKS, etc.).
Cloud providers are evolving toward separating these concerns. GKE's ComputeClass is the provisioning-side abstraction; ResourceSlice (DRA) is the matching-side abstraction. EKS Karpenter's NodePool is similar to ComputeClass. We should follow the same separation rather than fight against it.
How could Modelplane help solve your problem?
Restructure the InferenceEnvironment node pool spec to separate provisioning from matching, align matching attributes with DRA conventions, and structure the matching surface as three distinct levels — environment, node, and device.
Three levels of attributes
Different match concerns belong at different levels. Conflating them into a flat space loses the semantic structure that makes federation matching tractable.
Environment-level attributes describe the InferenceEnvironment as a whole. They don't vary across nodes or devices. Examples: region, cloud provider, operational tier, compliance posture, network access, data residency.
Node-level attributes describe nodes within a pool. They're uniform across nodes in a single pool but vary across pools (which is why pools exist as separate resources). Examples: instance type, networking fabric, pricing model, failure domain.
Device-level attributes describe individual GPUs or accelerator devices. They follow DRA's vendor-namespaced conventions. Examples: architecture, memory, precision support, compute capability.
The API reflects this:
apiVersion: modelplane.ai/v1alpha1
kind: InferenceEnvironment
metadata:
  name: prod-coreweave-us-east
spec:
  cluster:
    source: Existing
    existing:
      secretRef:
        name: cw-cluster-kubeconfig
  # Environment-level attributes
  attributes:
    "modelplane.ai/region":
      string: "us-east"
    "modelplane.ai/provider":
      string: "coreweave"
    "modelplane.ai/tier":
      string: "production"
    "modelplane.ai/compliance":
      list:
        string: ["soc2-type2", "iso-27001"]
    "modelplane.ai/networkAccess":
      string: "internet-routable"
  nodePools:
    - name: frontier-multinode
      # Provisioning section: used to create the node pool
      # CSP-namespaced because each cloud has different provisioning primitives
      # Optional for BYO environments where the pool already exists
      provisioning:
        gke:
          machineType: a3-megagpu-8g
          gpu:
            type: nvidia-h200-141gb
            count: 8
          networking:
            acceleratorNetworkProfile: auto
          pricing: on-demand
          autoscaling:
            min: 0
            max: 4
      # Node-level attributes: uniform across nodes in this pool
      nodeAttributes:
        "node.kubernetes.io/instance-type":
          string: "a3-megagpu-8g"
        "modelplane.ai/pricing":
          string: "on-demand"
        "modelplane.ai/networking":
          string: "ib-quantum-2"
        "modelplane.ai/failureDomain":
          string: "us-east1-a"
      # Device-level attributes: follow DRA vendor conventions
      deviceAttributes:
        "gpu.nvidia.com/architecture":
          string: "Hopper"
        "gpu.nvidia.com/productName":
          string: "NVIDIA H200 141GB HBM3e"
        "gpu.nvidia.com/cudaComputeCapability":
          version: "9.0.0"
        "gpu.nvidia.com/memory":
          quantity: "141Gi"
        "gpu.nvidia.com/perDeviceCount":
          int: 8
        # Cross-vendor topology
        "resource.kubernetes.io/pcieRoot":
          list:
            string: ["pci0000:00", "pci0000:80"]
The two pool sections (provisioning and attributes) and the three-level attribute structure (environment, node, device) serve different purposes:
Provisioning is a one-time concern. It tells Modelplane (or operators using gcloud/eksctl/etc.) what infrastructure to create. It is CSP-specific because creating a node pool works differently on GKE, EKS, and CoreWeave.
Environment attributes describe environment-wide properties. The federation match (which environment to deploy to) reads from these.
Node attributes describe pool-specific properties that aren't device-specific. The placement match (which pool within a chosen environment) reads from these.
Device attributes describe per-GPU properties using DRA conventions. The capacity match (how many devices, with what hardware features) reads from these.
How matching uses these levels
ClusterModel serving profiles have three corresponding claims, evaluated as a cascade:
serving:
  - name: vllm-h200-multinode
    # Environment-level requirements (filters environments)
    environmentClaim:
      selector:
        cel: |
          environment.attributes["modelplane.ai/tier"] == "production" &&
          "soc2-type2" in environment.attributes["modelplane.ai/compliance"]
    # Node-level requirements (filters pools within matched environments)
    nodeClaim:
      selector:
        cel: |
          node.attributes["modelplane.ai/networking"].startsWith("ib-")
    # Device-level requirements (matches devices within matched pools)
    deviceClaim:
      requests:
        - name: gpus
          count: 16
          perNode: 8
          selector:
            cel: |
              device.attributes["gpu.nvidia.com/architecture"] == "Hopper" &&
              device.capacity["gpu.nvidia.com/memory"].compareTo(quantity("141Gi")) >= 0
      constraints:
        - matchAttribute: "node"
    parallelism:
      tensor: 8
      pipeline: 2
    engine:
      name: vLLM
      image: vllm/vllm-openai:v0.8.0
      args: [...]
The matching cascade:
- Filter environments by environmentClaim. Walk all InferenceEnvironments and evaluate the CEL expression against environment.attributes. Surviving environments proceed.
- Filter pools by nodeClaim within surviving environments. For each candidate environment, walk its node pools and evaluate the CEL expression against pool.nodeAttributes. Surviving pools proceed.
- Match devices by deviceClaim within surviving pools. Evaluate the CEL expression against pool.deviceAttributes. If the pool's devices satisfy the selector and the pool has capacity for count allocation units placed perNode to a node, a match is found.
- Compose the placement. Select an environment and a pool within it, then emit a ModelPlacement that produces the LLMInferenceService.
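The cascade above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Modelplane's implementation: the claims are plain Python predicates standing in for CEL expressions, and the type names (`Environment`, `Pool`, `Profile`), the `devices_per_node` and `max_nodes` fields, and the `place` function are all hypothetical. It returns the first match rather than ranking candidates.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Attrs = dict[str, object]
Predicate = Callable[[Attrs], bool]


def match_anything(_: Attrs) -> bool:
    """Default for an omitted claim: every candidate passes."""
    return True


@dataclass
class Pool:
    name: str
    node_attributes: Attrs
    device_attributes: Attrs
    devices_per_node: int  # hypothetical: devices on each node of the pool
    max_nodes: int         # hypothetical: autoscaling ceiling


@dataclass
class Environment:
    name: str
    attributes: Attrs
    pools: list[Pool] = field(default_factory=list)


@dataclass
class Profile:
    count: int
    per_node: int
    device_claim: Predicate
    environment_claim: Predicate = match_anything
    node_claim: Predicate = match_anything


def place(profile: Profile, envs: list[Environment]) -> Optional[tuple[str, str]]:
    """Return (environment name, pool name) for the first match, else None."""
    nodes_needed = -(-profile.count // profile.per_node)  # ceiling division
    for env in envs:
        # Level 1: environment claim filters whole environments.
        if not profile.environment_claim(env.attributes):
            continue
        for pool in env.pools:
            # Level 2: node claim filters pools within the environment.
            if not profile.node_claim(pool.node_attributes):
                continue
            # Level 3: device claim plus a capacity check on the pool.
            if not profile.device_claim(pool.device_attributes):
                continue
            if (pool.devices_per_node >= profile.per_node
                    and pool.max_nodes >= nodes_needed):
                return env.name, pool.name
    return None
```

Because each level only runs for candidates that survived the previous one, a profile that omits the environment and node claims degrades to a plain device match without any extra configuration.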
Most ClusterModels won't need all three claims. A simple Gemma profile might only have a deviceClaim — the environment and node claims default to "match anything" if omitted. The structure is there when needed but doesn't add ceremony to simple cases.
# Simple Gemma profile: only deviceClaim, others default
serving:
  - name: vllm-fp8
    deviceClaim:
      requests:
        - name: gpu
          count: 1
          perNode: 1
          selector:
            cel: |
              device.attributes["gpu.nvidia.com/memory"].compareTo(quantity("40Gi")) >= 0 &&
              "fp8-e4m3" in device.attributes["gpu.nvidia.com/precisions"]
    engine:
      name: vLLM
      image: vllm/vllm-openai:v0.8.0
      args: ["--quantization=fp8"]
Why this is three levels, not one
A flat attribute space (everything as device attributes) would work mechanically but loses semantic structure:
- Environment-level facts would be duplicated to every device. Region, compliance posture, tier — published on every device's attribute set. Redundant and obscures the federation layer's role.
- Node-level facts get conflated with device-level facts. Networking fabric is a node property, not a device property. NIC and GPU on the same node share the fabric; they don't each have their own. Forcing both into device attributes blurs the distinction.
- The federation match disappears as a first-class concept. Modelplane's distinguishing value is federation across environments. The first match question is which environment, before any device matters. Promoting environment-level matching to first-class makes the federation visible in the API.
- Match performance suffers. A flat space requires every claim to filter the full attribute set; a leveled space narrows candidates at each step. For large fleets, this matters.
The three-level structure aligns with how DRA itself thinks about scheduling — filter nodes first by node properties, then satisfy device claims on surviving nodes — but adds the environment level on top for federation.
A future consideration: discovered attributes from runtime DRA
Worth flagging but not part of this design: as DRA drivers become widely deployed, Modelplane could optionally read attributes directly from runtime ResourceSlices on Kubernetes-backed environments where the driver runs, rather than requiring operators to declare them. This is a useful enhancement but not foundational — Modelplane's matching needs to happen at scheduling time before pools have running nodes (a pool at scale-zero has no ResourceSlices to read), so declared attributes in spec remain authoritative regardless. Runtime DRA data could be useful for verification and drift detection (does the actual hardware match what was declared?), but the matching path operates on declared attributes. The proposed API treats declaration as the source of truth; future work can layer discovery on top without changing how matching works.
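As a rough illustration of the drift-detection idea, the check could be a comparison of declared attributes against whatever a runtime source reports. The sketch below assumes the discovered attributes have already been flattened into the same key/value shape as the declared ones; it makes no Kubernetes API calls, and `detect_drift` is a hypothetical name.

```python
_MISSING = object()


def detect_drift(
    declared: dict[str, object],
    discovered: dict[str, object],
) -> dict[str, tuple[object, object]]:
    """Return {key: (declared, discovered)} for every declared attribute
    that the runtime data contradicts or does not report at all.

    Keys present only in `discovered` are ignored: declaration stays the
    source of truth, so extra runtime facts are not treated as drift.
    """
    drift: dict[str, tuple[object, object]] = {}
    for key, want in declared.items():
        got = discovered.get(key, _MISSING)
        if got is _MISSING:
            drift[key] = (want, None)
        elif got != want:
            drift[key] = (want, got)
    return drift
```

A pool at scale-zero would simply have an empty `discovered` mapping, which reports every declared key as unverified without affecting matching.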
Why this approach
A few reasons this approach seems worth pursuing:
Aligns with where Kubernetes is heading. DRA is GA and is the cornerstone of the Kubernetes AI Conformance program. Vendor drivers (NVIDIA, AMD, Google TPU) publish attributes in their respective namespaces. Cross-vendor topology is being standardized through KEPs in the resource.kubernetes.io/ namespace. We get the benefit of ongoing standardization work.
No parallel vocabulary to maintain. Modelplane doesn't need to define what architecture: Hopper means. NVIDIA's driver does. When NVIDIA ships Blackwell, the attribute value Blackwell becomes available without Modelplane API changes. Same for new precision formats, new memory tiers, new compute capabilities.
Multi-vendor support is straightforward. AMD pools publish gpu.amd.com/* attributes; NVIDIA pools publish gpu.nvidia.com/*; TPU pools publish tpu.google.com/*. The API accommodates all of them. ClusterModels with vendor-specific profiles work; ClusterModels with vendor-agnostic profiles using resource.kubernetes.io/* topology attributes work. Vendor abstraction is solved at the DRA layer rather than reinvented in Modelplane.
Three-level matching makes federation explicit. Environment, node, and device levels each have clear semantic content. The federation layer (environment-level matching) is first-class in the API rather than buried in label selectors.
Provisioning and matching have clear separation. The spec acknowledges that a node pool serves two different concerns instead of blending them. CSP-specific provisioning fields don't pollute the matching surface. The matching surface uses standard vocabulary that doesn't depend on which CSP the pool runs on.
KServe and DRA integration is the eventual path. Today KServe consumes nvidia.com/gpu extended resources. As DRA driver adoption grows, KServe will likely add ResourceClaim support, at which point Modelplane composing LLMInferenceService with DRA claims becomes natural. Matching can happen at Modelplane's level today, with LLMInferenceService still using extended resources, and migrate to DRA claims when KServe supports them.
CEL expressions handle comparison ambiguity. Numeric comparisons (memory >= quantity("141Gi")), set membership ("fp8-e4m3" in precisions), and string equality are all unambiguous. No more "is interconnect: nvlink exact-match or at-least?" The CEL expression says exactly what it means.
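To see why typed quantity comparison matters, consider what happens if memory values are compared as strings. The Python sketch below is only an illustration of the semantics that CEL's `quantity()` provides; `parse_quantity` is a hypothetical helper that handles a small subset of Kubernetes quantity suffixes, not the full grammar.

```python
from decimal import Decimal

# Subset of Kubernetes quantity suffixes (binary and decimal SI).
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_quantity(text: str) -> Decimal:
    """Convert a quantity string such as '141Gi' into a plain number."""
    # Try two-letter suffixes before one-letter ones.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


# Lexicographic comparison gets the answer wrong; typed comparison does not.
string_says_enough = "141Gi" >= "80Gi"
typed_says_enough = parse_quantity("141Gi") >= parse_quantity("80Gi")
```

Here `string_says_enough` is false because `"1"` sorts before `"8"`, while `typed_says_enough` is true, which is the result a memory requirement actually needs.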
Questions for the design
A few specifics the design needs to work through:
Built-in machine type catalog for derived attributes. For Modelplane-provisioned environments, the mapping from machine types (a3-megagpu-8g, p5.48xlarge, etc.) to expected attributes is a bounded but real maintenance burden. Where does this catalog live? Probably in the Modelplane repo or a sibling catalog repo, with a clear contribution model and update cadence as cloud providers add machine types.
Provisioning translation per CSP. Modelplane's provisioning.gke section needs to translate to either GKE ComputeClass resources or direct GKE API calls. provisioning.eks needs to translate to Karpenter NodePool or managed node group definitions. Each translation is discrete CSP-specific code. Worth deciding the boundary — does Modelplane invoke cloud APIs directly, or does it create cloud-native resources (ComputeClass, Karpenter NodePool) that the cloud's own controllers reconcile? The latter is probably cleaner long-term.
Modelplane-specific attribute namespace. Some facts don't map cleanly to vendor or resource.kubernetes.io/ namespaces — multi-node networking fabric type (when not standardized), infrastructure provider context, operational tier, pricing model. These need a modelplane.ai/* namespace. The vocabulary is bounded but needs design.
Override semantics for derived attributes. For Modelplane-provisioned environments, derived attributes have defaults but the operator may need to override them (unusual configurations, custom networking). The override mechanism needs to be clear.
Capacity versus attributes distinction. DRA distinguishes capacity (consumable resources like memory bytes) from attributes (descriptive properties like architecture). Modelplane's pool spec should follow the same distinction. Per-device memory is capacity; architecture is an attribute.
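The practical difference is that an allocation consumes capacity but never changes attributes. A minimal sketch of that behaviour, using a hypothetical `Device` type with capacities held as plain integers (for example bytes):

```python
from dataclasses import dataclass


@dataclass
class Device:
    attributes: dict[str, str]  # descriptive, never consumed
    capacity: dict[str, int]    # consumable, shrinks as claims are granted

    def allocate(self, requests: dict[str, int]) -> bool:
        """Grant the request only if every requested capacity is available.

        The check runs before any deduction, so a rejected request leaves
        the remaining capacity untouched.
        """
        if any(self.capacity.get(key, 0) < amount
               for key, amount in requests.items()):
            return False
        for key, amount in requests.items():
            self.capacity[key] -= amount
        return True
```

Matching on `attributes` answers "is this the right kind of device", while `allocate` answers "is there enough of it left", which is why the two belong in separate fields of the pool spec.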
Engagement with the broader ecosystem
The standardization work for cross-vendor attributes (KEP-4381, KEP-5316, KEP-5491) is ongoing in WG-Device-Management with NVIDIA, AMD, and Google participating. Modelplane has a federation-layer perspective that could contribute useful input on what attributes matter for cross-environment placement and what's missing from current standardization. Engaging WG-Device-Management and SIG-Scheduling during the design phase — through issues, KEPs, or working group meetings — is worth considering, since aligning with the ecosystem early is meaningfully cheaper than conforming later.
What this changes about the existing API
This is a meaningful API rework on the InferenceEnvironment side. The main shifts:
- Remove the existing GKE-specific gpu block at the top level of node pools
- Add a provisioning section that's CSP-namespaced
- Add environment-level attributes on InferenceEnvironment.spec
- Add nodeAttributes on each node pool for pool-uniform node properties
- Add deviceAttributes on each node pool for DRA-conformant device properties
- Update status to mirror declared attributes
ClusterModel serving profiles need updating in tandem (per #topology-and-capabilities-issue) to use three-claim CEL expressions matching against attributes at each level. The two issues should be designed together since the matching algorithm spans both.
This isn't a small change, but the alternatives (inventing parallel vocabularies, conflating provisioning with matching, fighting against where Kubernetes is heading for accelerator management) are worse. Better to do this work once, properly, with the ecosystem rather than against it.
References