Align hardware capabilities design with Kubernetes Dynamic Resource Allocation

### What problem are you facing?

Today's InferenceEnvironment node pool spec mixes hardware capabilities and provisioning details in ways that make matching brittle and don't align with where Kubernetes is heading for accelerator management.

We've been discussing how to express hardware capabilities for matching against ClusterModel requirements (see #55). The core question is what vocabulary and structure to use. Designing this from scratch with our own enums (`interconnect: nvlink`, `precisionSupport: [fp8]`, etc.) means inventing a parallel taxonomy that we have to maintain ourselves, that diverges from vendor terminology, and that doesn't extend cleanly to multi-vendor scenarios.

There's likely a better answer available: Dynamic Resource Allocation (DRA) reached stable in Kubernetes 1.34 and is now the cornerstone of the Kubernetes AI Conformance program. Major drivers exist for NVIDIA (`gpu.nvidia.com`), AMD (`gpu.amd.com`), Google TPUs, and networking (DRANET). Vendor attribute schemas are stable and standardized within each vendor. Cross-vendor topology attributes (`resource.kubernetes.io/pcieRoot`, NUMA, etc.) are being standardized through KEP-4381, KEP-5316, and KEP-5491 by WG-Device-Management.

This design sketch explores adopting DRA conventions directly rather than inventing parallel ones.

There's also a second concern. The node pool spec serves two purposes today, and they're conflated in a way that makes neither clean:

1. **Matching** — ClusterModel serving profiles match against pool capabilities. The deploy function decides which pool can host a deployment by comparing requirements to capabilities.
2. **Provisioning** — for Modelplane-provisioned environments, the node pool spec needs enough information to actually create the pool in the underlying CSP (GKE, EKS, etc.).

Cloud providers are evolving toward separating these concerns. GKE's ComputeClass is the provisioning-side abstraction; ResourceSlice (DRA) is the matching-side abstraction. EKS Karpenter's NodePool is similar to ComputeClass. We should follow the same separation rather than fight against it.

### How could Modelplane help solve your problem?

Restructure the InferenceEnvironment node pool spec to separate provisioning from matching, align matching attributes with DRA conventions, and structure the matching surface as three distinct levels — environment, node, and device.

#### Three levels of attributes

Different match concerns belong at different levels. Conflating them into a flat space loses the semantic structure that makes federation matching tractable.

**Environment-level attributes** describe the InferenceEnvironment as a whole. They don't vary across nodes or devices. Examples: region, cloud provider, operational tier, compliance posture, network access, data residency.

**Node-level attributes** describe nodes within a pool. They're uniform across nodes in a single pool but vary across pools (which is why pools exist as separate resources). Examples: instance type, networking fabric, pricing model, failure domain.

**Device-level attributes** describe individual GPUs or accelerator devices. They follow DRA's vendor-namespaced conventions. Examples: architecture, memory, precision support, compute capability.

The API reflects this:

```yaml
apiVersion: modelplane.ai/v1alpha1
kind: InferenceEnvironment
metadata:
  name: prod-coreweave-us-east

spec:
  cluster:
    source: Existing
    existing:
      secretRef:
        name: cw-cluster-kubeconfig
  
  # Environment-level attributes
  attributes:
    "modelplane.ai/region":
      string: "us-east"
    "modelplane.ai/provider":
      string: "coreweave"
    "modelplane.ai/tier":
      string: "production"
    "modelplane.ai/compliance":
      list:
        string: ["soc2-type2", "iso-27001"]
    "modelplane.ai/networkAccess":
      string: "internet-routable"
  
  nodePools:
  - name: frontier-multinode
    
    # Provisioning section — used to create the node pool
    # CSP-namespaced because each cloud has different provisioning primitives
    # Optional for BYO environments where the pool already exists
    provisioning:
      gke:
        machineType: a3-ultragpu-8g
        gpu:
          type: nvidia-h200-141gb
          count: 8
        networking:
          acceleratorNetworkProfile: auto
        pricing: on-demand
        autoscaling:
          min: 0
          max: 4
    
    # Node-level attributes: uniform across nodes in this pool
    nodeAttributes:
      "node.kubernetes.io/instance-type":
        string: "a3-ultragpu-8g"
      "modelplane.ai/pricing":
        string: "on-demand"
      "modelplane.ai/networking":
        string: "ib-quantum-2"
      "modelplane.ai/failureDomain":
        string: "us-east1-a"
    
    # Device-level attributes — follow DRA vendor conventions
    deviceAttributes:
      "gpu.nvidia.com/architecture":
        string: "Hopper"
      "gpu.nvidia.com/productName":
        string: "NVIDIA H200 141GB HBM3e"
      "gpu.nvidia.com/cudaComputeCapability":
        version: "9.0.0"
      "gpu.nvidia.com/memory":
        quantity: "141Gi"
      "gpu.nvidia.com/perDeviceCount":
        int: 8
      
      # Cross-vendor topology
      "resource.kubernetes.io/pcieRoot":
        list:
          string: ["pci0000:00", "pci0000:80"]
```

The two pool sections (provisioning and attributes) and the three-level attribute structure (environment, node, device) serve different purposes:

**Provisioning** is a one-time concern. It tells Modelplane (or operators using gcloud, eksctl, etc.) what infrastructure to create. It is CSP-specific because creating a node pool on GKE differs from doing so on EKS or CoreWeave.

**Environment attributes** describe environment-wide properties. The federation match (which environment to deploy to) reads from these.

**Node attributes** describe pool-specific properties that aren't device-specific. The placement match (which pool within a chosen environment) reads from these.

**Device attributes** describe per-GPU properties using DRA conventions. The capacity match (how many devices, with what hardware features) reads from these.

#### How matching uses these levels

ClusterModel serving profiles have three corresponding claims, evaluated as a cascade:

```yaml
serving:
- name: vllm-h200-multinode
  
  # Environment-level requirements (filters environments)
  environmentClaim:
    selector:
      cel: |
        environment.attributes["modelplane.ai/tier"] == "production" &&
        "soc2-type2" in environment.attributes["modelplane.ai/compliance"]
  
  # Node-level requirements (filters pools within matched environments)
  nodeClaim:
    selector:
      cel: |
        node.attributes["modelplane.ai/networking"].startsWith("ib-")
  
  # Device-level requirements (matches devices within matched pools)
  deviceClaim:
    requests:
    - name: gpus
      count: 16
      perNode: 8
      selector:
        cel: |
          device.attributes["gpu.nvidia.com/architecture"] == "Hopper" &&
          device.capacity["gpu.nvidia.com/memory"].compareTo(quantity("141Gi")) >= 0
      constraints:
      - matchAttribute: "node"
  
  parallelism:
    tensor: 8
    pipeline: 2
  
  engine:
    name: vLLM
    image: vllm/vllm-openai:v0.8.0
    args: [...]
```

The matching cascade:

1. **Filter environments by environmentClaim.** Walk all InferenceEnvironments. Evaluate the CEL expression against `environment.attributes`. Surviving environments proceed.
2. **Filter pools by nodeClaim within surviving environments.** For each candidate environment, walk node pools. Evaluate CEL against `pool.nodeAttributes`. Surviving pools proceed.
3. **Match devices by deviceClaim within surviving pools.** Evaluate CEL against `pool.deviceAttributes`. If the pool's devices satisfy the selector and the pool can supply `count` devices in groups of `perNode` per node, the pool matches.
4. **Compose the placement.** Select an environment, a pool within it, and emit a ModelPlacement that produces the LLMInferenceService.
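
The cascade above can be sketched in a few lines of Python. This is a minimal illustration, not Modelplane's implementation: plain Python callables stand in for compiled CEL expressions, and the names `ServingProfile`, `NodePool`, `Environment`, `find_placement`, `devices_per_node`, and `max_nodes` are hypothetical. The memory comparison from the device selector is left out to keep the sketch short, and an omitted claim defaults to a selector that matches anything.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Attrs = dict[str, object]
Selector = Callable[[Attrs], bool]  # stand-in for a compiled CEL expression


def match_anything(_: Attrs) -> bool:
    """Default selector for an omitted claim."""
    return True


@dataclass
class NodePool:
    name: str
    node_attributes: Attrs
    device_attributes: Attrs
    devices_per_node: int  # hypothetical capacity field
    max_nodes: int         # hypothetical, e.g. the autoscaling maximum


@dataclass
class Environment:
    name: str
    attributes: Attrs
    node_pools: list[NodePool] = field(default_factory=list)


@dataclass
class ServingProfile:
    count: int     # total devices requested
    per_node: int  # devices required on each node
    environment_claim: Selector = match_anything
    node_claim: Selector = match_anything
    device_claim: Selector = match_anything


def find_placement(profile: ServingProfile,
                   environments: list[Environment]) -> Optional[tuple[str, str]]:
    """Return the first (environment, pool) pair that satisfies all three claims."""
    for env in environments:
        # Step 1: filter environments.
        if not profile.environment_claim(env.attributes):
            continue
        for pool in env.node_pools:
            # Step 2: filter pools by node-level attributes.
            if not profile.node_claim(pool.node_attributes):
                continue
            # Step 3: check device attributes, then capacity.
            if not profile.device_claim(pool.device_attributes):
                continue
            if pool.devices_per_node < profile.per_node:
                continue
            nodes_needed = -(-profile.count // profile.per_node)  # ceiling division
            if nodes_needed <= pool.max_nodes:
                # Step 4: a real implementation would emit a ModelPlacement here.
                return env.name, pool.name
    return None


# Usage: the vllm-h200-multinode profile, with lambdas in place of CEL.
profile = ServingProfile(
    count=16,
    per_node=8,
    environment_claim=lambda a: a.get("modelplane.ai/tier") == "production"
    and "soc2-type2" in a.get("modelplane.ai/compliance", []),
    node_claim=lambda a: str(a.get("modelplane.ai/networking", "")).startswith("ib-"),
    device_claim=lambda a: a.get("gpu.nvidia.com/architecture") == "Hopper",
)
pool = NodePool(
    name="frontier-multinode",
    node_attributes={"modelplane.ai/networking": "ib-quantum-2"},
    device_attributes={"gpu.nvidia.com/architecture": "Hopper"},
    devices_per_node=8,
    max_nodes=4,
)
env = Environment(
    name="prod-coreweave-us-east",
    attributes={"modelplane.ai/tier": "production",
                "modelplane.ai/compliance": ["soc2-type2", "iso-27001"]},
    node_pools=[pool],
)
print(find_placement(profile, [env]))
```

Returning the first match keeps the sketch simple; choosing among several surviving pools (by cost, locality, or load) is a separate policy question that the cascade itself does not answer.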

Most ClusterModels won't need all three claims. A simple Gemma profile might only have a deviceClaim — the environment and node claims default to "match anything" if omitted. The structure is there when needed but doesn't add ceremony to simple cases.

```yaml
# Simple Gemma profile — only deviceClaim, others default
serving:
- name: vllm-fp8
  deviceClaim:
    requests:
    - name: gpu
      count: 1
      perNode: 1
      selector:
        cel: |
          device.capacity["gpu.nvidia.com/memory"].compareTo(quantity("40Gi")) >= 0 &&
          "fp8-e4m3" in device.attributes["gpu.nvidia.com/precisions"]
  engine:
    name: vLLM
    image: vllm/vllm-openai:v0.8.0
    args: ["--quantization=fp8"]
```

#### Why this is three levels, not one

A flat attribute space (everything as device attributes) would work mechanically but loses semantic structure:

- **Environment-level facts would be duplicated to every device.** Region, compliance posture, tier — published on every device's attribute set. Redundant and obscures the federation layer's role.
- **Node-level facts get conflated with device-level facts.** Networking fabric is a node property, not a device property. NIC and GPU on the same node share the fabric; they don't each have their own. Forcing both into device attributes blurs the distinction.
- **The federation match disappears as a first-class concept.** Modelplane's distinguishing value is federation across environments. The first match question is which environment, before any device matters. Promoting environment-level matching to first-class makes the federation visible in the API.
- **Match performance suffers.** A flat space requires every claim to filter the full attribute set; a leveled space narrows candidates at each step. For large fleets, this matters.

The three-level structure aligns with how DRA itself thinks about scheduling — filter nodes first by node properties, then satisfy device claims on surviving nodes — but adds the environment level on top for federation.
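
To make the narrowing argument concrete, here is a back-of-the-envelope count of selector evaluations. The fleet sizes and survivor counts are hypothetical numbers chosen only for illustration, and the model assumes one evaluation per device record in the flat case versus one per environment, one per pool in surviving environments, and one per surviving pool in the leveled case (device attributes are declared once per pool).

```python
def flat_evaluations(envs: int, pools_per_env: int, devices_per_pool: int) -> int:
    """Flat space: every device record carries every attribute and is checked."""
    return envs * pools_per_env * devices_per_pool


def leveled_evaluations(envs: int, pools_per_env: int,
                        surviving_envs: int, surviving_pools: int) -> int:
    """Leveled space: each step only looks at what survived the previous one."""
    environment_checks = envs
    node_checks = surviving_envs * pools_per_env
    device_checks = surviving_pools  # device attributes are declared per pool
    return environment_checks + node_checks + device_checks


# Hypothetical fleet: 50 environments, 10 pools each, 64 devices per pool.
flat = flat_evaluations(envs=50, pools_per_env=10, devices_per_pool=64)
leveled = leveled_evaluations(envs=50, pools_per_env=10,
                              surviving_envs=5, surviving_pools=8)
print(flat, leveled)
```

The exact ratio depends entirely on how selective each claim is; the point is only that the leveled count grows with the number of survivors rather than with the full fleet.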

### A future consideration: discovered attributes from runtime DRA

Worth flagging but not part of this design: as DRA drivers become widely deployed, Modelplane could optionally read attributes directly from runtime ResourceSlices on Kubernetes-backed environments where the driver runs, rather than requiring operators to declare them. This is a useful enhancement but not foundational — Modelplane's matching needs to happen at scheduling time before pools have running nodes (a pool at scale-zero has no ResourceSlices to read), so declared attributes in spec remain authoritative regardless. Runtime DRA data could be useful for verification and drift detection (does the actual hardware match what was declared?), but the matching path operates on declared attributes. The proposed API treats declaration as the source of truth; future work can layer discovery on top without changing how matching works.
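
As a rough illustration of the verification idea, drift detection could be as simple as diffing declared attributes against whatever a runtime source reports. The sketch below is an assumption about how that might look, not part of the proposed API: `detect_drift` is a hypothetical helper, attributes are plain dictionaries, and `None` models a scale-zero pool with no ResourceSlices to read.

```python
from typing import Optional

Attrs = dict[str, object]


def detect_drift(declared: Attrs,
                 discovered: Optional[Attrs]) -> dict[str, tuple[object, object]]:
    """Map each mismatched attribute to (declared value, discovered value).

    Only declared attributes are checked; extra discovered attributes are
    ignored. A pool with nothing discovered yet (None) reports no drift,
    because declaration stays authoritative.
    """
    if discovered is None:
        return {}
    drift: dict[str, tuple[object, object]] = {}
    for name, want in declared.items():
        got = discovered.get(name)  # None when the driver does not publish it
        if got != want:
            drift[name] = (want, got)
    return drift


declared = {
    "gpu.nvidia.com/architecture": "Hopper",
    "gpu.nvidia.com/productName": "NVIDIA H200 141GB HBM3e",
}
discovered = {
    "gpu.nvidia.com/architecture": "Hopper",
    "gpu.nvidia.com/productName": "NVIDIA H100 80GB HBM3",
}
print(detect_drift(declared, discovered))
```

Because the result only feeds status or alerts, the matching path is unaffected whether or not discovery is available.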

### Why this approach

A few reasons this approach seems worth pursuing:

**Aligns with where Kubernetes is heading.** DRA is GA and is the cornerstone of the Kubernetes AI Conformance program. Vendor drivers (NVIDIA, AMD, Google TPU) publish attributes in their respective namespaces. Cross-vendor topology is being standardized through KEPs in the `resource.kubernetes.io/` namespace. We get the benefit of ongoing standardization work.

**No parallel vocabulary to maintain.** Modelplane doesn't need to define what `architecture: Hopper` means. NVIDIA's driver does. When NVIDIA ships Blackwell, the attribute value `Blackwell` becomes available without Modelplane API changes. Same for new precision formats, new memory tiers, new compute capabilities.

**Multi-vendor support is straightforward.** AMD pools publish `gpu.amd.com/*` attributes; NVIDIA pools publish `gpu.nvidia.com/*`; TPU pools publish `tpu.google.com/*`. The API accommodates all of them. ClusterModels with vendor-specific profiles work; ClusterModels with vendor-agnostic profiles using `resource.kubernetes.io/*` topology attributes work. Vendor abstraction is solved at the DRA layer rather than reinvented in Modelplane.

**Three-level matching makes federation explicit.** Environment, node, and device levels each have clear semantic content. The federation layer (environment-level matching) is first-class in the API rather than buried in label selectors.

**Provisioning and matching have clear separation.** The dual role of node pool spec is honest about being two different concerns. CSP-specific provisioning fields don't pollute the matching surface. The matching surface uses standard vocabulary that doesn't depend on which CSP the pool runs on.

**KServe and DRA integration is the eventual path.** Today KServe consumes `nvidia.com/gpu` extended resources. As DRA driver adoption grows, KServe will likely add ResourceClaim support, at which point Modelplane composing LLMInferenceService with DRA claims becomes natural. Matching can happen at Modelplane's level today, with LLMInferenceService still using extended resources, and migrate to DRA claims when KServe supports them.

**CEL expressions remove comparison ambiguity.** Numeric comparisons (`memory.compareTo(quantity("141Gi")) >= 0`), set membership (`"fp8-e4m3" in precisions`), and string equality are all unambiguous. There is no more guessing whether `interconnect: nvlink` means exact match or at-least: the CEL expression says exactly what it means.
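
For readers unfamiliar with quantity comparison, the sketch below shows what `compareTo` on two quantities amounts to. It is an illustrative stand-in, not the Kubernetes implementation: `parse_quantity` and `compare_to` are hypothetical helpers, and only plain numbers plus the binary (`Ki` to `Ei`) and decimal (`k` to `E`) suffixes are handled.

```python
from decimal import Decimal

# Subset of Kubernetes quantity suffixes; milli-units are not handled here.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50, "Ei": 2**60,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15, "E": 10**18,
}


def parse_quantity(text: str) -> Decimal:
    """Convert a quantity string such as "141Gi" into a number of base units."""
    for suffix, factor in _SUFFIXES.items():
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * factor
    return Decimal(text)


def compare_to(left: str, right: str) -> int:
    """Return -1, 0, or 1, mirroring the compareTo convention."""
    a, b = parse_quantity(left), parse_quantity(right)
    return (a > b) - (a < b)


# A device with 141Gi satisfies a ">= 40Gi" requirement.
print(compare_to("141Gi", "40Gi") >= 0)
```

Note that `141G` and `141Gi` are different amounts, which is one reason a typed comparison is safer than string equality on memory sizes.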

### Questions for the design

A few specifics the design needs to work through:

**Built-in machine type catalog for derived attributes.** For Modelplane-provisioned environments, the mapping from machine types (`a3-megagpu-8g`, `p5.48xlarge`, etc.) to expected attributes is bounded but real maintenance. Where does this catalog live? Probably in the Modelplane repo or a sibling catalog repo, with clear contribution model and update cadence as cloud providers add machine types.

**Provisioning translation per CSP.** Modelplane's `provisioning.gke` section needs to translate to either GKE ComputeClass resources or direct GKE API calls. `provisioning.eks` needs to translate to Karpenter NodePool or managed node group definitions. Each translation is discrete CSP-specific code. Worth deciding the boundary — does Modelplane invoke cloud APIs directly, or does it create cloud-native resources (ComputeClass, Karpenter NodePool) that the cloud's own controllers reconcile? The latter is probably cleaner long-term.

**Modelplane-specific attribute namespace.** Some facts don't map cleanly to vendor or `resource.kubernetes.io/` namespaces — multi-node networking fabric type (when not standardized), infrastructure provider context, operational tier, pricing model. These need a `modelplane.ai/*` namespace. The vocabulary is bounded but needs design.

**Override semantics for derived attributes.** For Modelplane-provisioned environments, derived attributes have defaults but the operator may need to override them (unusual configurations, custom networking). The override mechanism needs to be clear.

**Capacity versus attributes distinction.** DRA distinguishes `capacity` (consumable resources like memory bytes) from `attributes` (descriptive properties like architecture). Modelplane's pool spec should follow the same distinction. Per-device memory is capacity; architecture is an attribute.

### Engagement with the broader ecosystem

The standardization work for cross-vendor attributes (KEP-4381, KEP-5316, KEP-5491) is ongoing in WG-Device-Management with NVIDIA, AMD, and Google participating. Modelplane has a federation-layer perspective that could contribute useful input on what attributes matter for cross-environment placement and what's missing from current standardization. Engaging WG-Device-Management and SIG-Scheduling during the design phase — through issues, KEPs, or working group meetings — is worth considering, since aligning with the ecosystem early is meaningfully cheaper than conforming later.

### What this changes about the existing API

This is a meaningful API rework on the InferenceEnvironment side. The main shifts:

- Remove the existing GKE-specific `gpu` block at the top level of node pools
- Add `provisioning` section that's CSP-namespaced
- Add environment-level `attributes` on InferenceEnvironment.spec
- Add `nodeAttributes` on each node pool for pool-uniform node properties
- Add `deviceAttributes` on each node pool for DRA-conformant device properties
- Update status to mirror declared attributes

ClusterModel serving profiles need updating in tandem (per #topology-and-capabilities-issue) to use three-claim CEL expressions matching against attributes at each level. The two issues should be designed together since the matching algorithm spans both.

This isn't a small change, but the alternatives (inventing parallel vocabularies, conflating provisioning with matching, fighting against where Kubernetes is heading for accelerator management) are worse. Better to do this work once, properly, with the ecosystem rather than against it.

### References

- DRA in Kubernetes: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/
- KEP-4381 (DRA structured parameters, GA in 1.34): https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4381-dra-structured-parameters
- KEP-5316 (Standard device attributes, in progress): https://github.com/kubernetes/enhancements/issues/5316
- KEP-5491 (List types for attributes, Alpha in 1.36): https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/5491-dra-list-types-for-attributes
- NVIDIA DRA driver: https://github.com/NVIDIA/k8s-dra-driver
- AMD DRA driver: https://github.com/ROCm/k8s-dra-driver
- GKE DRA documentation: https://cloud.google.com/kubernetes-engine/docs/concepts/about-dynamic-resource-allocation
- GKE ComputeClasses: https://cloud.google.com/kubernetes-engine/docs/concepts/about-compute-classes
- WG-Device-Management: https://github.com/kubernetes/community/tree/master/wg-device-management

