`nodeSelector` should describe hardware as a list of DRA devices #103

### What problem are you facing?

A `ModelDeployment` `spec.nodeSelector.cel` is a single CEL expression matched against a pool's merged attributes and capacity, written in a flat dialect: `attributes["gpu.nvidia.com/cudaComputeCapability"]...` and `capacity["gpu.nvidia.com/memory"]...`. The design says we'll translate it into a DRA `ResourceClaim` as "a straightforward split of `domain/name` into `device.attributes["domain"].name`."

Two problems.

**It looks like DRA but doesn't work like DRA.** The flat `attributes[...]`/`capacity[...]` shape, the `version(...)` wrapping, and the `version()` (vs DRA's `semver()`) naming all diverge from real DRA CEL. A user familiar with DRA will write `device.attributes["gpu.nvidia.com"].architecture` and `.isGreaterThan(semver("9.0.0"))` and be surprised when it doesn't work. The design already leans hard on DRA (KEP-4381, typed attributes, qualified keys), so the half-resemblance is the worst of both worlds.

**It can't describe more than one device.** A node can have a GPU, an InfiniBand NIC, and potentially other accelerators. In real DRA these are distinct devices, often from distinct drivers, and a `ResourceClaim` makes one request per device. You can't express "a GPU like X and a NIC like Y" in a single selector. Our flat model collapses everything onto one synthetic device, so a platform team can't fully describe an `InferenceClass`'s hardware, an ML team can't filter to a pool that has a specific GPU *and* NIC, and the `ResourceClaim` translation isn't actually mechanical for multi-device requirements.

The `modelplane.ai/*` fleet-attribute convention (e.g. `modelplane.ai/networkInterNode: infiniband`) was a workaround for this: inter-node networking had nowhere to live except a synthetic pool attribute, distinguished from real device attributes only by a key prefix.

GPUs are the only hardware real DRA drivers expose as devices today (NVIDIA's driver exposes `GPU` and `ComputeDomain`; InfiniBand/RDMA is still handled via SR-IOV/device-plugin, not DRA). So a NIC "device" is partly aspirational. But unlike DRA, we don't depend on a real driver publishing facts: the platform team authors the `InferenceClass` by hand. We can model InfiniBand as a device even where no DRA driver does. The constraint that replaces "did a driver publish this" is "is this device claimed via DRA, or synthetic and enforced only by our scheduler."

### How could Modelplane help solve your problem?

Make `InferenceClass` and `nodeSelector` describe hardware as a list of DRA-style devices, with CEL that *is* DRA CEL.

`InferenceClass` declares an array of devices, each with attributes and capacity (DRA's typed schema) and a driver. A `claim` discriminator says how the device is claimed: `DRA` (default; emitted as a request in a real `ResourceClaim`) or `Synthetic` (described for scheduling only, never claimed). It's an enum so future claim mechanisms (device plugin, extended resource) can be added without a breaking change.

```yaml
apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: gke-h200-8x-a3-ib
spec:
  provisioning: { ... }   # unchanged
  devices:
  - name: gpu
    claim: DRA                      # default; emitted as a DeviceRequest in the ResourceClaim
    driver: gpu.nvidia.com
    count: 8
    attributes:
      architecture: { string: Hopper }
      cudaComputeCapability: { version: "9.0.0" }
    capacity:
      memory: { value: "141Gi" }
  - name: ib
    claim: Synthetic                # described for scheduling only; not in the ResourceClaim
    driver: nic.nvidia.com          # no real DRA driver yet; we author it anyway
    count: 8
    attributes:
      linkType: { string: infiniband }
    capacity:
      bandwidth: { value: "400Gi" }
```

This replaces the `modelplane.ai/*` convention. The real/synthetic boundary becomes an explicit per-device `claim` discriminator instead of a key-prefix heuristic.

`nodeSelector` becomes a list of device requests, each with its own DRA CEL selectors and count, mirroring a `ResourceClaim`'s requests:

```yaml
# Minimal: one GPU, >= 80Gi.
nodeSelector:
  devices:
  - name: gpu
    selectors:
    - cel: device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("80Gi")) >= 0
```

```yaml
# Maximal: 8 Hopper GPUs >= 141Gi, and 8 InfiniBand NICs >= 400Gb/s.
nodeSelector:
  devices:
  - name: gpu
    count: 8
    selectors:
    - cel: |
        device.driver == "gpu.nvidia.com" &&
        device.attributes["gpu.nvidia.com"].architecture == "Hopper" &&
        device.attributes["gpu.nvidia.com"].cudaComputeCapability.compareTo(semver("9.0.0")) >= 0 &&
        device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("141Gi")) >= 0
  - name: nic
    count: 8
    selectors:
    - cel: |
        device.driver == "nic.nvidia.com" &&
        device.attributes["nic.nvidia.com"].linkType == "infiniband" &&
        device.capacity["nic.nvidia.com"].bandwidth.compareTo(quantity("400Gi")) >= 0
```

Each request has a `name`, as a DRA `DeviceRequest` does. It's required, so future additions that reference a request by name (e.g. `constraints`) stay additive.
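
To illustrate why a required `name` keeps later additions additive, here is a sketch of what a by-name reference could look like. The `constraints` field, its placement under `nodeSelector`, and the `example.com/numaNode` attribute are all hypothetical and not part of this proposal; the shape is borrowed from DRA's `constraints[].requests` and `matchAttribute`.

```yaml
# Hypothetical future field, shown only to motivate required names.
nodeSelector:
  devices:
  - name: gpu
    count: 8
    selectors:
    - cel: device.driver == "gpu.nvidia.com"
  - name: nic
    count: 8
    selectors:
    - cel: device.driver == "nic.nvidia.com"
  constraints:                              # hypothetical; not proposed here
  - requests: [gpu, nic]                    # refers to the requests above by name
    matchAttribute: example.com/numaNode    # placeholder attribute name
```

If `name` were optional, a field like this would have to either make it required later (a breaking change) or invent another way to address requests.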

`selectors` is a list, as in DRA, where each entry is a one-of (today just `cel`). A list of CEL selectors is ANDed, equivalent to joining with `&&`, so it adds no matching expressiveness on its own. We mirror the shape so a future non-CEL selector kind can be added as a new one-of member without a breaking change.
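
As a small illustration of that equivalence, the two selectors below should match exactly the same devices. Both use only the attributes and capacity already shown in the examples above.

```yaml
# Two selector entries on one request are ANDed...
nodeSelector:
  devices:
  - name: gpu
    selectors:
    - cel: device.attributes["gpu.nvidia.com"].architecture == "Hopper"
    - cel: device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("141Gi")) >= 0
---
# ...so they match the same devices as one entry joined with &&.
nodeSelector:
  devices:
  - name: gpu
    selectors:
    - cel: |
        device.attributes["gpu.nvidia.com"].architecture == "Hopper" &&
        device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("141Gi")) >= 0
```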

Inside `devices[].selectors[].cel` it's real DRA CEL: `device.driver`, `device.attributes["domain"].name`, `device.capacity["domain"].name`, `quantity()`, `semver()`, and version attributes pre-parsed as semver (no wrapping). A DRA user's expressions transfer verbatim, and translation is mechanical: each request for a `claim: DRA` device becomes one `DeviceRequest` (count + selectors); requests for `claim: Synthetic` devices are dropped from the claim and enforced only by our scheduler.
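
A rough sketch of what that translation might produce for the maximal `nodeSelector` above, with the GPU expression shortened to two clauses. It assumes the `resource.k8s.io/v1beta1` request shape (newer API versions nest these fields differently), a hypothetical claim name, and a `gpu.nvidia.com` `DeviceClass`, which `v1beta1` requires but this proposal does not specify. Note that DRA wraps the expression as `cel.expression`, so the string is copied unchanged into that field.

```yaml
# Sketch only; field values marked below are assumptions.
apiVersion: resource.k8s.io/v1beta1     # assumption: v1beta1 request shape
kind: ResourceClaim
metadata:
  name: example-model-devices           # hypothetical name
spec:
  devices:
    requests:
    - name: gpu                         # copied from the nodeSelector request
      deviceClassName: gpu.nvidia.com   # assumption: not defined by this proposal
      allocationMode: ExactCount
      count: 8
      selectors:
      - cel:
          expression: |
            device.driver == "gpu.nvidia.com" &&
            device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("141Gi")) >= 0
    # No request is emitted for "nic": it targets a claim: Synthetic device,
    # so it stays out of the claim and is checked only by the Modelplane scheduler.
```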


