nodeSelector should describe hardware as a list of DRA devices #103

Description

@negz

What problem are you facing?

A ModelDeployment's spec.nodeSelector.cel is a single CEL expression matched against a pool's merged attributes and capacity. It is written in a flat dialect: attributes["gpu.nvidia.com/cudaComputeCapability"]... and capacity["gpu.nvidia.com/memory"].... The design says we'll translate it into a DRA ResourceClaim as "a straightforward split of domain/name into device.attributes["domain"].name."
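For reference, the key split the design describes could look roughly like the sketch below. The function names are hypothetical, and it only rewrites a single lookup; it does not touch the version(...) wrapping or any other part of the expression.

```python
def split_qualified_key(key: str) -> tuple[str, str]:
    """Split a flat 'domain/name' key into (domain, name)."""
    domain, sep, name = key.rpartition("/")
    if not sep or not domain or not name:
        raise ValueError(f"expected 'domain/name', got {key!r}")
    return domain, name


def to_dra_path(kind: str, key: str) -> str:
    """Render a flat attributes[...] or capacity[...] lookup as a DRA-style CEL path."""
    if kind not in ("attributes", "capacity"):
        raise ValueError(f"unknown kind {kind!r}")
    domain, name = split_qualified_key(key)
    return f'device.{kind}["{domain}"].{name}'


if __name__ == "__main__":
    print(to_dra_path("attributes", "gpu.nvidia.com/cudaComputeCapability"))
    print(to_dra_path("capacity", "gpu.nvidia.com/memory"))
```

The split itself is simple for a single key. The problems described next are about everything around it.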

Two problems.

It looks like DRA but doesn't work like DRA. The flat attributes[...]/capacity[...] shape, the version(...) wrapping, and the version() (vs DRA's semver()) naming all diverge from real DRA CEL. A user familiar with DRA will write device.attributes["gpu.nvidia.com"].architecture and .isGreaterThan(semver("9.0.0")) and be surprised when it doesn't work. The design already leans hard on DRA (KEP-4381, typed attributes, qualified keys), so the half-resemblance is the worst of both worlds.

It can't describe more than one device. A node can have a GPU, an InfiniBand NIC, and potentially other accelerators. In real DRA these are distinct devices, often from distinct drivers, and a ResourceClaim makes one request per device. You can't express "a GPU like X and a NIC like Y" in a single selector. Our flat model collapses everything onto one synthetic device, so a platform team can't fully describe an InferenceClass's hardware, an ML team can't filter to a pool that has a specific GPU and NIC, and the ResourceClaim translation isn't actually mechanical for multi-device requirements.

The modelplane.ai/* fleet-attribute convention (e.g. modelplane.ai/networkInterNode: infiniband) was a workaround for this: inter-node networking had nowhere to live except a synthetic pool attribute, distinguished from real device attributes only by a key prefix.

GPUs are the only hardware real DRA drivers expose as devices today (NVIDIA's driver exposes GPU and ComputeDomain; InfiniBand/RDMA is still handled via SR-IOV/device-plugin, not DRA). So a NIC "device" is partly aspirational. But unlike DRA, we don't depend on a real driver publishing facts: the platform team authors the InferenceClass by hand. We can model InfiniBand as a device even where no DRA driver does. The constraint that replaces "did a driver publish this" is "is this device claimed via DRA, or synthetic and enforced only by our scheduler."

How could Modelplane help solve your problem?

Make InferenceClass and nodeSelector describe hardware as a list of DRA-style devices, with CEL that is DRA CEL.

InferenceClass declares an array of devices, each with attributes and capacity (DRA's typed schema) and a driver. A claim discriminator says how the device is claimed: DRA (default; emitted as a request in a real ResourceClaim) or Synthetic (described for scheduling only, never claimed). It's an enum so future claim mechanisms (device plugin, extended resource) can be added without a breaking change.

apiVersion: modelplane.ai/v1alpha1
kind: InferenceClass
metadata:
  name: gke-h200-8x-a3-ib
spec:
  provisioning: { ... }   # unchanged
  devices:
  - name: gpu
    claim: DRA                      # default; emitted as a DeviceRequest in the ResourceClaim
    driver: gpu.nvidia.com
    count: 8
    attributes:
      architecture: { string: Hopper }
      cudaComputeCapability: { version: "9.0.0" }
    capacity:
      memory: { value: "141Gi" }
  - name: ib
    claim: Synthetic                # described for scheduling only; not in the ResourceClaim
    driver: nic.nvidia.com          # no real DRA driver yet; we author it anyway
    count: 8
    attributes:
      linkType: { string: infiniband }
    capacity:
      bandwidth: { value: "400Gi" }

This replaces the modelplane.ai/* convention. The real/synthetic boundary becomes an explicit per-device claim discriminator instead of a key-prefix heuristic.
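A minimal sketch of how the claim discriminator might be modeled, assuming a plain in-memory representation. The Device and Claim types and the partition_by_claim helper are hypothetical names for illustration, not part of any existing Modelplane API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Claim(Enum):
    """How a device is claimed. An enum leaves room for future mechanisms."""
    DRA = "DRA"
    SYNTHETIC = "Synthetic"


@dataclass
class Device:
    name: str
    driver: str
    count: int = 1
    claim: Claim = Claim.DRA  # DRA is the default in the proposal
    attributes: dict = field(default_factory=dict)
    capacity: dict = field(default_factory=dict)


def partition_by_claim(devices):
    """Split devices into (claimed via DRA, synthetic scheduler-only)."""
    claimed = [d for d in devices if d.claim is Claim.DRA]
    synthetic = [d for d in devices if d.claim is Claim.SYNTHETIC]
    return claimed, synthetic


if __name__ == "__main__":
    devices = [
        Device("gpu", "gpu.nvidia.com", count=8),
        Device("ib", "nic.nvidia.com", count=8, claim=Claim.SYNTHETIC),
    ]
    claimed, synthetic = partition_by_claim(devices)
    print([d.name for d in claimed], [d.name for d in synthetic])
```

Because the boundary is a field on the device, no code has to inspect key prefixes to decide what ends up in the ResourceClaim.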

nodeSelector becomes a list of device requests, each with its own DRA CEL selectors and count, mirroring a ResourceClaim's requests:

# Minimal: one GPU, >= 80Gi.
nodeSelector:
  devices:
  - name: gpu
    selectors:
    - cel: device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("80Gi")) >= 0
# Maximal: 8 Hopper GPUs >= 141Gi, and 8 InfiniBand NICs >= 400Gb/s.
nodeSelector:
  devices:
  - name: gpu
    count: 8
    selectors:
    - cel: |
        device.driver == "gpu.nvidia.com" &&
        device.attributes["gpu.nvidia.com"].architecture == "Hopper" &&
        device.attributes["gpu.nvidia.com"].cudaComputeCapability.compareTo(semver("9.0.0")) >= 0 &&
        device.capacity["gpu.nvidia.com"].memory.compareTo(quantity("141Gi")) >= 0
  - name: nic
    count: 8
    selectors:
    - cel: |
        device.driver == "nic.nvidia.com" &&
        device.attributes["nic.nvidia.com"].linkType == "infiniband" &&
        device.capacity["nic.nvidia.com"].bandwidth.compareTo(quantity("400Gi")) >= 0

Each request has a name, as a DRA DeviceRequest does. It's required, so future additions that reference a request by name (e.g. constraints) stay additive.
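To illustrate how per-device requests might be checked against an InferenceClass, here is a rough sketch. It makes several assumptions that the proposal does not spell out: selectors are stood in for by Python callables rather than evaluated CEL, a request is satisfied when some declared device has at least the requested count and passes every selector, and each request is checked independently of the others. The dictionary layout and the pool_matches name are hypothetical.

```python
def pool_matches(class_devices, requests):
    """Return True if every request is satisfied by at least one declared device."""
    for req in requests:
        want = req.get("count", 1)
        satisfied = any(
            dev.get("count", 1) >= want
            and all(selector(dev) for selector in req["selectors"])
            for dev in class_devices
        )
        if not satisfied:
            return False
    return True


if __name__ == "__main__":
    # Capacities are plain integers here purely for illustration.
    class_devices = [
        {"driver": "gpu.nvidia.com", "count": 8, "architecture": "Hopper", "memory_gi": 141},
        {"driver": "nic.nvidia.com", "count": 8, "linkType": "infiniband"},
    ]
    requests = [
        {"name": "gpu", "count": 8, "selectors": [
            lambda d: d["driver"] == "gpu.nvidia.com",
            lambda d: d.get("memory_gi", 0) >= 141,
        ]},
        {"name": "nic", "count": 8, "selectors": [
            lambda d: d["driver"] == "nic.nvidia.com",
            lambda d: d.get("linkType") == "infiniband",
        ]},
    ]
    print(pool_matches(class_devices, requests))
```

The point is only that "a GPU like X and a NIC like Y" becomes two independent checks, which the single flat selector cannot express.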

selectors is a list, as in DRA, where each entry is a one-of (today just cel). A list of CEL selectors is ANDed, equivalent to joining with &&, so it adds no matching expressiveness on its own. We mirror the shape so a future non-CEL selector kind can be added as a new one-of member without a breaking change.
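The equivalence between a selector list and a single joined expression can be sketched as follows. The join_selectors helper is a hypothetical name. Each expression is parenthesized so that an || inside one selector cannot change meaning once the selectors are combined.

```python
def join_selectors(expressions):
    """Combine a list of CEL selector strings into one ANDed expression."""
    if not expressions:
        raise ValueError("at least one selector is required")
    return " && ".join(f"({expr.strip()})" for expr in expressions)


if __name__ == "__main__":
    print(join_selectors([
        'device.driver == "gpu.nvidia.com"',
        'device.attributes["gpu.nvidia.com"].architecture == "Hopper"',
    ]))
```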

Inside devices[].selectors[].cel, expressions are real DRA CEL: device.driver, device.attributes["domain"].name, device.capacity["domain"].name, quantity(), semver(), and version attributes pre-parsed as semver (no wrapping). A DRA user's expressions transfer verbatim, and translation is mechanical: one claim: DRA device becomes one DeviceRequest (count + selector); claim: Synthetic devices are dropped and enforced by our scheduler.
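A sketch of what that mechanical translation might look like, under stated assumptions. It emits a dictionary in the shape of a resource.k8s.io/v1beta1 ResourceClaim (other API versions arrange the request fields differently). It assumes each request has already been resolved to the claim mode of the InferenceClass device that satisfies it, and it takes the device class name as a parameter because the proposal does not say how that would be chosen. The to_resource_claim name and its input layout are hypothetical.

```python
def to_resource_claim(name, resolved, device_class_name):
    """Build a ResourceClaim-shaped dict from (request, claim_mode) pairs.

    resolved: list of (request, claim_mode), where request is a dict with
    'name', optional 'count', and 'selectors' (a list of CEL strings), and
    claim_mode is "DRA" or "Synthetic".
    """
    requests = []
    for request, claim_mode in resolved:
        if claim_mode != "DRA":
            continue  # Synthetic devices are enforced by the scheduler only
        requests.append({
            "name": request["name"],
            "deviceClassName": device_class_name,  # assumption: supplied by the caller
            "allocationMode": "ExactCount",
            "count": request.get("count", 1),
            "selectors": [{"cel": {"expression": s}} for s in request["selectors"]],
        })
    return {
        "apiVersion": "resource.k8s.io/v1beta1",
        "kind": "ResourceClaim",
        "metadata": {"name": name},
        "spec": {"devices": {"requests": requests}},
    }


if __name__ == "__main__":
    gpu = {"name": "gpu", "count": 8,
           "selectors": ['device.driver == "gpu.nvidia.com"']}
    nic = {"name": "nic", "count": 8,
           "selectors": ['device.driver == "nic.nvidia.com"']}
    claim = to_resource_claim("example", [(gpu, "DRA"), (nic, "Synthetic")],
                              "gpu.nvidia.com")
    print(claim["spec"]["devices"]["requests"])
```

The CEL strings pass through untouched, which is the point of making the selector dialect real DRA CEL.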
