MIG-aware allocation in InferenceEnvironment #53

### What problem are you facing?

Modelplane has no awareness of MIG (NVIDIA Multi-Instance GPU). A platform engineer with a MIG-configured cluster, whether configured via GKE's `gpu-partition-size` flag, EKS with the NVIDIA GPU Operator, or by hand via the MIG manager, has no way to declare it in an InferenceEnvironment. The schema treats every GPU as a whole-GPU exclusive allocation, so operators who want to run small models efficiently on expensive hardware can't.

A 27B model quantized to FP8 needs about 35-40GB. Running it on a whole H100 (80GB) wastes half the memory. An H100 partitioned into 2× 40GB MIG slices could host two independent serving instances on one physical GPU, each with hardware-isolated memory and compute. Each tenant gets a dedicated allocation; nothing is shared. This is the right answer for a fleet serving lots of small models on Hopper-class hardware.

### How could Modelplane help solve your problem?

MIG is a property of the node pool's GPU configuration, not of the model. The platform engineer decides on partitioning based on their workload mix and hardware. From the model's perspective, it just wants "enough memory and compute for this configuration." Whether that's a whole H100 or a 3g.40gb MIG slice is invisible to the model.

The cleanest place to express this is the node pool's `capabilities` block (introduced in #replica-topology-issue). Pools declare their per-allocation memory and let the deploy function handle matching uniformly.

For a MIG-partitioned pool, the platform engineer declares it like any other pool, with the per-slice memory as `memoryPerGpu`:

```yaml
nodePools:
- name: small-models-mig
  acceleratorType: nvidia-h100-80gb
  acceleratorCount: 8
  nodeCount: 0
  maxNodeCount: 4
  labels:
    modelplane.ai/pool: small-mig
  capabilities:
    memoryPerGpu: "40Gi"          # per-slice, not per-physical-GPU
    interconnect: pcie            # MIG slices don't share NVLink
    precisionSupport: ["fp16", "bf16", "fp8"]
    multiNodeCapable: false
    nvlinkDomainSize: 1
    migConfiguration:
      enabled: true
      profile: "3g.40gb"
```

The `capabilities.memoryPerGpu` value is the per-slice memory (40Gi for `3g.40gb`), not the underlying physical GPU memory (80Gi). The interconnect drops to `pcie` because MIG slices don't share an NVLink fabric: they sit on the same physical GPU but are isolated from one another. The `migConfiguration` block is not used by capability matching at v0.1; it records which MIG profile is in use for the operator and the catalog, and the deploy and place functions read it to count slices and to pick the device plugin resource name.

The `acceleratorCount` field stays at the physical GPU count per node. When partitioning is enabled, the deploy function calculates available allocation units as `acceleratorCount × slices_per_gpu`, based on the MIG profile. For `3g.40gb` on H100 it's 2 slices per GPU, so a node with 8 GPUs has 16 allocation units. For `1g.10gb` it's 7 slices per GPU (56 allocation units per node).
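The allocation-unit arithmetic above can be sketched as follows. This is a minimal illustration, not the deploy function itself: the function name `allocation_units` and the hard-coded lookup table are assumptions, and the table only covers a few H100 80GB profiles.

```python
# Slices per physical GPU for some H100 80GB MIG profiles.
# Assumption: this table is illustrative and hard-coded; a real
# implementation would need one per accelerator type.
SLICES_PER_GPU_H100_80GB = {
    "1g.10gb": 7,
    "2g.20gb": 3,
    "3g.40gb": 2,
    "7g.80gb": 1,
}


def allocation_units(accelerator_count, mig_configuration=None):
    """Return schedulable allocation units per node.

    accelerator_count is the physical GPU count per node. When MIG is
    absent or disabled, each physical GPU counts as one allocation unit.
    """
    if not mig_configuration or not mig_configuration.get("enabled", False):
        return accelerator_count
    profile = mig_configuration["profile"]
    if profile not in SLICES_PER_GPU_H100_80GB:
        raise ValueError(f"unknown MIG profile: {profile}")
    return accelerator_count * SLICES_PER_GPU_H100_80GB[profile]


# The pool from the example: 8 physical GPUs, each split into 2 slices.
units = allocation_units(8, {"enabled": True, "profile": "3g.40gb"})
```

For the `small-models-mig` pool this yields 16 units per node. A pool without a `migConfiguration` block falls through to the physical GPU count, so non-MIG pools need no special handling.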

When a Gemma 3 27B serving profile requests `replicaTopology: { nodes: 1, gpusPerNode: 1, requires: { memoryPerGpu: 40Gi, precisionSupport: [fp8] }}`, the deploy function matches the MIG pool: `memoryPerGpu: 40Gi` is satisfied (each slice has 40Gi), `precisionSupport: [fp8]` is satisfied (Hopper supports FP8 on the slice), and the request needs `nodes × gpusPerNode = 1` allocation unit, of which the pool has plenty.
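The three checks in that match could look roughly like this. It is a sketch under assumptions: `pool_matches` and `parse_quantity` are hypothetical names, only integer binary-suffixed quantities are parsed, and the available unit count is passed in rather than derived from the pool.

```python
import re

_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(quantity):
    """Parse a binary-suffixed quantity such as "40Gi" into bytes."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti)", quantity)
    if match is None:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def pool_matches(capabilities, topology, units_available):
    """Check a pool's capabilities against a replicaTopology request."""
    requires = topology.get("requires", {})

    # 1. Per-allocation memory: the slice must be at least as large.
    needed_memory = requires.get("memoryPerGpu")
    if needed_memory is not None:
        have = parse_quantity(capabilities["memoryPerGpu"])
        if have < parse_quantity(needed_memory):
            return False

    # 2. Every requested precision must be supported by the pool.
    needed = set(requires.get("precisionSupport", []))
    if not needed <= set(capabilities.get("precisionSupport", [])):
        return False

    # 3. Enough allocation units for nodes x gpusPerNode.
    units_needed = topology["nodes"] * topology["gpusPerNode"]
    return units_needed <= units_available


mig_pool = {
    "memoryPerGpu": "40Gi",
    "precisionSupport": ["fp16", "bf16", "fp8"],
}
gemma_request = {
    "nodes": 1,
    "gpusPerNode": 1,
    "requires": {"memoryPerGpu": "40Gi", "precisionSupport": ["fp8"]},
}
matched = pool_matches(mig_pool, gemma_request, units_available=16)
```

Nothing in the matching path mentions MIG: a slice and a whole GPU are compared through the same `memoryPerGpu` and `precisionSupport` fields, which is the uniformity the `capabilities` block is meant to provide.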

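The composed LLMInferenceService requests `nvidia.com/mig-3g.40gb: 1` instead of `nvidia.com/gpu: 1`. The Modelplane place function uses the pool's `migConfiguration` to pick the correct device plugin resource name. This assumes the NVIDIA device plugin runs with its `mixed` MIG strategy, which advertises per-profile resources; under the `single` strategy, and on GKE, slices are still advertised as `nvidia.com/gpu`, and the node selector alone steers pods to the partitioned nodes. Kubernetes schedules the pod onto a node where a slice is available; pods from different deployments can occupy the other slices of the same physical GPU without interfering.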
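A minimal sketch of the resource-name selection, assuming the device plugin advertises per-profile resources named `nvidia.com/mig-<profile>` (the NVIDIA device plugin's `mixed` MIG strategy). The helper names `gpu_resource_name` and `resource_limits` are hypothetical, not part of Modelplane.

```python
def gpu_resource_name(capabilities):
    """Pick the extended resource name for a pool's allocation unit.

    Assumption: MIG pools expose per-profile resources; any pool
    without an enabled migConfiguration uses the whole-GPU resource.
    """
    mig = capabilities.get("migConfiguration") or {}
    if mig.get("enabled") and mig.get("profile"):
        return f"nvidia.com/mig-{mig['profile']}"
    return "nvidia.com/gpu"


def resource_limits(capabilities, gpus_per_node):
    """Build the resource limits entry for one replica's container."""
    return {gpu_resource_name(capabilities): gpus_per_node}


mig_capabilities = {
    "memoryPerGpu": "40Gi",
    "migConfiguration": {"enabled": True, "profile": "3g.40gb"},
}
limits = resource_limits(mig_capabilities, gpus_per_node=1)
```

Here `limits` is `{"nvidia.com/mig-3g.40gb": 1}`, while a pool with no `migConfiguration` block produces `{"nvidia.com/gpu": 1}`.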

This works the same way for any MIG profile. A `2g.20gb` pool serves 7B models. A `1g.10gb` pool serves only the smallest models or development workloads. The schema is uniform; only the pool capabilities change.

For the v0.1 implementation, MIG support is BYO-only. The platform engineer configures MIG out of band (via `gcloud container node-pools create --accelerator gpu-partition-size=...` on GKE, via NVIDIA GPU Operator + MIG manager configmaps on EKS, or by any other means) and points Modelplane at the resulting pool via a node selector. Modelplane consumes the capacity as declared. The InferenceEnvironment doesn't need a `source: GKE` provisioning path that knows about MIG; that's v0.2+ work.

```yaml
spec:
  cluster:
    source: Existing
    existing:
      secretRef:
        name: mig-cluster-kubeconfig
  
  nodePools:
  - name: small-models-mig
    nodeSelector:
      cloud.google.com/gke-gpu-partition-size: "3g.40gb"
    acceleratorType: nvidia-h100-80gb
    acceleratorCount: 8
    nodeCount: 0
    maxNodeCount: 8
    capabilities:
      memoryPerGpu: "40Gi"
      interconnect: pcie
      precisionSupport: ["fp16", "bf16", "fp8"]
      multiNodeCapable: false
      migConfiguration:
        enabled: true
        profile: "3g.40gb"
```

The `nodeSelector` field on the node pool tells Modelplane how to identify nodes belonging to this pool. For GKE-MIG nodes, the `cloud.google.com/gke-gpu-partition-size` label is applied automatically. On EKS with the NVIDIA GPU Operator, the labels come from the operator's GPU discovery. For BYO clusters with custom labeling, the platform engineer chooses whatever scheme they want. The place function uses the node selector when composing the LLMInferenceService template.
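As a rough illustration of that last step, the sketch below copies a pool's `nodeSelector` into a plain Kubernetes pod spec fragment. The real LLMInferenceService template shape is not shown in this issue, so the function name `pod_spec_fragment`, the container name `server`, and the fragment layout are all assumptions.

```python
def pod_spec_fragment(pool, resource_name, count):
    """Compose the scheduling-related part of a pod spec for a pool.

    Assumption: the serving template embeds an ordinary pod spec, so the
    pool's nodeSelector and the accelerator limit can be set directly.
    """
    return {
        # Copy so later edits to the fragment don't mutate the pool.
        "nodeSelector": dict(pool.get("nodeSelector", {})),
        "containers": [
            {
                "name": "server",  # hypothetical container name
                "resources": {"limits": {resource_name: count}},
            }
        ],
    }


pool = {
    "name": "small-models-mig",
    "nodeSelector": {"cloud.google.com/gke-gpu-partition-size": "3g.40gb"},
}
fragment = pod_spec_fragment(pool, "nvidia.com/mig-3g.40gb", 1)
```

The resulting fragment pins the pod to nodes carrying the pool's labels and requests one allocation unit, leaving the choice of labeling scheme entirely to the platform engineer.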
