
MIG-aware allocation in InferenceEnvironment #53

Description

@bassam

What problem are you facing?

Modelplane has no awareness of MIG (NVIDIA Multi-Instance GPU). A platform engineer with a MIG-configured cluster has no way to declare it in an InferenceEnvironment, whether they configured it via GKE's gpu-partition-size flag, via EKS with the NVIDIA GPU Operator, or by hand via the MIG manager. The schema treats GPUs as whole-GPU exclusive allocations, so operators who want to run small models efficiently on expensive hardware can't.

A 27B model quantized to FP8 needs about 35-40GB. Running it on a whole H100 (80GB) wastes half the memory. An H100 partitioned into two 40GB MIG slices (profile 3g.40gb) could host two independent serving instances on one physical GPU, each with hardware-isolated memory and compute. Each tenant gets a dedicated allocation; nothing is shared. This is the right answer for a fleet serving lots of small models on Hopper-class hardware.

How could Modelplane help solve your problem?

MIG is a property of the node pool's GPU configuration, not of the model. The platform engineer decides on partitioning based on their workload mix and hardware. From the model's perspective, it just wants "enough memory and compute for this configuration." Whether that's a whole H100 or a 3g.40gb MIG slice is invisible to the model.

The cleanest place to express this is the node pool's capabilities block (introduced in #replica-topology-issue). Pools declare their per-allocation memory and let the deploy function handle matching uniformly.

For a MIG-partitioned pool, the platform engineer declares it like any other pool, with the per-slice memory as memoryPerGpu:

nodePools:
- name: small-models-mig
  acceleratorType: nvidia-h100-80gb
  acceleratorCount: 8
  nodeCount: 0
  maxNodeCount: 4
  labels:
    modelplane.ai/pool: small-mig
  capabilities:
    memoryPerGpu: "40Gi"          # per-slice, not per-physical-GPU
    interconnect: pcie            # MIG slices don't share NVLink
    precisionSupport: ["fp16", "bf16", "fp8"]
    multiNodeCapable: false
    nvlinkDomainSize: 1
    migConfiguration:
      enabled: true
      profile: "3g.40gb"

capabilities.memoryPerGpu is the per-slice memory (40Gi for 3g.40gb), not the underlying physical GPU memory (80Gi). The interconnect drops to pcie because MIG slices don't share an NVLink fabric: they sit on the same physical GPU but are isolated from each other. The migConfiguration block is documentation and is not used by matching at v0.1; it tells the operator and the catalog which MIG profile is in use.

The acceleratorCount field stays at the physical GPU count per node. When partitioning is enabled, the deploy function calculates available allocation units as acceleratorCount × slices_per_gpu, where slices_per_gpu comes from the MIG profile. For 3g.40gb on H100 it's 2 slices per GPU, so an 8-GPU node has 16 allocation units. For 1g.10gb it's 7 slices per GPU (56 allocation units per node).
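A minimal sketch of that calculation, assuming the pool is passed in as a plain dict shaped like the nodePools entry above. The lookup table and the allocation_units helper are hypothetical names for illustration, not existing Modelplane code; the slice counts are the maximum instance counts for H100 80GB MIG profiles.

```python
# Maximum MIG instances per physical GPU for H100 80GB profiles.
# Hypothetical lookup table; where Modelplane would source this is not
# specified in this issue.
H100_80GB_SLICES_PER_GPU = {
    "1g.10gb": 7,
    "1g.20gb": 4,
    "2g.20gb": 3,
    "3g.40gb": 2,
    "4g.40gb": 1,
    "7g.80gb": 1,
}


def allocation_units(pool: dict) -> int:
    """Return schedulable allocation units per node for a node pool."""
    gpus = pool["acceleratorCount"]
    mig = pool.get("capabilities", {}).get("migConfiguration", {})
    if not mig.get("enabled", False):
        # Whole-GPU allocation: one unit per physical GPU.
        return gpus
    # Partitioned: acceleratorCount x slices_per_gpu for the declared profile.
    return gpus * H100_80GB_SLICES_PER_GPU[mig["profile"]]


pool = {
    "acceleratorCount": 8,
    "capabilities": {
        "migConfiguration": {"enabled": True, "profile": "3g.40gb"},
    },
}
print(allocation_units(pool))  # 16
```

An unknown profile raises KeyError here, which seems preferable to silently guessing a slice count.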

When a Gemma 3 27B serving profile requests replicaTopology: { nodes: 1, gpusPerNode: 1, requires: { memoryPerGpu: 40Gi, precisionSupport: [fp8] }}, the deploy function matches the MIG pool on three checks: memoryPerGpu: 40Gi is satisfied (the slice has 40Gi); precisionSupport: [fp8] is satisfied (Hopper supports FP8 on the slice); and nodes × gpusPerNode = 1 allocation unit is needed, which the pool has plenty of.
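The matching described above could look roughly like the following. This is a sketch under stated assumptions: parse_gi and pool_satisfies are hypothetical helpers, only "Gi" quantities are handled, and the multi-node check is my guess at how multiNodeCapable would be used rather than something this issue specifies.

```python
def parse_gi(quantity: str) -> int:
    """Parse a quantity such as "40Gi" into a whole number of GiB."""
    if not quantity.endswith("Gi"):
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(quantity[:-2])


def pool_satisfies(capabilities: dict, topology: dict, units_per_node: int) -> bool:
    """Check a replicaTopology request against a pool's capabilities."""
    requires = topology.get("requires", {})

    # Per-allocation memory: for a MIG pool this is the per-slice memory.
    if "memoryPerGpu" in requires:
        have = parse_gi(capabilities["memoryPerGpu"])
        if have < parse_gi(requires["memoryPerGpu"]):
            return False

    # Every requested precision must be supported by the pool.
    wanted = set(requires.get("precisionSupport", []))
    if not wanted <= set(capabilities.get("precisionSupport", [])):
        return False

    # Assumption: multi-node topologies need a multiNodeCapable pool.
    if topology["nodes"] > 1 and not capabilities.get("multiNodeCapable", False):
        return False

    # Each node must offer enough allocation units for gpusPerNode.
    return topology["gpusPerNode"] <= units_per_node


mig_pool = {
    "memoryPerGpu": "40Gi",
    "interconnect": "pcie",
    "precisionSupport": ["fp16", "bf16", "fp8"],
    "multiNodeCapable": False,
}
gemma_request = {
    "nodes": 1,
    "gpusPerNode": 1,
    "requires": {"memoryPerGpu": "40Gi", "precisionSupport": ["fp8"]},
}
print(pool_satisfies(mig_pool, gemma_request, units_per_node=16))  # True
```

Because the pool only exposes per-allocation values, the same function covers whole-GPU and MIG pools without a MIG-specific branch.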

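The composed LLMInferenceService requests nvidia.com/mig-3g.40gb: 1 instead of nvidia.com/gpu: 1. The Modelplane place function knows from the pool's migConfiguration to use the correct device plugin resource name. Note that nvidia.com/mig-<profile> is the name the NVIDIA device plugin advertises under its mixed MIG strategy; under the single strategy, and on GKE, slices are still advertised as nvidia.com/gpu, so the resource name may need to be configurable per pool. Kubernetes schedules the pod onto a node where a slice is available; multiple pods from different deployments can occupy the other slices of the same physical GPU without interfering.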
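A sketch of how the place function might derive the resource name, assuming the device plugin advertises slices as nvidia.com/mig-<profile> as described above. gpu_resource_name and resource_limits are hypothetical helpers, and the returned dict only mirrors the shape of a container's resources.limits.

```python
def gpu_resource_name(capabilities: dict) -> str:
    """Pick the extended resource name for a pool's allocation unit."""
    mig = capabilities.get("migConfiguration", {})
    if mig.get("enabled", False):
        # Assumes slices are advertised per profile, e.g. nvidia.com/mig-3g.40gb.
        return f"nvidia.com/mig-{mig['profile']}"
    return "nvidia.com/gpu"


def resource_limits(capabilities: dict, gpus_per_node: int = 1) -> dict:
    """Build the resources.limits fragment for a serving container."""
    return {gpu_resource_name(capabilities): gpus_per_node}


mig_caps = {"migConfiguration": {"enabled": True, "profile": "3g.40gb"}}
print(resource_limits(mig_caps))  # {'nvidia.com/mig-3g.40gb': 1}
print(resource_limits({}))        # {'nvidia.com/gpu': 1}
```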

This works the same way for any MIG profile. A 2g.20gb pool serves 7B models. A 1g.10gb pool serves only the smallest models or development workloads. The schema is uniform; only the pool capabilities change.

For the v0.1 implementation, MIG support is BYO-only. The platform engineer configures MIG out of band (via gcloud container node-pools create --accelerator gpu-partition-size=... on GKE, via NVIDIA GPU Operator + MIG manager configmaps on EKS, or by any other means) and points Modelplane at the resulting pool via a node selector. Modelplane consumes the capacity as declared. The InferenceEnvironment doesn't need a source: GKE provisioning path that knows about MIG; that's v0.2+ work.

spec:
  cluster:
    source: Existing
    existing:
      secretRef:
        name: mig-cluster-kubeconfig
  
  nodePools:
  - name: small-models-mig
    nodeSelector:
      cloud.google.com/gke-gpu-partition-size: "3g.40gb"
    acceleratorType: nvidia-h100-80gb
    acceleratorCount: 8
    nodeCount: 0
    maxNodeCount: 8
    capabilities:
      memoryPerGpu: "40Gi"
      interconnect: pcie
      precisionSupport: ["fp16", "bf16", "fp8"]
      multiNodeCapable: false
      migConfiguration:
        enabled: true
        profile: "3g.40gb"

The nodeSelector field on the node pool tells Modelplane how to identify nodes belonging to this pool. For GKE-MIG nodes, the cloud.google.com/gke-gpu-partition-size label is applied automatically. For EKS via NVIDIA GPU Operator, the labels come from the GPU Operator's discovery. For BYO clusters with custom labeling, the platform engineer chooses whatever scheme they want. The place function uses the node selector when composing the LLMInferenceService template.
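To make the selector semantics concrete, here is a small sketch of equality-based label matching, the same rule Kubernetes applies to a pod's nodeSelector. The node names and the node_matches and pool_nodes helpers are made up for illustration.

```python
def node_matches(node_labels: dict, node_selector: dict) -> bool:
    """A node belongs to the pool if it carries every selector label."""
    return all(node_labels.get(k) == v for k, v in node_selector.items())


def pool_nodes(nodes: list, node_selector: dict) -> list:
    """Return the names of nodes that belong to the pool."""
    return [n["name"] for n in nodes if node_matches(n["labels"], node_selector)]


selector = {"cloud.google.com/gke-gpu-partition-size": "3g.40gb"}
nodes = [
    # Hypothetical nodes; only the first carries the matching GKE MIG label.
    {"name": "mig-node-a",
     "labels": {"cloud.google.com/gke-gpu-partition-size": "3g.40gb"}},
    {"name": "mig-node-b",
     "labels": {"cloud.google.com/gke-gpu-partition-size": "1g.10gb"}},
    {"name": "cpu-node-c", "labels": {}},
]
print(pool_nodes(nodes, selector))  # ['mig-node-a']
```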


Labels: Scheduling, enhancement
