What problem are you facing?
Modelplane has no awareness of MIG. A platform engineer with a MIG-configured cluster, whether it was set up via GKE's gpu-partition-size flag, via the NVIDIA GPU Operator on EKS, or by hand with the MIG manager, has no way to declare it in an InferenceEnvironment. The schema treats every GPU as a whole-GPU exclusive allocation, so operators who want to run small models efficiently on expensive hardware can't.
A 27B model quantized to FP8 needs about 35-40GB. Running it on a whole H100 (80GB) wastes half the memory. An H100 partitioned into 2× 40GB MIG slices could host two independent serving instances on one physical GPU, each with hardware-isolated memory and compute. Each tenant gets dedicated allocation; nothing is shared. This is the right answer for a fleet serving lots of small models on Hopper-class hardware.
How could Modelplane help solve your problem?
MIG is a property of the node pool's GPU configuration, not of the model. The platform engineer decides on partitioning based on their workload mix and hardware. From the model's perspective, it just wants "enough memory and compute for this configuration." Whether that's a whole H100 or a 3g.40gb MIG slice is invisible to the model.
The cleanest place to express this is the node pool's capabilities block (introduced in #replica-topology-issue). Pools declare their per-allocation memory and let the deploy function handle matching uniformly.
For a MIG-partitioned pool, the platform engineer declares it like any other pool, with the per-slice memory as memoryPerGpu:
nodePools:
  - name: small-models-mig
    acceleratorType: nvidia-h100-80gb
    acceleratorCount: 8
    nodeCount: 0
    maxNodeCount: 4
    labels:
      modelplane.ai/pool: small-mig
    capabilities:
      memoryPerGpu: "40Gi"   # per-slice, not per-physical-GPU
      interconnect: pcie     # MIG slices don't share NVLink
      precisionSupport: ["fp16", "bf16", "fp8"]
      multiNodeCapable: false
      nvlinkDomainSize: 1
    migConfiguration:
      enabled: true
      profile: "3g.40gb"
The capabilities.memoryPerGpu is the per-slice memory (40Gi for 3g.40gb), not the underlying physical GPU memory (80Gi). The interconnect drops to pcie because MIG slices don't share an NVLink fabric: they sit on the same physical GPU but are isolated from each other. The migConfiguration block is informational as far as capability matching is concerned at v0.1. Matching reads only capabilities, while migConfiguration tells the operator and the catalog which MIG profile is in use.
The acceleratorCount field stays at the physical GPU count per node. When partitioning is enabled, the deploy function calculates the available allocation units per node as acceleratorCount × slices_per_gpu, where slices_per_gpu comes from the MIG profile. For 3g.40gb on H100 that is 2 slices per GPU, so an 8-GPU node has 16 allocation units. For 1g.10gb it is 7 slices per GPU, so the same node has 56 allocation units.
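The allocation-unit arithmetic can be sketched in a few lines of Python. This is an illustration only, not Modelplane code: the helper name allocation_units_per_node and the lookup table are hypothetical, and the slice counts are the maximum instance counts NVIDIA publishes for MIG profiles on an 80GB H100.

```python
# Maximum MIG instances per physical GPU on an H100 80GB, by profile.
# Assumption: only this GPU type is covered; other GPUs need their own table.
H100_80GB_SLICES_PER_GPU = {
    "1g.10gb": 7,
    "1g.20gb": 4,
    "2g.20gb": 3,
    "3g.40gb": 2,
    "4g.40gb": 1,
    "7g.80gb": 1,
}


def allocation_units_per_node(accelerator_count, mig_configuration=None):
    """Return the schedulable allocation units on one node (hypothetical helper).

    Without MIG, each physical GPU is one unit. With MIG enabled, each
    physical GPU contributes slices_per_gpu units for the declared profile.
    """
    if not mig_configuration or not mig_configuration.get("enabled"):
        return accelerator_count
    profile = mig_configuration["profile"]
    if profile not in H100_80GB_SLICES_PER_GPU:
        raise ValueError(f"unknown MIG profile: {profile}")
    return accelerator_count * H100_80GB_SLICES_PER_GPU[profile]


mig_pool = {"enabled": True, "profile": "3g.40gb"}
print(allocation_units_per_node(8, mig_pool))  # 8 GPUs x 2 slices = 16
```

Keeping acceleratorCount physical and deriving the slice count from the profile means the pool declaration does not change shape when the partitioning does.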
When a Gemma 3 27B serving profile requests replicaTopology: { nodes: 1, gpusPerNode: 1, requires: { memoryPerGpu: 40Gi, precisionSupport: [fp8] }}, the deploy function matches the MIG pool: memoryPerGpu: 40Gi is satisfied (slice has 40Gi), precisionSupport: [fp8] is satisfied (Hopper supports FP8 on the slice), nodes × gpusPerNode = 1 allocation unit needed and the pool has plenty.
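The matching step described above might look roughly like the following. It is a minimal sketch under stated assumptions: pool_matches, pool_satisfies, and parse_quantity are hypothetical names, quantities are assumed to use whole-number binary suffixes such as "40Gi", and only the two requirements from the example (memoryPerGpu and precisionSupport) are checked.

```python
import re

_BINARY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(quantity):
    """Parse a quantity such as "40Gi" into bytes (binary suffixes only)."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti)", quantity)
    if match is None:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(match.group(1)) * _BINARY_UNITS[match.group(2)]


def pool_satisfies(capabilities, requires):
    """Check a pool's capabilities against a serving profile's requires block."""
    if "memoryPerGpu" in requires:
        have = parse_quantity(capabilities["memoryPerGpu"])
        need = parse_quantity(requires["memoryPerGpu"])
        if have < need:
            return False
    needed_precisions = set(requires.get("precisionSupport", []))
    return needed_precisions <= set(capabilities.get("precisionSupport", []))


def pool_matches(capabilities, available_units, replica_topology):
    """True when the pool meets the requirements and has enough allocation units."""
    units_needed = replica_topology["nodes"] * replica_topology["gpusPerNode"]
    requires = replica_topology.get("requires", {})
    return pool_satisfies(capabilities, requires) and available_units >= units_needed


mig_capabilities = {
    "memoryPerGpu": "40Gi",
    "precisionSupport": ["fp16", "bf16", "fp8"],
}
gemma_topology = {
    "nodes": 1,
    "gpusPerNode": 1,
    "requires": {"memoryPerGpu": "40Gi", "precisionSupport": ["fp8"]},
}
print(pool_matches(mig_capabilities, 16, gemma_topology))  # True
```

Because the pool reports per-slice memory, the same comparison works for whole-GPU and MIG pools without a MIG-specific branch.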
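The composed LLMInferenceService requests nvidia.com/mig-3g.40gb: 1 instead of nvidia.com/gpu: 1. The Modelplane place function knows from the pool's migConfiguration to use the correct device plugin resource name. (The per-profile name is what the NVIDIA device plugin advertises under its mixed MIG strategy. Under the single strategy, and on GKE's managed MIG node pools, slices are still advertised as nvidia.com/gpu, so the resource name has to follow the cluster's device plugin configuration.) Kubernetes schedules the pod onto a node where a slice is available; pods from different deployments can occupy the other slices of the same physical GPU without interfering with each other.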
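Deriving the resource name from the pool could be as small as the sketch below. The function name gpu_resource_limits is hypothetical, and the nvidia.com/mig-<profile> form is an assumption that the cluster's NVIDIA device plugin advertises per-profile resources; a cluster that advertises slices as plain nvidia.com/gpu would need a different rule.

```python
def gpu_resource_limits(gpus_per_node, mig_configuration=None):
    """Build the container resource limits for one replica (hypothetical helper).

    Assumption: MIG slices are advertised as nvidia.com/mig-<profile>.
    """
    if mig_configuration and mig_configuration.get("enabled"):
        resource_name = f"nvidia.com/mig-{mig_configuration['profile']}"
    else:
        resource_name = "nvidia.com/gpu"
    return {resource_name: gpus_per_node}


print(gpu_resource_limits(1, {"enabled": True, "profile": "3g.40gb"}))
# {'nvidia.com/mig-3g.40gb': 1}
print(gpu_resource_limits(1))
# {'nvidia.com/gpu': 1}
```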
This works the same way for any MIG profile. A 2g.20gb pool serves 7B models. A 1g.10gb pool serves only the smallest models or development workloads. The schema is uniform; only the pool capabilities change.
For the v0.1 implementation, MIG support is BYO-only. The platform engineer configures MIG out of band — via gcloud container node-pools create --accelerator gpu-partition-size=... on GKE, or via NVIDIA GPU Operator + MIG manager configmaps on EKS, or however else — and points Modelplane at the resulting pool via a node selector. Modelplane consumes the capacity as declared. The InferenceEnvironment doesn't need a source: GKE provisioning path that knows about MIG; that's v0.2+ work.
spec:
  cluster:
    source: Existing
    existing:
      secretRef:
        name: mig-cluster-kubeconfig
  nodePools:
    - name: small-models-mig
      nodeSelector:
        cloud.google.com/gke-gpu-partition-size: "3g.40gb"
      acceleratorType: nvidia-h100-80gb
      acceleratorCount: 8
      nodeCount: 0
      maxNodeCount: 8
      capabilities:
        memoryPerGpu: "40Gi"
        interconnect: pcie
        precisionSupport: ["fp16", "bf16", "fp8"]
        multiNodeCapable: false
      migConfiguration:
        enabled: true
        profile: "3g.40gb"
The nodeSelector field on the node pool tells Modelplane how to identify nodes belonging to this pool. For GKE-MIG nodes, the cloud.google.com/gke-gpu-partition-size label is automatically applied. For EKS via NVIDIA GPU Operator, the labels come from the GPU operator's discovery. For BYO clusters with custom labeling, the platform engineer chooses whatever scheme they want. The placement function uses the node selector when composing the LLMInferenceService template.
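Pool membership by node selector follows the usual Kubernetes rule: a node belongs to the pool when every key/value pair in the selector appears in the node's labels. A minimal sketch, with the hypothetical helper node_belongs_to_pool and made-up node label sets:

```python
def node_belongs_to_pool(node_labels, node_selector):
    """True when every selector key/value pair is present on the node."""
    return all(
        node_labels.get(key) == value for key, value in node_selector.items()
    )


selector = {"cloud.google.com/gke-gpu-partition-size": "3g.40gb"}

# Illustrative label sets only; real nodes carry many more labels.
mig_node = {
    "cloud.google.com/gke-gpu-partition-size": "3g.40gb",
    "kubernetes.io/arch": "amd64",
}
whole_gpu_node = {"kubernetes.io/arch": "amd64"}

print(node_belongs_to_pool(mig_node, selector))        # True
print(node_belongs_to_pool(whole_gpu_node, selector))  # False
```

The same selector dictionary can be copied verbatim into the pod template's nodeSelector, which is why the label scheme can be left to the platform engineer.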