What happened?
A GPU node pool's minNodeCount/maxNodeCount have no effect on EKS, and the scheduler trusts maxNodeCount as if the pool will scale to it. Nothing on the workload cluster actually scales the node group, so a pool with nodeCount: 1, maxNodeCount: 4 only ever has one node, but the scheduler treats it as a four-node pool.
compose-eks-cluster maps the pool's counts straight onto the EKS managed node group's scaling config at fn.py:399:
scalingConfig=ngv1beta1.ScalingConfig(
    desiredSize=pool.nodeCount,
    minSize=pool.minNodeCount,
    maxSize=pool.maxNodeCount,
),
EKS treats minSize/maxSize as bounds for an autoscaler to move desiredSize between. But Modelplane installs no cluster-autoscaler or Karpenter on the workload cluster, so desiredSize (i.e. nodeCount) is the only count that ever takes effect. minNodeCount/maxNodeCount are inert.
This becomes a scheduling correctness problem because the InferenceCluster status reports the pool size as maxNodeCount, not the real node count. The XRD says so directly at inferenceclusters/definition.yaml:257:
nodes:
  type: integer
  description: >-
    Number of nodes in this pool. Derived from
    maxNodeCount (if autoscaling) or nodeCount.
So the ModelDeployment scheduler, which matches a gang against status.gpuPools[].nodes, will place a multi-node deployment on a pool that advertises enough nodes but can never produce them. The gang's extra pods stay Pending forever with no signal that the pool won't grow.
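The gap between the advertised and the reachable pool size can be sketched in a few lines of Python. This is an illustration only, not Modelplane's code: the `Pool` dataclass and both helper functions are hypothetical, and reading "if autoscaling" as "maxNodeCount is set" is an assumption based on the XRD description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pool:
    # Hypothetical stand-in for a nodePools entry; not a Modelplane type.
    name: str
    nodeCount: int
    minNodeCount: Optional[int] = None
    maxNodeCount: Optional[int] = None


def advertised_nodes(pool: Pool) -> int:
    """Pool size as the status field describes it.

    Assumption: "if autoscaling" means maxNodeCount is set.
    """
    return pool.maxNodeCount if pool.maxNodeCount is not None else pool.nodeCount


def reachable_nodes(pool: Pool, autoscaler_installed: bool) -> int:
    """Pool size the node group can actually reach."""
    if autoscaler_installed and pool.maxNodeCount is not None:
        return pool.maxNodeCount
    # Without an autoscaler, nothing moves desiredSize away from nodeCount.
    return pool.nodeCount


pool = Pool(name="gpu-l4", nodeCount=1, minNodeCount=0, maxNodeCount=4)
gang_size = 2

# The scheduler compares the gang against the advertised size ...
placed = advertised_nodes(pool) >= gang_size
# ... but only the reachable size determines whether the pods can run.
satisfiable = reachable_nodes(pool, autoscaler_installed=False) >= gang_size

print(placed, satisfiable)  # True False
```

With the pool from this report, the gang is placed (4 >= 2) but can never be satisfied (1 < 2), which matches the Pending worker pod.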
I hit the symptom directly while validating a multi-node (2-node gang) deployment: with nodeCount: 1 the worker pod could not schedule, and the node group never scaled up. Setting nodeCount: 2 fixed it; maxNodeCount: 4 did nothing.
0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 2 cannot allocate all claims. still not schedulable.
How can we reproduce it?
- Create an InferenceCluster on EKS with a GPU pool sized below a gang, relying on autoscaling to cover it:
nodePools:
  - name: gpu-l4
    className: eks-l4-1x-g6
    nodeCount: 1
    minNodeCount: 0
    maxNodeCount: 4
- Deploy a multi-node ModelDeployment whose gang needs 2 nodes (e.g. examples/deployment/model-deployment-multinode.yaml).
- The scheduler places the replica on gpu-l4 (its status.gpuPools[].nodes reports 4). The leader runs, but the worker pod stays Pending because the node group never scales past nodeCount: 1.
Workaround: set nodeCount to at least the gang size. The min/maxNodeCount fields don't substitute for it without an autoscaler.
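Until this is fixed, the workaround could be checked before applying a manifest. A minimal sketch, assuming the nodePools list has already been parsed into plain dicts; `pools_too_small` is a hypothetical helper, not part of Modelplane:

```python
def pools_too_small(node_pools, gang_size):
    """Return the names of pools whose fixed nodeCount cannot host the gang.

    minNodeCount/maxNodeCount are deliberately ignored: without an
    autoscaler on the workload cluster they never take effect.
    """
    return [p["name"] for p in node_pools if p.get("nodeCount", 0) < gang_size]


node_pools = [
    {
        "name": "gpu-l4",
        "className": "eks-l4-1x-g6",
        "nodeCount": 1,
        "minNodeCount": 0,
        "maxNodeCount": 4,
    },
]

print(pools_too_small(node_pools, gang_size=2))  # ['gpu-l4']
```

Raising nodeCount to 2 in the example pool makes the returned list empty.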
What environment did it happen in?
Modelplane version: combined branch of #150/#154/#155/#160/#161/#162/#163 off main 48700e6d
Crossplane version: v2.3.2
Inference backend: vLLM 0.11.0 (LeaderWorkerSet, 2-node gang)
Cloud / cluster: AWS EKS, Kubernetes v1.36, provider-aws-eks v2.5.0