EKS has no autoscaler installed #166

Description

@negz

What happened?

A GPU node pool's minNodeCount/maxNodeCount have no effect on EKS, and the scheduler trusts maxNodeCount as if the pool will scale to it. Nothing on the workload cluster actually scales the node group, so a pool with nodeCount: 1, maxNodeCount: 4 only ever has one node, but the scheduler treats it as a four-node pool.

compose-eks-cluster maps the pool's counts straight onto the EKS managed node group's scaling config at fn.py:399:

scalingConfig=ngv1beta1.ScalingConfig(
    desiredSize=pool.nodeCount,
    minSize=pool.minNodeCount,
    maxSize=pool.maxNodeCount,
),

EKS treats minSize/maxSize as bounds for an autoscaler to move desiredSize between. But Modelplane installs no cluster-autoscaler or Karpenter on the workload cluster, so desiredSize (i.e. nodeCount) is the only count that ever takes effect. minNodeCount/maxNodeCount are inert.
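To make that concrete, here is a minimal sketch of the node count a pool can actually reach, assuming nothing on the workload cluster ever adjusts desiredSize. `NodePool` and `reachable_nodes` are hypothetical names used only for illustration; they are not Modelplane code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodePool:
    """Hypothetical stand-in for a pool spec; not the real Modelplane type."""
    nodeCount: int
    minNodeCount: Optional[int] = None
    maxNodeCount: Optional[int] = None


def reachable_nodes(pool: NodePool, autoscaler_installed: bool) -> int:
    """Upper bound on the nodes this pool can ever have."""
    if autoscaler_installed and pool.maxNodeCount is not None:
        # An autoscaler may raise desiredSize up to maxSize.
        return pool.maxNodeCount
    # Without an autoscaler nothing moves desiredSize, so nodeCount is final.
    return pool.nodeCount


pool = NodePool(nodeCount=1, minNodeCount=0, maxNodeCount=4)
print(reachable_nodes(pool, autoscaler_installed=False))  # 1
print(reachable_nodes(pool, autoscaler_installed=True))   # 4
```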

This becomes a scheduling correctness problem because the InferenceCluster status reports the pool size as maxNodeCount, not the real node count. The XRD says so directly at inferenceclusters/definition.yaml:257:

nodes:
  type: integer
  description: >-
    Number of nodes in this pool. Derived from
    maxNodeCount (if autoscaling) or nodeCount.

So the ModelDeployment scheduler, which matches a gang against status.gpuPools[].nodes, will place a multi-node deployment on a pool that advertises enough nodes but can never produce them. The gang's extra pods stay Pending forever with no signal that the pool won't grow.
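The mismatch can be sketched as follows. This is not the real scheduler: `advertised_nodes` and `gang_fits` are hypothetical helpers, and the sketch assumes "if autoscaling" in the XRD description means maxNodeCount is set on the pool.

```python
def advertised_nodes(pool: dict) -> int:
    """Mirror the XRD description: maxNodeCount if set, otherwise nodeCount.

    Assumption: "if autoscaling" means maxNodeCount is present on the pool.
    """
    max_nodes = pool.get("maxNodeCount")
    return max_nodes if max_nodes is not None else pool["nodeCount"]


def gang_fits(nodes: int, gang_size: int) -> bool:
    """A gang fits when the pool has at least one node per gang member."""
    return gang_size <= nodes


pool = {"name": "gpu-l4", "nodeCount": 1, "minNodeCount": 0, "maxNodeCount": 4}
gang_size = 2

# What the scheduler sees in status.gpuPools[].nodes.
placed = gang_fits(advertised_nodes(pool), gang_size)
# What the node group can deliver with no autoscaler installed.
runnable = gang_fits(pool["nodeCount"], gang_size)

print(placed, runnable)  # True False
```

The gang is placed because the pool advertises four nodes, but it cannot run because only one node ever exists.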

I hit the symptom directly while validating a multi-node (2-node gang) deployment: with nodeCount: 1 the worker pod could not schedule, and the node group never scaled up. Setting nodeCount: 2 fixed it; maxNodeCount: 4 did nothing.

0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 2 cannot allocate all claims. still not schedulable.

How can we reproduce it?

  1. Create an InferenceCluster on EKS with a GPU pool sized below a gang, relying on autoscaling to cover it:
nodePools:
- name: gpu-l4
  className: eks-l4-1x-g6
  nodeCount: 1
  minNodeCount: 0
  maxNodeCount: 4
  2. Deploy a multi-node ModelDeployment whose gang needs 2 nodes (e.g. examples/deployment/model-deployment-multinode.yaml).
  3. The scheduler places the replica on gpu-l4 (its status.gpuPools[].nodes reports 4). The leader runs, but the worker pod stays Pending because the node group never scales past nodeCount: 1.

Workaround: set nodeCount to at least the gang size. Without an autoscaler, minNodeCount and maxNodeCount cannot substitute for it.
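Until an autoscaler is installed, a pre-deploy check along these lines could catch the problem early. It is only a sketch: `undersized_pools` is a hypothetical helper, and the pool dictionaries simply mirror the nodePools fields from the reproduction steps above.

```python
def undersized_pools(node_pools: list, gang_size: int) -> list:
    """Return the names of pools whose nodeCount cannot hold the gang."""
    return [p["name"] for p in node_pools if p["nodeCount"] < gang_size]


node_pools = [
    {"name": "gpu-l4", "className": "eks-l4-1x-g6",
     "nodeCount": 1, "minNodeCount": 0, "maxNodeCount": 4},
]

for name in undersized_pools(node_pools, gang_size=2):
    print(f"pool {name}: nodeCount is below the gang size; raise nodeCount")
```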

What environment did it happen in?

Modelplane version: combined branch of #150/#154/#155/#160/#161/#162/#163 off main 48700e6d
Crossplane version: v2.3.2
Inference backend: vLLM 0.11.0 (LeaderWorkerSet, 2-node gang)
Cloud / cluster: AWS EKS, Kubernetes v1.36, provider-aws-eks v2.5.0

Labels

bug: Something isn't working