EKS has no autoscaler installed

### What happened?

On EKS, a GPU node pool's `minNodeCount`/`maxNodeCount` have no effect, yet the scheduler trusts `maxNodeCount` as if the pool will scale to it. Nothing on the workload cluster actually scales the node group, so a pool with `nodeCount: 1, maxNodeCount: 4` only ever has one node while the scheduler treats it as a four-node pool.

`compose-eks-cluster` maps the pool's counts straight onto the EKS managed node group's scaling config at [`fn.py:399`](https://github.com/modelplaneai/modelplane/blob/main/functions/compose-eks-cluster/function/fn.py#L399):

```python
scalingConfig=ngv1beta1.ScalingConfig(
    desiredSize=pool.nodeCount,
    minSize=pool.minNodeCount,
    maxSize=pool.maxNodeCount,
),
```

EKS treats `minSize`/`maxSize` as bounds for an autoscaler to move `desiredSize` between. But Modelplane installs no cluster-autoscaler or Karpenter on the workload cluster, so `desiredSize` (i.e. `nodeCount`) is the only count that ever takes effect. `minNodeCount`/`maxNodeCount` are inert.

This becomes a scheduling correctness problem because the InferenceCluster status reports the pool size as `maxNodeCount`, not the real node count. The XRD says so directly at [`inferenceclusters/definition.yaml:257`](https://github.com/modelplaneai/modelplane/blob/main/apis/inferenceclusters/definition.yaml#L257):

```yaml
nodes:
  type: integer
  description: >-
    Number of nodes in this pool. Derived from
    maxNodeCount (if autoscaling) or nodeCount.
```

So the ModelDeployment scheduler, which matches a gang against `status.gpuPools[].nodes`, will place a multi-node deployment on a pool that advertises enough nodes but can never produce them. The gang's extra pods stay `Pending` forever with no signal that the pool won't grow.
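The mismatch can be sketched in a few lines of Python. This is only an illustration, not Modelplane code: `NodePool`, `advertised_nodes`, and `reachable_nodes` are hypothetical names, and reading "if autoscaling" as "`maxNodeCount` is set" is an assumption about how the status is derived.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodePool:
    # Hypothetical stand-in for the pool spec; field names mirror the issue.
    name: str
    nodeCount: int
    minNodeCount: Optional[int] = None
    maxNodeCount: Optional[int] = None


def advertised_nodes(pool: NodePool) -> int:
    """Node count as the XRD describes it: maxNodeCount if set, else nodeCount.

    Assumption: "if autoscaling" means maxNodeCount is present.
    """
    if pool.maxNodeCount is not None:
        return pool.maxNodeCount
    return pool.nodeCount


def reachable_nodes(pool: NodePool, autoscaler_installed: bool) -> int:
    """Node count the pool can actually reach.

    Without an autoscaler, only desiredSize (nodeCount) ever takes effect.
    """
    if autoscaler_installed and pool.maxNodeCount is not None:
        return pool.maxNodeCount
    return pool.nodeCount


pool = NodePool(name="gpu-l4", nodeCount=1, minNodeCount=0, maxNodeCount=4)
gang_size = 2

# The scheduler's view says the gang fits; the cluster's view says it does not.
fits_on_paper = advertised_nodes(pool) >= gang_size
fits_in_practice = reachable_nodes(pool, autoscaler_installed=False) >= gang_size
```

For the pool in this report, `fits_on_paper` is true while `fits_in_practice` is false, which is exactly the gap that leaves the worker pod `Pending`.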

I hit the symptom directly while validating a multi-node (2-node gang) deployment: with `nodeCount: 1` the worker pod could not schedule, and the node group never scaled up. Setting `nodeCount: 2` fixed it; `maxNodeCount: 4` did nothing.

```
0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 2 cannot allocate all claims. still not schedulable.
```

### How can we reproduce it?

1. Create an InferenceCluster on EKS with a GPU pool sized below a gang, relying on autoscaling to cover it:

```yaml
nodePools:
- name: gpu-l4
  className: eks-l4-1x-g6
  nodeCount: 1
  minNodeCount: 0
  maxNodeCount: 4
```

2. Deploy a multi-node ModelDeployment whose gang needs 2 nodes (e.g. `examples/deployment/model-deployment-multinode.yaml`).
3. The scheduler places the replica on `gpu-l4` (its `status.gpuPools[].nodes` reports 4). The leader runs, but the worker pod stays `Pending` because the node group never scales past `nodeCount: 1`.

Workaround: set `nodeCount` to at least the gang size. Without an autoscaler, `minNodeCount`/`maxNodeCount` do not substitute for it.
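Until an autoscaler is in place, a pre-flight check along these lines could catch undersized pools before deploying. It is a minimal sketch that assumes the `nodePools` list has already been loaded into plain dicts; `undersized_pools` is a hypothetical helper, not part of Modelplane.

```python
from typing import Dict, List


def undersized_pools(node_pools: List[Dict], gang_size: int) -> List[str]:
    """Return names of pools whose nodeCount alone cannot host the gang.

    maxNodeCount is deliberately ignored: with no autoscaler installed,
    the node group never grows past nodeCount.
    """
    return [
        pool["name"]
        for pool in node_pools
        if pool.get("nodeCount", 0) < gang_size
    ]


# The pool from the reproduction steps, as plain data.
repro_pools = [
    {
        "name": "gpu-l4",
        "className": "eks-l4-1x-g6",
        "nodeCount": 1,
        "minNodeCount": 0,
        "maxNodeCount": 4,
    }
]

too_small = undersized_pools(repro_pools, gang_size=2)
```

With the reproduction config, `too_small` contains `gpu-l4`; raising `nodeCount` to 2 empties the list.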

### What environment did it happen in?

Modelplane version: combined branch of #150/#154/#155/#160/#161/#162/#163 off main `48700e6d`
Crossplane version: v2.3.2
Inference backend: vLLM 0.11.0 (LeaderWorkerSet, 2-node gang)
Cloud / cluster: AWS EKS, Kubernetes v1.36, provider-aws-eks v2.5.0

