Autoscale EKS GPU node pools with the cluster autoscaler by negz · Pull Request #183 · modelplaneai/modelplane

negz · 2026-06-17T23:26:41Z

Description of your changes

Fixes #166.

Closes #173.

The fleet scheduler treats InferenceCluster.status.gpuPools[].nodes as the node headroom it may place ModelReplicas against. For every cluster source, that value is the pool's maxNodeCount. On GKE that holds: the managed control plane autoscales node pools up to maxNodeCount on demand. On EKS it didn't. We compose the managed node group with a scalingConfig but install nothing to scale within it, so only the realized nodeCount ever materializes. The scheduler, trusting maxNodeCount, places gangs onto nodes that never appear, and the pods hang Pending forever.
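The gap between advertised and reachable headroom can be illustrated with a small sketch. This is not the fleet scheduler's code: `GpuPool`, its `autoscaled` field, the helper names, and the one-replica-per-node simplification are all hypothetical and exist only to make the arithmetic concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuPool:
    """Hypothetical stand-in for one entry of status.gpuPools[]."""

    name: str
    node_count: int      # nodes that exist right now
    max_node_count: int  # ceiling published as node headroom
    autoscaled: bool     # whether anything actually scales the pool


def advertised_headroom(pool: GpuPool) -> int:
    """Headroom the scheduler trusts: always the pool's maxNodeCount."""
    return pool.max_node_count


def reachable_headroom(pool: GpuPool) -> int:
    """Nodes that can really exist: the max only if something scales the pool."""
    return pool.max_node_count if pool.autoscaled else pool.node_count


def stranded_replicas(pool: GpuPool, replicas: int) -> int:
    """Replicas placed against advertised headroom that never get a node.

    Assumes one replica per node, purely to keep the example simple.
    """
    placed = min(replicas, advertised_headroom(pool))
    return max(0, placed - reachable_headroom(pool))


eks_without_autoscaler = GpuPool("gpu", node_count=1, max_node_count=4, autoscaled=False)
eks_with_autoscaler = GpuPool("gpu", node_count=1, max_node_count=4, autoscaled=True)
```

Under these assumptions, `stranded_replicas(eks_without_autoscaler, 4)` is 3 (three replicas wait on nodes that never appear), while the same call against `eks_with_autoscaler` is 0.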

DRA rules out the obvious alternatives: it's incompatible with both Karpenter and EKS Auto Mode, so neither can back our GPU pools. That leaves the Kubernetes cluster autoscaler on managed node groups.

This composes the autoscaler in compose-eks-cluster, alongside the EFS CSI driver it mirrors: a custom IAM policy and role bound to the cluster-autoscaler ServiceAccount through EKS Pod Identity (reusing the eks-pod-identity-agent addon), and the cluster-autoscaler Helm chart on the cluster's own helm ProviderConfig. The autoscaler discovers node groups by the tags EKS puts on their ASGs, so the EKS cluster name is pinned to the XR name to keep autoDiscovery.clusterName in sync. The Helm release is gated on the cluster being observed, and the EKSCluster pipeline gains a compose-usages step so the ProviderConfig outlives the release on teardown.
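As a rough sketch of the pieces described above (not the code in fn.py), the helpers below build a Pod Identity trust policy, an autoscaler IAM policy, and Helm values that tie `autoDiscovery.clusterName` to the XR name. The function names are hypothetical, the IAM action list is an assumption drawn from the upstream cluster-autoscaler AWS guidance rather than from this PR, and the ServiceAccount name simply follows the description.

```python
import json


def pod_identity_trust_policy() -> str:
    """Trust policy letting EKS Pod Identity assume the role for a pod."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "pods.eks.amazonaws.com"},
            "Action": ["sts:AssumeRole", "sts:TagSession"],
        }],
    })


def autoscaler_policy() -> str:
    """IAM policy for the autoscaler. Assumption: a subset of the actions
    the upstream cluster-autoscaler docs list, not this PR's exact policy."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": [
                "autoscaling:DescribeAutoScalingGroups",
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:DescribeLaunchConfigurations",
                "autoscaling:DescribeScalingActivities",
                "autoscaling:SetDesiredCapacity",
                "autoscaling:TerminateInstanceInAutoScalingGroup",
                "ec2:DescribeInstanceTypes",
                "ec2:DescribeLaunchTemplateVersions",
                "eks:DescribeNodegroup",
            ],
            "Resource": "*",
        }],
    })


def discovery_tags(cluster_name: str) -> list[str]:
    """ASG tag keys the autoscaler's auto-discovery matches on."""
    return [
        "k8s.io/cluster-autoscaler/enabled",
        f"k8s.io/cluster-autoscaler/{cluster_name}",
    ]


def autoscaler_helm_values(xr_name: str, region: str) -> dict:
    """Helm values; clusterName reuses the XR name the EKS cluster is pinned to."""
    return {
        "cloudProvider": "aws",
        "awsRegion": region,
        "autoDiscovery": {"clusterName": xr_name},
        "rbac": {"serviceAccount": {"create": True, "name": "cluster-autoscaler"}},
    }
```

Because the second discovery tag embeds the cluster name, a mismatch between the EKS cluster name and `autoDiscovery.clusterName` would leave the autoscaler with no node groups to manage, which is why pinning the name matters.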

With a working autoscaler on EKS, maxNodeCount is reachable headroom on both sources, so this supersedes #173 (the per-source autoscaled flag). The node count gpu_pools already publishes is now honest for EKS, with no per-source distinction needed.

I have:

Read and followed Modelplane's contribution process.
Run nix flake check (or ./nix.sh flake check) and made sure it passes.
Added or updated tests covering any composition function changes.
Signed off every commit with git commit -s.

Copilot

Pull request overview

This PR adds Kubernetes Cluster Autoscaler support to EKSCluster compositions so GPU node pools can actually scale up to maxNodeCount, aligning EKS behavior with GKE and preventing the scheduler from overcommitting to capacity that will never materialize.

Changes:

Compose cluster-autoscaler on EKS clusters (IAM policy/role + Pod Identity association + Helm Release gated on cluster observation).
Pin the EKS cluster name to a compose-time-known value used by autoscaler autodiscovery.
Extend the EKSCluster composition pipeline with a compose-usages step and update unit tests / schema lock.
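One plausible reading of "gated on cluster observation" is that the function emits the Helm Release only once the EKS cluster appears among the observed composed resources. The sketch below models that with plain dictionaries instead of the real function SDK types. The resource keys, the `compose_autoscaler_release` name, the provider-helm `apiVersion`, the target namespace, and the ProviderConfig naming are assumptions, not details taken from the PR.

```python
CLUSTER_KEY = "eks-cluster"         # hypothetical composed-resource name
RELEASE_KEY = "cluster-autoscaler"  # hypothetical composed-resource name


def compose_autoscaler_release(observed: dict, desired: dict, xr_name: str) -> dict:
    """Return desired resources, adding the Release only once the cluster exists.

    `observed` and `desired` map composed-resource names to resource bodies.
    """
    out = dict(desired)
    if CLUSTER_KEY not in observed:
        # Cluster not observed yet: emit nothing and retry on the next reconcile.
        return out
    out[RELEASE_KEY] = {
        "apiVersion": "helm.crossplane.io/v1beta1",  # assumption: provider-helm
        "kind": "Release",
        "spec": {
            "forProvider": {
                "chart": {
                    "name": "cluster-autoscaler",
                    "repository": "https://kubernetes.github.io/autoscaler",
                },
                "namespace": "kube-system",  # assumption
                "values": {"autoDiscovery": {"clusterName": xr_name}},
            },
            # Assumption: the cluster's helm ProviderConfig shares the XR name.
            "providerConfigRef": {"name": xr_name},
        },
    }
    return out
```

Calling it with an empty `observed` map returns the desired resources unchanged; once `observed` contains the cluster key, the Release is added.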

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `schemas/.lock.json` | Updates schema lock to reflect generated model changes used by the new resources. |
| `functions/compose-eks-cluster/function/fn.py` | Composes autoscaler IAM + Pod Identity + Helm Release and pins cluster naming for autodiscovery. |
| `functions/compose-eks-cluster/tests/test_fn.py` | Adds expected resources and gating behavior coverage for autoscaler composition. |
| `apis/eksclusters/composition.yaml` | Adds `compose-usages` pipeline step to keep ProviderConfig dependencies alive for teardown ordering. |
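For the teardown ordering mentioned in the last row, a Crossplane Usage declares that one resource is in use by another, which blocks deletion of the used resource while the user still exists. The sketch below shows the shape such a resource might take for a ProviderConfig used by a Helm Release. It is not what compose-usages actually emits: the helper name, the Usage and provider-helm API versions, and the metadata name are assumptions.

```python
def provider_config_usage(
    provider_config_name: str,
    release_name: str,
    usage_api_version: str = "apiextensions.crossplane.io/v1beta1",  # assumption
) -> dict:
    """Build a Usage stating that the Release uses the ProviderConfig."""
    helm_api = "helm.crossplane.io/v1beta1"  # assumption: provider-helm
    return {
        "apiVersion": usage_api_version,
        "kind": "Usage",
        "metadata": {"name": f"{release_name}-uses-{provider_config_name}"},
        "spec": {
            # The resource being protected from early deletion.
            "of": {
                "apiVersion": helm_api,
                "kind": "ProviderConfig",
                "resourceRef": {"name": provider_config_name},
            },
            # The resource that depends on it.
            "by": {
                "apiVersion": helm_api,
                "kind": "Release",
                "resourceRef": {"name": release_name},
            },
        },
    }
```

With this in place, the ProviderConfig cannot be deleted until the Release is gone, so the Release can still be uninstalled through it during teardown.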


The fleet scheduler treats InferenceCluster.status.gpuPools[].nodes as the node headroom it may place ModelReplicas against, and for every cluster source that is a pool's maxNodeCount. On GKE that holds: the managed control plane autoscales node pools up to maxNodeCount on demand. On EKS it didn't. We compose the managed node group with a scalingConfig but install nothing to scale within it, so only the realized nodeCount ever materializes. The scheduler, trusting maxNodeCount, places gangs onto nodes that never appear and the pods hang Pending forever (#166).

DRA rules out the obvious alternatives: it's incompatible with both Karpenter and EKS Auto Mode, so neither can back our GPU pools. That leaves the Kubernetes cluster autoscaler on managed node groups.

This change composes the autoscaler in compose-eks-cluster, alongside the EFS CSI driver it mirrors: a custom IAM policy and role bound to the cluster-autoscaler ServiceAccount through EKS Pod Identity (reusing the eks-pod-identity-agent addon), and the cluster-autoscaler Helm chart on the cluster's own helm ProviderConfig. The autoscaler discovers node groups by the tags EKS puts on their ASGs, so the EKS cluster name is pinned to the XR name to keep autoDiscovery.clusterName in sync. The Helm release is gated on the cluster being observed, and the EKSCluster pipeline gains a compose-usages step so the ProviderConfig outlives the release on teardown.

Fixes #166.

Signed-off-by: Nic Cope <nicc@rk0n.org>

The node groups set scalingConfig.desiredSize in forProvider, which Crossplane continuously reconciles. Once the cluster autoscaler scales a group's ASG, Crossplane reverts its DesiredCapacity back to the composed nodeCount on the next reconcile, fighting the autoscaler: the classic autoscaler-versus-IaC conflict. An end-to-end test saw the two coexist during a scale-up, but on a longer horizon Crossplane would periodically scale the group back down.

This moves desiredSize into initProvider, which seeds it only at creation and is then ignored, and sets managementPolicies to exclude LateInitialize so the initProvider value takes effect. Crossplane now owns min/max; the autoscaler owns desired. This is the canonical initProvider use case in the Crossplane docs.

Signed-off-by: Nic Cope <nicc@rk0n.org>
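The split of ownership described in this commit can be sketched as a managed resource body. This mirrors the NodeGroup example in the Crossplane initProvider documentation rather than the composed output of fn.py: the `node_group` helper is hypothetical, and the `apiVersion` and the list-shaped `scalingConfig` are assumptions about the provider version in use.

```python
def node_group(name: str, min_size: int, max_size: int, initial_size: int) -> dict:
    """Build a NodeGroup whose desiredSize is seeded once and then left alone."""
    return {
        "apiVersion": "eks.aws.upbound.io/v1beta1",  # assumption
        "kind": "NodeGroup",
        "metadata": {"name": name},
        "spec": {
            # Every policy except LateInitialize, so the value the autoscaler
            # sets is not copied back into forProvider and then enforced.
            "managementPolicies": ["Observe", "Create", "Update", "Delete"],
            # Continuously reconciled: Crossplane owns the bounds.
            "forProvider": {
                "scalingConfig": [{"minSize": min_size, "maxSize": max_size}],
            },
            # Applied at creation only: the autoscaler owns desired afterwards.
            "initProvider": {
                "scalingConfig": [{"desiredSize": initial_size}],
            },
        },
    }
```

For example, `node_group("gpu", 0, 4, 1)` creates the group with one node and lets the autoscaler move the desired size anywhere between 0 and 4 without Crossplane resetting it.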

negz requested a review from Copilot June 17, 2026 23:50

Copilot started reviewing on behalf of negz June 17, 2026 23:50

Copilot AI reviewed Jun 17, 2026

Comment thread functions/compose-eks-cluster/function/fn.py

Comment thread functions/compose-eks-cluster/function/fn.py Outdated

Comment thread functions/compose-eks-cluster/function/fn.py

Comment thread functions/compose-eks-cluster/tests/test_fn.py

negz force-pushed the elastic-band branch from d4b5945 to d9b9e00 Compare June 18, 2026 00:05

negz added 2 commits June 17, 2026 19:10

negz force-pushed the elastic-band branch from 8fcfa3d to b3a6c15 Compare June 18, 2026 02:11

negz marked this pull request as ready for review June 18, 2026 03:11

negz merged commit 1101164 into main Jun 18, 2026
4 checks passed

negz deleted the elastic-band branch June 18, 2026 03:18

negz merged 2 commits into main from elastic-band
