Autoscale EKS GPU node pools with the cluster autoscaler #183
Merged
Pull request overview
This PR adds Kubernetes Cluster Autoscaler support to EKS-based EKSCluster compositions so GPU node pools can actually scale up to maxNodeCount, aligning EKS behavior with GKE and preventing the scheduler from overcommitting to capacity that will never materialize.
Changes:
- Compose cluster-autoscaler on EKS clusters (IAM policy/role + Pod Identity association + Helm Release gated on cluster observation).
- Pin the EKS cluster name to a compose-time-known value used by autoscaler autodiscovery.
- Extend the EKSCluster composition pipeline with a `compose-usages` step and update unit tests / schema lock.
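For readers who want a picture of the IAM pieces, here is a minimal sketch of the two policy documents such a composition might generate. The trust policy uses the service principal and actions AWS documents for EKS Pod Identity, and the permission actions follow the cluster autoscaler's published AWS guidance. The function names are hypothetical, and scoping the write actions by ASG tag is an assumption; the policy this PR actually composes is not shown on this page.

```python
import json


def pod_identity_trust_policy() -> dict:
    """Trust policy that lets EKS Pod Identity assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "pods.eks.amazonaws.com"},
                "Action": ["sts:AssumeRole", "sts:TagSession"],
            }
        ],
    }


def autoscaler_policy(cluster_name: str) -> dict:
    """Permissions for the cluster autoscaler (hypothetical helper).

    Read-only actions apply to all resources. The two write actions are
    limited (an assumption, not confirmed by the PR) to ASGs carrying the
    k8s.io/cluster-autoscaler/<cluster-name> tag with value "owned".
    """
    tag_key = f"aws:ResourceTag/k8s.io/cluster-autoscaler/{cluster_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "autoscaling:DescribeAutoScalingGroups",
                    "autoscaling:DescribeAutoScalingInstances",
                    "autoscaling:DescribeLaunchConfigurations",
                    "autoscaling:DescribeScalingActivities",
                    "ec2:DescribeImages",
                    "ec2:DescribeInstanceTypes",
                    "ec2:DescribeLaunchTemplateVersions",
                    "ec2:GetInstanceTypesFromInstanceRequirements",
                    "eks:DescribeNodegroup",
                ],
                "Resource": "*",
            },
            {
                "Effect": "Allow",
                "Action": [
                    "autoscaling:SetDesiredCapacity",
                    "autoscaling:TerminateInstanceInAutoScalingGroup",
                ],
                "Resource": "*",
                "Condition": {"StringEquals": {tag_key: "owned"}},
            },
        ],
    }


if __name__ == "__main__":
    # "demo-cluster" is a placeholder name for illustration only.
    print(json.dumps(autoscaler_policy("demo-cluster"), indent=2))
```

Because the tag key embeds the cluster name, a policy scoped this way only works if the name is known at compose time, which is one reason pinning the EKS cluster name matters.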
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| schemas/.lock.json | Updates schema lock to reflect generated model changes used by the new resources. |
| functions/compose-eks-cluster/function/fn.py | Composes autoscaler IAM + Pod Identity + Helm Release and pins cluster naming for autodiscovery. |
| functions/compose-eks-cluster/tests/test_fn.py | Adds expected resources and gating behavior coverage for autoscaler composition. |
| apis/eksclusters/composition.yaml | Adds compose-usages pipeline step to keep ProviderConfig dependencies alive for teardown ordering. |
The fleet scheduler treats InferenceCluster.status.gpuPools[].nodes as the node headroom it may place ModelReplicas against, and for every cluster source that is a pool's maxNodeCount. On GKE that holds: the managed control plane autoscales node pools up to maxNodeCount on demand. On EKS it didn't. We compose the managed node group with a scalingConfig but install nothing to scale within it, so only the realized nodeCount ever materializes. The scheduler, trusting maxNodeCount, places gangs onto nodes that never appear and the pods hang Pending forever (#166).

DRA rules out the obvious alternatives: it's incompatible with both Karpenter and EKS Auto Mode, so neither can back our GPU pools. That leaves the Kubernetes cluster autoscaler on managed node groups.

This change composes the autoscaler in compose-eks-cluster, alongside the EFS CSI driver it mirrors: a custom IAM policy and role bound to the cluster-autoscaler ServiceAccount through EKS Pod Identity (reusing the eks-pod-identity-agent addon), and the cluster-autoscaler Helm chart on the cluster's own helm ProviderConfig. The autoscaler discovers node groups by the tags EKS puts on their ASGs, so the EKS cluster name is pinned to the XR name to keep autoDiscovery.clusterName in sync. The Helm release is gated on the cluster being observed, and the EKSCluster pipeline gains a compose-usages step so the ProviderConfig outlives the release on teardown.

Fixes #166.

Signed-off-by: Nic Cope <nicc@rk0n.org>
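The gating and name pinning described in this commit could look roughly like the sketch below. It is not the code from `fn.py`: the function names, the `kube-system` namespace, and the use of the XR name as the ProviderConfig name are all assumptions made for illustration. The Helm value keys (`autoDiscovery.clusterName`, `awsRegion`, `cloudProvider`, `rbac.serviceAccount.name`) are ones the upstream cluster-autoscaler chart defines.

```python
from typing import Optional


def autoscaler_values(cluster_name: str, region: str) -> dict:
    """Helm values for the upstream cluster-autoscaler chart."""
    return {
        "cloudProvider": "aws",
        "awsRegion": region,
        # Must match the name EKS uses when tagging the node groups' ASGs.
        "autoDiscovery": {"clusterName": cluster_name},
        "rbac": {"serviceAccount": {"create": True, "name": "cluster-autoscaler"}},
    }


def compose_autoscaler_release(
    xr_name: str, region: str, observed_cluster: Optional[dict]
) -> Optional[dict]:
    """Return a provider-helm Release, or None until the cluster is observed.

    Hypothetical helper: emitting nothing before the EKS cluster exists
    avoids a Release that targets a ProviderConfig with no cluster behind it.
    """
    if observed_cluster is None:
        return None
    return {
        "apiVersion": "helm.crossplane.io/v1beta1",
        "kind": "Release",
        "spec": {
            "forProvider": {
                "chart": {
                    "name": "cluster-autoscaler",
                    "repository": "https://kubernetes.github.io/autoscaler",
                },
                "namespace": "kube-system",  # assumption
                # The EKS cluster name is pinned to the XR name, so the same
                # string can be used here at compose time.
                "values": autoscaler_values(xr_name, region),
            },
            "providerConfigRef": {"name": xr_name},  # assumption
        },
    }
```

Calling `compose_autoscaler_release("my-xr", "us-west-2", None)` returns `None`. Once an observed cluster is passed in, the returned Release carries `autoDiscovery.clusterName == "my-xr"`.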
The node groups set scalingConfig.desiredSize in forProvider, which Crossplane continuously reconciles. Once the cluster autoscaler scales a group's ASG, Crossplane reverts its DesiredCapacity back to the composed nodeCount on the next reconcile, fighting the autoscaler: the classic autoscaler-versus-IaC conflict. An end-to-end test saw the two coexist during a scale-up, but on a longer horizon Crossplane would periodically scale the group back down.

This moves desiredSize into initProvider, which seeds it only at creation and is then ignored, and sets managementPolicies to exclude LateInitialize so the initProvider value takes effect. Crossplane now owns min/max; the autoscaler owns desired. This is the canonical initProvider use case in the Crossplane docs.

Signed-off-by: Nic Cope <nicc@rk0n.org>
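A minimal sketch of the resulting node group shape, modeled on the NodeGroup example in the Crossplane managed resources documentation. The helper name and its parameters are hypothetical, and the `eks.aws.upbound.io/v1beta1` API version (where `scalingConfig` is a list) is an assumption about the provider in use; the composed resource in this PR may differ.

```python
def compose_node_group(
    name: str, cluster_name: str, node_count: int, min_nodes: int, max_nodes: int
) -> dict:
    """Build an EKS NodeGroup managed resource (hypothetical helper)."""
    return {
        "apiVersion": "eks.aws.upbound.io/v1beta1",  # assumption
        "kind": "NodeGroup",
        "metadata": {"name": name},
        "spec": {
            # Every policy except LateInitialize, so the provider does not
            # copy the live desiredSize back into forProvider.
            "managementPolicies": ["Observe", "Create", "Update", "Delete"],
            "forProvider": {
                "clusterName": cluster_name,
                # Crossplane keeps reconciling the bounds.
                "scalingConfig": [{"minSize": min_nodes, "maxSize": max_nodes}],
            },
            "initProvider": {
                # Applied only at creation; afterwards the autoscaler owns it.
                "scalingConfig": [{"desiredSize": node_count}],
            },
        },
    }
```

With this split, a later reconcile compares only `minSize` and `maxSize` against the live group, so an autoscaler-driven change to the desired capacity is no longer treated as drift.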
Description of your changes
Fixes #166.
Closes #173.
The fleet scheduler treats `InferenceCluster.status.gpuPools[].nodes` as the node headroom it may place `ModelReplica`s against, and for every cluster source that is a pool's `maxNodeCount`. On GKE that holds: the managed control plane autoscales node pools up to `maxNodeCount` on demand. On EKS it didn't. We compose the managed node group with a `scalingConfig` but install nothing to scale within it, so only the realized `nodeCount` ever materializes. The scheduler, trusting `maxNodeCount`, places gangs onto nodes that never appear and the pods hang `Pending` forever.

DRA rules out the obvious alternatives: it's incompatible with both Karpenter and EKS Auto Mode, so neither can back our GPU pools. That leaves the Kubernetes cluster autoscaler on managed node groups.
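To make the failure mode concrete, here is a toy model of the mismatch between what the scheduler believes and what can actually be reached. It is purely illustrative: the class, the function names, and the one-replica-per-node simplification are hypothetical and are not the fleet scheduler's real logic.

```python
from dataclasses import dataclass


@dataclass
class GpuPool:
    node_count: int      # nodes that exist right now
    max_node_count: int  # ceiling the pool is allowed to scale to


def scheduler_accepts(pool: GpuPool, gang_size: int) -> bool:
    """The scheduler trusts max_node_count as available headroom."""
    return gang_size <= pool.max_node_count


def gang_can_run(pool: GpuPool, gang_size: int, autoscaled: bool) -> bool:
    """Nodes beyond node_count only appear if something scales the pool."""
    reachable = pool.max_node_count if autoscaled else pool.node_count
    return gang_size <= reachable


pool = GpuPool(node_count=2, max_node_count=8)
# Without an autoscaler the scheduler still places a gang of 4,
# but only 2 nodes ever exist, so the remaining pods stay Pending.
placed_but_stuck = scheduler_accepts(pool, 4) and not gang_can_run(pool, 4, autoscaled=False)
```

In this model `placed_but_stuck` is `True` for the unscaled pool and becomes `False` once `autoscaled=True`, which is the gap the autoscaler closes on EKS.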
This composes the autoscaler in `compose-eks-cluster`, alongside the EFS CSI driver it mirrors: a custom IAM policy and role bound to the `cluster-autoscaler` ServiceAccount through EKS Pod Identity (reusing the `eks-pod-identity-agent` addon), and the cluster-autoscaler Helm chart on the cluster's own helm `ProviderConfig`. The autoscaler discovers node groups by the tags EKS puts on their ASGs, so the EKS cluster name is pinned to the XR name to keep `autoDiscovery.clusterName` in sync. The Helm release is gated on the cluster being observed, and the EKSCluster pipeline gains a `compose-usages` step so the `ProviderConfig` outlives the release on teardown.

With a working autoscaler on EKS, `maxNodeCount` is reachable headroom on both sources, so this supersedes #173 (the per-source `autoscaled` flag): the node count `gpu_pools` already publishes is now honest for EKS, with no per-source distinction needed.

I have:

- `nix flake check` (or `./nix.sh flake check`) and made sure it passes.
- `git commit -s`.