Description
Which component are you using?:
cluster-autoscaler for Cluster API (CAPI).
What version of the component are you using?:
Component version: v1.26.1
What k8s version are you using (`kubectl version`)?:
v1.26.1
What environment is this in?:
CAPI clusters using Cluster API Provider AWS (CAPA).
There are (at least) two clusters involved in this setup. One is the management cluster, which is itself managed by CAPI (self-managed). The second is the workload cluster, managed by the management cluster.
```
+------------+              +----------+
|    mgmt    |              | workload |
| ---------- |              |          |
|    CAPI    +------------->|          |
+------------+              +----------+
```
In this setup cluster-autoscaler was not configured with access to the workload cluster; it only has access to the management cluster's Kubernetes API via its service account. Through that access it can read the CAPI objects of all clusters, and it discovers node groups via its auto-discovery feature.
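For context, what I mean by "can read the CAPI objects of all clusters" is roughly the sketch below (not the autoscaler's actual code, just an illustration using the standard CAPI group/version and cluster-name label): a cluster-wide list of `MachineDeployments` on the management cluster returns the node groups of every managed cluster, including ones whose Kubernetes API the autoscaler cannot reach.

```go
// Illustration only: list every MachineDeployment the service account can
// see on the management cluster, the way auto-discovery effectively does
// when it is not restricted to a namespace or cluster name.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	// The autoscaler's own service account on the management cluster.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	gvr := schema.GroupVersionResource{
		Group:    "cluster.x-k8s.io",
		Version:  "v1beta1",
		Resource: "machinedeployments",
	}

	// A cluster-wide list returns the MachineDeployments of every managed
	// cluster, not only those of the cluster whose nodes are visible here.
	mds, err := client.Resource(gvr).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, md := range mds.Items {
		fmt.Printf("%s/%s (cluster: %s)\n",
			md.GetNamespace(), md.GetName(),
			md.GetLabels()["cluster.x-k8s.io/cluster-name"])
	}
}
```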
What did you expect to happen?:
I would expect cluster-autoscaler to not consider clusters whose Kubernetes API it cannot access.
What happened instead?:
Via its auto-discovery feature, cluster-autoscaler identifies the `MachineDeployments` of the workload cluster and, since it can't access that cluster's Kubernetes API, assumes the nodes are unregistered and deletes them after 15 minutes (the default, configurable via `--max-node-provision-time`).
```
I0214 20:53:39.989780 1 static_autoscaler.go:388] 4 unregistered nodes present
I0214 20:53:40.558558 1 node_instances_cache.go:150] Invalidate entry in cloud provider node instances cache MachineDeployment/default/as002-md-0
I0214 20:53:40.558597 1 static_autoscaler.go:693] Removing unregistered node aws:///us-west-2a/i-036e48326d72b8596
I0214 20:53:41.161843 1 node_instances_cache.go:150] Invalidate entry in cloud provider node instances cache MachineDeployment/default/as002-md-0
I0214 20:53:41.161875 1 static_autoscaler.go:693] Removing unregistered node aws:///us-west-2a/i-000951411145ad6ff
I0214 20:53:41.162028 1 clusterapi_controller.go:577] node "aws:///us-west-2a/i-000951411145ad6ff" is in nodegroup "MachineDeployment/default/as002-md-0"
I0214 20:53:41.298995 1 leaderelection.go:278] successfully renewed lease kube-system/cluster-autoscaler
I0214 20:53:41.355255 1 clusterapi_controller.go:577] node "aws:///us-west-2a/i-000951411145ad6ff" is in nodegroup "MachineDeployment/default/as002-md-0"
I0214 20:53:41.355389 1 clusterapi_controller.go:577] node "aws:///us-west-2a/i-000951411145ad6ff" is in nodegroup "MachineDeployment/default/as002-md-0"
I0214 20:53:41.758912 1 node_instances_cache.go:150] Invalidate entry in cloud provider node instances cache MachineDeployment/default/as002-md-0
I0214 20:53:41.758944 1 static_autoscaler.go:693] Removing unregistered node aws:///us-west-2a/i-03be732f299d68aa4
```
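My understanding of the failure mode, as a simplified sketch (not the autoscaler's real implementation): the provider IDs taken from the workload cluster's CAPI objects never match any `Node` visible through the only API the autoscaler can reach, so every workload node looks permanently unregistered.

```go
// Simplified illustration, not the autoscaler's real code: an instance is
// treated as "unregistered" when its provider ID is never seen on any Node
// object the autoscaler can read.
package main

import "fmt"

func unregistered(instanceIDs []string, registeredNodes map[string]bool) []string {
	var missing []string
	for _, id := range instanceIDs {
		if !registeredNodes[id] {
			missing = append(missing, id)
		}
	}
	return missing
}

func main() {
	// Provider IDs coming from the workload cluster's Machines, which are
	// visible on the management cluster.
	instances := []string{
		"aws:///us-west-2a/i-036e48326d72b8596",
		"aws:///us-west-2a/i-000951411145ad6ff",
	}
	// Nodes visible through the only kubeconfig the autoscaler has (the
	// management cluster): the workload nodes never appear here.
	registered := map[string]bool{}

	// Everything is reported unregistered and gets removed once
	// --max-node-provision-time (15m by default) has passed.
	fmt.Println(unregistered(instances, registered))
}
```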
How to reproduce it (as minimally and precisely as possible):
- Create a cluster via CAPI and CAPA and make it self-managed.
- Create a second cluster managed by this first cluster.
- Deploy cluster-autoscaler to the management cluster with the `--cloud-provider=clusterapi` flag. No further config is required.
- Annotate the `MachineDeployments` with `cluster.x-k8s.io/cluster-api-autoscaler-node-group-[min|max]-size` (see the sketch after this list).
- Check that after 15 minutes the `MachineDeployment` of the second cluster is scaled to its minimum even if the nodes are being utilized.
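If it helps with the annotation step, the sketch below does the equivalent of `kubectl annotate` from Go; the `MachineDeployment` namespace, name and min/max values are just placeholders, not something specific to my setup.

```go
// Sketch of the annotation step from Go (equivalent to kubectl annotate).
// The namespace, name and min/max values below are placeholders.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	gvr := schema.GroupVersionResource{
		Group:    "cluster.x-k8s.io",
		Version:  "v1beta1",
		Resource: "machinedeployments",
	}
	patch := []byte(`{"metadata":{"annotations":{` +
		`"cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size":"1",` +
		`"cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size":"5"}}}`)

	md, err := client.Resource(gvr).Namespace("default").Patch(
		context.TODO(), "as002-md-0", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("annotated", md.GetName())
}
```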
Anything else we need to know?:
It seems like cluster-autoscaler for CAPI is currently not designed to manage more than one cluster, since all the documented options allow providing just one kubeconfig. We should document that and consider disabling the auto-discovery feature by default.
This can lead to outages, since the deleted nodes may still be in use.
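One possible direction, sketched below with hypothetical names (this is not existing autoscaler code), would be to only keep discovered node groups that belong to the cluster the autoscaler actually has credentials for.

```go
// Hypothetical sketch, not existing autoscaler code: drop node groups whose
// owning cluster is not the one the autoscaler has credentials for.
package main

import "fmt"

type nodeGroup struct {
	Namespace   string
	Name        string
	ClusterName string // value of the cluster.x-k8s.io/cluster-name label
}

func filterByCluster(groups []nodeGroup, reachableCluster string) []nodeGroup {
	var kept []nodeGroup
	for _, g := range groups {
		if g.ClusterName == reachableCluster {
			kept = append(kept, g)
		}
	}
	return kept
}

func main() {
	discovered := []nodeGroup{
		{Namespace: "default", Name: "mgmt-md-0", ClusterName: "mgmt"},
		{Namespace: "default", Name: "as002-md-0", ClusterName: "as002"},
	}
	// Only the management cluster's API is reachable in the setup above.
	fmt.Println(filterByCluster(discovered, "mgmt"))
}
```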
I would like to work on this issue.