CAPI cluster autoscaler deletes nodes from clusters it doesn't have access to #5510

Open
@jonathanbeber

Description

Which component are you using?:

cluster-autoscaler for cluster API (CAPI).

What version of the component are you using?:

Component version: v.1.26.1

What k8s version are you using (kubectl version)?:

v.1.26.1

What environment is this in?:

CAPI clusters using cluster API Provider AWS (CAPA).

There are 2 clusters (at least) involved in this setup. One is the management cluster which is also managed by CAPI (self-managed). A second cluster is the workload cluster, managed by the management cluster.

+------------+             +----------+
|    mgmt    |<--          | workload |
| ---------- |   |         |          |
|    CAPI    +------------>|          |
+------------+             +----------+

In this setup, cluster-autoscaler was not configured with access to the workload cluster; it only has access to the management cluster's Kubernetes API via its service account. Through that API it can see the CAPI objects of all clusters, and its auto-discovery feature picks all of them up.

What did you expect to happen?:

I would expect cluster-autoscaler for Cluster API to ignore clusters whose Kubernetes API it cannot access.

What happened instead?:

Through its auto-discovery feature, cluster-autoscaler identifies the MachineDeployments of the workload cluster. Since it can't access that cluster's Kubernetes API, it assumes the nodes are unregistered and deletes them after 15 minutes (the default, configured by --max-node-provision-time).

I0214 20:53:39.989780       1 static_autoscaler.go:388] 4 unregistered nodes present
I0214 20:53:40.558558       1 node_instances_cache.go:150] Invalidate entry in cloud provider node instances cache MachineDeployment/default/as002-md-0
I0214 20:53:40.558597       1 static_autoscaler.go:693] Removing unregistered node aws:///us-west-2a/i-036e48326d72b8596
I0214 20:53:41.161843       1 node_instances_cache.go:150] Invalidate entry in cloud provider node instances cache MachineDeployment/default/as002-md-0
I0214 20:53:41.161875       1 static_autoscaler.go:693] Removing unregistered node aws:///us-west-2a/i-000951411145ad6ff
I0214 20:53:41.162028       1 clusterapi_controller.go:577] node "aws:///us-west-2a/i-000951411145ad6ff" is in nodegroup "MachineDeployment/default/as002-md-0"
I0214 20:53:41.298995       1 leaderelection.go:278] successfully renewed lease kube-system/cluster-autoscaler
I0214 20:53:41.355255       1 clusterapi_controller.go:577] node "aws:///us-west-2a/i-000951411145ad6ff" is in nodegroup "MachineDeployment/default/as002-md-0"
I0214 20:53:41.355389       1 clusterapi_controller.go:577] node "aws:///us-west-2a/i-000951411145ad6ff" is in nodegroup "MachineDeployment/default/as002-md-0"
I0214 20:53:41.758912       1 node_instances_cache.go:150] Invalidate entry in cloud provider node instances cache MachineDeployment/default/as002-md-0
I0214 20:53:41.758944       1 static_autoscaler.go:693] Removing unregistered node aws:///us-west-2a/i-03be732f299d68aa4

How to reproduce it (as minimally and precisely as possible):

  1. Create a cluster via CAPI and CAPA and make it self-managed.
  2. Create a second cluster managed by this first cluster.
  3. Deploy cluster-autoscaler to the management cluster with the --cloud-provider=clusterapi flag. No further config required.
  4. Annotate the MachineDeployments with cluster.x-k8s.io/cluster-api-autoscaler-node-group-[min|max]-size.
  5. Observe that after 15 minutes the MachineDeployment of the second cluster is scaled down to its minimum, even if the nodes are in use.
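For step 4, the `[min|max]` shorthand expands to two annotation keys. The sketch below builds a JSON merge patch that sets both; the helper names `sizeAnnotations` and `mergePatch` and the sizes 1 and 5 are illustrative assumptions, not values from this report.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Common prefix of the two autoscaler size annotations named in step 4.
const annotationPrefix = "cluster.x-k8s.io/cluster-api-autoscaler-node-group-"

// sizeAnnotations expands the [min|max] shorthand into the two annotation keys.
// Annotation values are strings in Kubernetes, so the sizes are formatted as text.
func sizeAnnotations(min, max int) map[string]string {
	return map[string]string{
		annotationPrefix + "min-size": fmt.Sprint(min),
		annotationPrefix + "max-size": fmt.Sprint(max),
	}
}

// mergePatch builds a JSON merge patch body that sets the given
// annotations in an object's metadata.
func mergePatch(annotations map[string]string) (string, error) {
	body := map[string]interface{}{
		"metadata": map[string]interface{}{"annotations": annotations},
	}
	b, err := json.Marshal(body)
	return string(b), err
}

func main() {
	patch, err := mergePatch(sizeAnnotations(1, 5)) // 1 and 5 are example sizes
	if err != nil {
		panic(err)
	}
	fmt.Println(patch)
}
```

The printed patch could then be applied to a MachineDeployment with something like `kubectl patch machinedeployment <name> --type merge -p '<patch>'`.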

Anything else we need to know?:

It seems that cluster-autoscaler for CAPI is currently not designed to manage more than one cluster, since all the documented options allow providing only one kubeconfig. We should document that and consider disabling the auto-discovery feature by default.

This behaviour can lead to outages, since the deleted nodes may be in use.
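One way to picture a safer discovery scope is to keep only the node groups that belong to the single cluster the autoscaler can actually reach. The following is a hedged sketch of that idea, not the autoscaler's implementation: the `MachineDeployment` struct, the `scopeToCluster` function, and the names `mgmt`, `workload`, and `mgmt-md-0` are hypothetical.

```go
package main

import "fmt"

// MachineDeployment is a hypothetical, trimmed-down view of a CAPI
// MachineDeployment; ClusterName stands in for the owning cluster's name.
type MachineDeployment struct {
	Namespace   string
	Name        string
	ClusterName string
}

// scopeToCluster keeps only the MachineDeployments that belong to clusterName.
// An empty clusterName keeps everything, which models unrestricted auto-discovery.
func scopeToCluster(mds []MachineDeployment, clusterName string) []MachineDeployment {
	if clusterName == "" {
		return mds
	}
	var kept []MachineDeployment
	for _, md := range mds {
		if md.ClusterName == clusterName {
			kept = append(kept, md)
		}
	}
	return kept
}

func main() {
	// Everything visible through the management cluster's API.
	discovered := []MachineDeployment{
		{Namespace: "default", Name: "mgmt-md-0", ClusterName: "mgmt"},
		{Namespace: "default", Name: "as002-md-0", ClusterName: "workload"},
	}

	// Only the cluster whose Kubernetes API the autoscaler can reach is considered.
	for _, md := range scopeToCluster(discovered, "mgmt") {
		fmt.Printf("managing MachineDeployment/%s/%s\n", md.Namespace, md.Name)
	}
}
```

With such a scope, the workload cluster's MachineDeployment never becomes a node group, so its nodes are never evaluated as unregistered.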

I would like to work on this issue.

Labels

area/cluster-autoscaler, area/provider/cluster-api, help wanted, kind/bug
