
CAPI cluster autoscaler delete nodes from clusters it doesn't have access #5510

Open
jonathanbeber opened this issue Feb 14, 2023 · 15 comments
Labels
area/cluster-autoscaler area/provider/cluster-api Issues or PRs related to Cluster API provider help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug.

Comments

@jonathanbeber

jonathanbeber commented Feb 14, 2023

Which component are you using?:

cluster-autoscaler for cluster API (CAPI).

What version of the component are you using?:

Component version: v1.26.1

What k8s version are you using (kubectl version)?:

v1.26.1

What environment is this in?:

CAPI clusters using Cluster API Provider AWS (CAPA).

There are 2 clusters (at least) involved in this setup. One is the management cluster which is also managed by CAPI (self-managed). A second cluster is the workload cluster, managed by the management cluster.

+------------+             +----------+
|    mgmt    |<--          | workload |
| ---------- |   |         |          |
|    CAPI    +------------>|          |
+------------+             +----------+

In this setup cluster-autoscaler was not configured with access to the workload cluster; it only has access to the management cluster's Kubernetes API via its service account. Through that access it can see the CAPI objects of all clusters and picks them up via its auto-discovery feature.

What did you expect to happen?:

I would expect cluster-autoscaler to not consider clusters whose Kubernetes API it doesn't have access to.

What happened instead?:

Through its auto-discovery feature, cluster-autoscaler identifies MachineDeployments for the workload cluster. Since it can't access that cluster's Kubernetes API, it assumes the nodes are unregistered and deletes them after 15 minutes (the default, configurable via --max-node-provision-time).

I0214 20:53:39.989780       1 static_autoscaler.go:388] 4 unregistered nodes present
I0214 20:53:40.558558       1 node_instances_cache.go:150] Invalidate entry in cloud provider node instances cache MachineDeployment/default/as002-md-0
I0214 20:53:40.558597       1 static_autoscaler.go:693] Removing unregistered node aws:///us-west-2a/i-036e48326d72b8596
I0214 20:53:41.161843       1 node_instances_cache.go:150] Invalidate entry in cloud provider node instances cache MachineDeployment/default/as002-md-0
I0214 20:53:41.161875       1 static_autoscaler.go:693] Removing unregistered node aws:///us-west-2a/i-000951411145ad6ff
I0214 20:53:41.162028       1 clusterapi_controller.go:577] node "aws:///us-west-2a/i-000951411145ad6ff" is in nodegroup "MachineDeployment/default/as002-md-0"
I0214 20:53:41.298995       1 leaderelection.go:278] successfully renewed lease kube-system/cluster-autoscaler
I0214 20:53:41.355255       1 clusterapi_controller.go:577] node "aws:///us-west-2a/i-000951411145ad6ff" is in nodegroup "MachineDeployment/default/as002-md-0"
I0214 20:53:41.355389       1 clusterapi_controller.go:577] node "aws:///us-west-2a/i-000951411145ad6ff" is in nodegroup "MachineDeployment/default/as002-md-0"
I0214 20:53:41.758912       1 node_instances_cache.go:150] Invalidate entry in cloud provider node instances cache MachineDeployment/default/as002-md-0
I0214 20:53:41.758944       1 static_autoscaler.go:693] Removing unregistered node aws:///us-west-2a/i-03be732f299d68aa4

How to reproduce it (as minimally and precisely as possible):

  1. Create a cluster via CAPI and CAPA and make it self-managed.
  2. Create a second cluster managed by this first cluster.
  3. Deploy cluster-autoscaler to the management cluster with the --cloud-provider=clusterapi flag. No further config required.
  4. Annotate the MachineDeployments with cluster.x-k8s.io/cluster-api-autoscaler-node-group-[min|max]-size.
  5. Check that in 15 minutes the MachineDeployment of the second cluster is scaled to its minimum even if the nodes are being utilized.
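For reference, the autoscaler invocation in this setup looks roughly like the following (a sketch: only --cloud-provider=clusterapi and the 15-minute default come from the steps above; everything else is an assumption):

```shell
# Sketch of the cluster-autoscaler args used on the management cluster.
# --cloud-provider=clusterapi is from the repro steps; --max-node-provision-time
# is shown at its default (15m), after which "unregistered" nodes are removed.
cluster-autoscaler \
  --cloud-provider=clusterapi \
  --max-node-provision-time=15m
```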

Anything else we need to know?:

It seems like cluster-autoscaler for CAPI is currently not designed to manage more than one cluster, since all the documented options allow providing just one kubeconfig. We should document that and consider disabling the auto-discovery feature by default.

That can lead to outages, since the deleted nodes might still be running workloads.

I would like to work on this issue.

@jonathanbeber jonathanbeber added the kind/bug Categorizes issue or PR as related to a bug. label Feb 14, 2023
@jonathanbeber jonathanbeber changed the title CAPI cluster autoscaler delete nodes from clusters where it doesn't have access CAPI cluster autoscaler delete nodes from clusters it doesn't have access Feb 14, 2023
@elmiko
Contributor

elmiko commented Feb 21, 2023

is the autoscaler managing a single cluster or all the clusters? oops, i saw the topology later

in general, we do not recommend running a single autoscaler to cover multiple clusters. but, with that said, i would expect it to be ok for the autoscaler to manage a single cluster even if the management cluster hosts multiple clusters (if the capi resources are isolated).

It seems like currently cluster-autoscaler for CAPI does not plan to manage more than 1 cluster, since all the documented options allows to provide just one kubeconfig.

this is by design, but we do also allow specifying multiple kubeconfigs, though only for a single "cluster" deployment, e.g. one config for management and one config for workload.

We should document that and consider making the autodiscovery feature disabled by default.

agreed about documenting this. i'm not sure we want to disable autodiscovery as users have come to expect that the autoscaler works that way, but i do wonder if we couldn't be doing more to identify the cluster owner. i'm not sure that is desirable, but something to think about.

@jonathanbeber
Author

thank you for taking a look, @elmiko.

agreed about documenting this.

Ok, perfect, I can already start working at least on this one. I guess the docs should warn that if the management cluster manages multiple clusters, we recommend multiple cluster-autoscaler (CA) deployments, and each instance of CA must filter its own cluster with --node-group-auto-discovery=clusterapi:clusterName=${CLUSTER_NAME}. WDYT?
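In concrete terms, each per-cluster CA deployment would pin its discovery scope like this (CLUSTER_NAME is a placeholder to be set per deployment):

```shell
# One cluster-autoscaler deployment per workload cluster; CLUSTER_NAME is a
# placeholder for the CAPI Cluster's name.
cluster-autoscaler \
  --cloud-provider=clusterapi \
  --node-group-auto-discovery=clusterapi:clusterName=${CLUSTER_NAME}
```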

i'm not sure we want to disable autodiscovery as users have come to expect that the autoscaler works that way, but i do wonder if we couldn't be doing more to identify the cluster owner. i'm not sure that is desirable, but something to think about.

I think I'm missing some context here, but I don't see any benefit of auto-discovery across multiple clusters, since CA won't be able to manage more than one cluster. I agree with you that auto-discovery of nodepools in the same cluster is something I would expect.

I'm thinking of two possible solutions here, since IMHO documentation alone is not a valid solution: it can still lead to serious incidents where nodes running workloads get deleted.

  1. If nodepools of two different clusters are identified, the cluster autoscaler refuses to downscale unregistered nodes (I don't like this one very much);
  2. The cluster autoscaler reads the annotation cluster.x-k8s.io/cluster-name from a registered node before agreeing to manage a nodepool. This way we know that CA has access to the right Kubernetes API.
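Option 2 boils down to a lookup like the following (a sketch; the node name is a placeholder, and it assumes CAPI has set its usual annotations on the registered node):

```shell
# Read the owning cluster's name from a registered node; CAPI sets this
# annotation on nodes it manages. <node-name> is a placeholder.
kubectl get node <node-name> \
  -o jsonpath='{.metadata.annotations.cluster\.x-k8s\.io/cluster-name}'
```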

@elmiko
Contributor

elmiko commented Feb 22, 2023

Ok, perfect, I can already start working at least on this one. I guess the docs should warn that if the management cluster manages multiple clusters, we recommend multiple cluster-autoscaler (CA) deployments, and each instance of CA must filter its own cluster with --node-group-auto-discovery=clusterapi:clusterName=${CLUSTER_NAME}. WDYT?

+1, starting with a docs patch would be great imo. instructing users to filter on their clusterName as part of the autodiscovery also seems like a win.

I think I'm missing some context in here, but I don't see any benefit of autodiscovery for multiple clusters since CA won't be able to manage more than one cluster. I agree with you that autodiscovery for nodepools in the same cluster is something I would expect.

yeah, i wasn't suggesting that we should make it more tolerant to multi-cluster setups. i was saying what you suggested about adding the clusterName into the autodiscovery in a less eloquent manner ;)

I'm thinking of two possible solutions here, since IMHO documentation alone is not a valid solution: it can still lead to serious incidents where nodes running workloads get deleted.

1. If nodepools of two different clusters are identified the cluster autoscaler refuses to downscale unregistered nodes (I don't like it very much);

2. The cluster autoscaler reads the annotation `cluster.x-k8s.io/cluster-name` from a registered node before accepting to manage a nodepool. This way we know that CA has access to the right Kubernetes API.

i like option two, but then i wonder how would that work in a scale from zero scenario?

@jonathanbeber
Author

but then i wonder how would that work in a scale from zero scenario?

I don't think scaling from zero would be a problem, because at least one control-plane node has to be registered, right? I can't think of a Kubernetes cluster that can scale all of its nodes to 0; I can imagine scaling just some nodepools to 0. Do you think I might be missing something?

@elmiko
Contributor

elmiko commented Feb 22, 2023

i was thinking that perhaps we would need a way to signal the cluster-name from the scalable resource so that when a node group is scaling from zero replicas the autoscaler could ensure that it is focused on the proper node group. but, i think you are correct about grabbing the cluster name from a control plane node or similar, that should be relatively safe.
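For what it's worth, the scalable resource already carries such a signal: a MachineDeployment records its owning cluster in spec.clusterName, which would be readable even at zero replicas (a sketch, reusing the MachineDeployment name from the logs above):

```shell
# The owning cluster is recorded on the scalable resource itself, so it is
# available even when the node group has zero replicas.
kubectl get machinedeployment as002-md-0 -n default \
  -o jsonpath='{.spec.clusterName}'
```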

one of my main concerns here also is to ensure that we don't make this too impenetrable for others to understand, ideally we will solve this in a manner that is clear for users to understand how the cluster name factors in to the decisions.

@jonathanbeber
Author

one of my main concerns here also is to ensure that we don't make this too impenetrable for others to understand, ideally we will solve this in a manner that is clear for users to understand how the cluster name factors in to the decisions.

That's a very good concern!

I would very much like to have it clear for users too, but I can't think of any way to make it clearer other than documentation and good logs, saying something like "nodepool foo-bar is not considered for scaling up/down because it refers to a cluster different from the one identified in the workload cluster". I believe that's still not 100% clear, but the term workload cluster is defined in the docs, the same docs where we will provide further details on how we identify the cluster name (via the annotation on any registered node).

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 23, 2023
@k8s-triage-robot

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 23, 2023
@jonathanbeber
Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 12, 2023
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 24, 2024
@jimmidyson
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 24, 2024
@elmiko
Contributor

elmiko commented Jan 26, 2024

i think this is a good issue to keep open, we just need to find someone who might want to propose the docs and logging change.

@towca towca added the area/provider/cluster-api Issues or PRs related to Cluster API provider label Mar 21, 2024
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 19, 2024
@elmiko
Contributor

elmiko commented Jun 20, 2024

still would like to see this fixed

/remove-lifecycle stale
/help

@k8s-ci-robot
Contributor

@elmiko:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

still would like to see this fixed

/remove-lifecycle stale
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 20, 2024