Which component are you using?:

cluster-autoscaler

Problem
The cluster-autoscaler currently supports a maximum number of nodes per cluster, but imposes no restrictions on the number of nodes per zone. There are scenarios, however, where it would be useful for the CA to restrict the maximum number of nodes per zone. For example, if an application has a heavy bias towards one zone (or more generally, if the cluster is unbalanced for any reason), this can result in IP exhaustion in that zone. With the CA’s current behavior, it is not aware that a zone is out of IPs and will therefore continue to attempt scale-ups in the exhausted zone. This leads to a buildup of NotReady nodes, leaving the cluster in a degraded state.
The autoscaler should allow users to specify a maximum number of nodes per zone. With this feature, the autoscaler could prevent scale-ups beyond that maximum and avoid scenarios such as IP exhaustion.
More generally, we would like a more granular way to limit the number of nodes beyond just the cluster-wide maximum. With the solutions below, we propose a way for users to customize how the autoscaler limits the maximum number of nodes for a nodegroup.
Proposed Solution
Our first proposed solution is to allow users to limit the maximum number of nodes per nodegroup via the autoscaler’s gRPC expander. To do so, the expander would include a list of similar nodegroups with its bestOption. This gives the gRPC expander the ability to run custom logic that filters out nodegroups based on certain characteristics, such as whether a zone is at capacity.
Currently, during a scale-up, CA computes the valid options and the similar nodegroups for each option, then asks the expander for the best option. After that, the similar nodegroups get recomputed. The recomputation used to occur only when the bestOption nodegroup did not exist and got created; however, it was changed in this PR to always recompute. This means that if an expander changes the SimilarNodeGroups of the bestOption, the result will be replaced by the recomputation.
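As a minimal sketch of the kind of custom logic such a gRPC expander could run (every type name, map, and limit here is hypothetical and not part of the autoscaler's API), filtering expansion options and their similar nodegroups by per-zone capacity might look like this:

```go
package main

import "fmt"

// Option is a simplified, hypothetical stand-in for an expansion option as
// seen by a gRPC expander. The real expander exchanges protobuf messages.
type Option struct {
	NodeGroupID         string
	SimilarNodeGroupIDs []string
}

// zoneAtCapacity reports whether the zone that owns a nodegroup has reached
// its configured maximum node count.
func zoneAtCapacity(ng string, zoneOf map[string]string, nodesInZone, maxPerZone map[string]int) bool {
	z := zoneOf[ng]
	return nodesInZone[z] >= maxPerZone[z]
}

// filterOptions drops options whose own nodegroup sits in a full zone, and
// prunes similar nodegroups that sit in full zones.
func filterOptions(opts []Option, zoneOf map[string]string, nodesInZone, maxPerZone map[string]int) []Option {
	var out []Option
	for _, o := range opts {
		if zoneAtCapacity(o.NodeGroupID, zoneOf, nodesInZone, maxPerZone) {
			continue // the candidate itself is in an exhausted zone
		}
		var similar []string
		for _, s := range o.SimilarNodeGroupIDs {
			if !zoneAtCapacity(s, zoneOf, nodesInZone, maxPerZone) {
				similar = append(similar, s)
			}
		}
		out = append(out, Option{NodeGroupID: o.NodeGroupID, SimilarNodeGroupIDs: similar})
	}
	return out
}

func main() {
	zoneOf := map[string]string{"ng-a": "zone-1", "ng-b": "zone-2", "ng-c": "zone-2"}
	nodesInZone := map[string]int{"zone-1": 50, "zone-2": 10}
	maxPerZone := map[string]int{"zone-1": 50, "zone-2": 50}
	opts := []Option{{NodeGroupID: "ng-b", SimilarNodeGroupIDs: []string{"ng-a", "ng-c"}}}
	// zone-1 is full, so ng-a is pruned from the similar list.
	fmt.Println(filterOptions(opts, zoneOf, nodesInZone, maxPerZone)) // [{ng-b [ng-c]}]
}
```

The key point is that the expander returns the pruned similar-nodegroup list alongside its bestOption, which only helps if CA does not overwrite it afterwards — hence the second change below.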
To allow users to limit the maximum number of nodes per nodegroup via the gRPC expander, we propose two changes to the autoscaler.
The first is to add a field to the autoscaler’s gRPC Option request to include SimilarNodegroupIds. The autoscaler’s Options struct already includes SimilarNodegroups; we would just need to populate the proto request with the similar nodegroup IDs. Here is a PR which implements this. This would allow users to filter both the Options and each Option’s similar nodegroups in the gRPC expander. For example, they could remove options that have reached the maximum number of nodes in their zone.
The second change is to allow the autoscaler to trust the SimilarNodegroups returned by the expander, rather than recomputing them after getting the best option. To do so, we should add a CLI option to trust the expander’s similar nodegroups and skip the recomputation as long as the bestOption nodegroup exists. If it doesn’t exist, we can create it and compute the similar options. If a user does not enable this option, the behaviour stays the same by default: similar nodegroup recomputation is skipped only for users who enable it. Here is a PR which implements this.
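The decision the second change introduces can be sketched as a small truth table (the flag is hypothetical and named only for illustration):

```go
package main

import "fmt"

// shouldRecomputeSimilar sketches the proposed control flow: when the
// hypothetical trust-expander flag is enabled and the bestOption's
// nodegroup already exists, the expander's similar nodegroups are kept.
// In every other case CA recomputes them, matching today's behaviour.
func shouldRecomputeSimilar(trustExpanderSimilar, bestOptionNodeGroupExists bool) bool {
	return !(trustExpanderSimilar && bestOptionNodeGroupExists)
}

func main() {
	// Flag off (the default): always recompute, as CA does today.
	fmt.Println(shouldRecomputeSimilar(false, true)) // true
	// Flag on and the nodegroup exists: trust the expander's result.
	fmt.Println(shouldRecomputeSimilar(true, true)) // false
	// Flag on but the nodegroup must be created first: recompute as before.
	fmt.Println(shouldRecomputeSimilar(true, false)) // true
}
```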
With both of these changes, users can have their own max nodes per zone logic in their gRPC expander, by filtering out nodegroups from a zone that already has reached the max, while the default behaviour of the autoscaler would remain the same.
Overall, this gives gRPC expander users more flexibility when picking a best option and its similar nodegroups. This flexibility can be used in a variety of use cases beyond just max nodes per zone. The disadvantage is that users need to implement this logic on their side rather than relying on the cluster-autoscaler to do it.
Alternative Solution
Another solution we considered involves putting the control logic in cluster-autoscaler. Overall, this solution requires much more code on the cluster-autoscaler side.
To enforce max nodes per zone, we must first understand that the autoscaler currently has no concept of a zone. It only has knowledge of nodegroups (ASG, VMSS, etc.) and the nodes that belong to them. Depending on the cloud provider, these nodegroups may or may not carry metadata indicating which zone they belong to. Therefore, to enforce max nodes per zone, the implementation would be a more general “max nodes per nodegroup tag”.
This general feature can be applied to many other use cases which include:
- Max number of nodes with a “spot” tag
- Max number of nodes with a “gpu” tag
- Max number of nodes for a certain instance type
- Many more
If a user does not specify any tags, then CA must behave the same as it currently does.
Implementation:
First, we filter out invalid nodegroups, such as ones that have reached their max size. We would need to change this to also filter out nodegroups belonging to a “tag set” that has exceeded its max size.
Next, CA balances the desired nodes across similar nodegroups here. This function also checks the max size, so we would add a check that the tag set has enough space; if not, the scale-up fails.
After this the scaleup can succeed.
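The steps above can be sketched roughly as follows; the tag names, limits, and counting structure are all hypothetical, and nodegroups carrying only unlimited tags pass unconditionally, preserving today's behaviour when no limits are configured:

```go
package main

import "fmt"

// canAddNodes checks whether adding delta nodes to a nodegroup keeps every
// tag it carries under that tag's configured maximum. Tags without a
// configured maximum are unconstrained.
func canAddNodes(ngTags []string, countByTag, maxByTag map[string]int, delta int) bool {
	for _, t := range ngTags {
		limit, limited := maxByTag[t]
		if limited && countByTag[t]+delta > limit {
			return false // this tag set has no room for delta more nodes
		}
	}
	return true
}

func main() {
	countByTag := map[string]int{"spot": 48, "gpu": 3}
	maxByTag := map[string]int{"spot": 50} // only "spot" is limited
	fmt.Println(canAddNodes([]string{"spot"}, countByTag, maxByTag, 2)) // true: 48+2 <= 50
	fmt.Println(canAddNodes([]string{"spot"}, countByTag, maxByTag, 3)) // false: 48+3 > 50
	fmt.Println(canAddNodes([]string{"gpu"}, countByTag, maxByTag, 10)) // true: "gpu" has no limit
}
```

A check of this shape would be applied both when filtering invalid nodegroups and again when balancing desired nodes across similar nodegroups.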
For this implementation to hold, there are a few additional details we would have to cover:
- We need a cloud-agnostic way to access the tags on the nodegroups.
- We need to keep track of (or efficiently compute) the count of nodes grouped by the specified tag set. We are not counting nodes by their Kubernetes labels, but by their nodegroup tags.
- With each ScaleUp call, we would have to know the current number of nodes per tag set.
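Counting nodes by nodegroup tags (rather than Kubernetes labels) could be sketched like this, assuming each cloud provider exposes current sizes and tags for its nodegroups — both maps and their contents are hypothetical:

```go
package main

import "fmt"

// countNodesByTag aggregates the current node count per tag from each
// nodegroup's size and its cloud-provider tags. This intentionally counts
// by nodegroup tags, not by Kubernetes node labels.
func countNodesByTag(sizeByNodeGroup map[string]int, tagsByNodeGroup map[string][]string) map[string]int {
	counts := map[string]int{}
	for ng, size := range sizeByNodeGroup {
		for _, t := range tagsByNodeGroup[ng] {
			counts[t] += size
		}
	}
	return counts
}

func main() {
	sizes := map[string]int{"ng-a": 5, "ng-b": 3}
	tags := map[string][]string{"ng-a": {"spot", "zone-1"}, "ng-b": {"zone-1"}}
	// All 8 nodes carry "zone-1"; only ng-a's 5 carry "spot".
	fmt.Println(countNodesByTag(sizes, tags)) // map[spot:5 zone-1:8]
}
```

Something of this shape would need to be refreshed (or incrementally maintained) before each ScaleUp call so the per-tag-set limits are checked against current state.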