
Align Cluster Autoscaler taints with k8s naming #5433

Open · 4 tasks
x13n opened this issue Jan 19, 2023 · 22 comments
Labels: area/cluster-autoscaler, area/core-autoscaler, help wanted, kind/feature

x13n (Member) commented Jan 19, 2023

Which component are you using?:

Cluster Autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

Existing CA taints don't follow the k8s naming pattern (cluster-autoscaler.kubernetes.io/...) that our annotations already use.

Describe the solution you'd like.:

We should migrate to new taints and document them on https://kubernetes.io/docs/reference/labels-annotations-taints/

Steps:

  1. Introduce new taints in addition to the existing ones and start double-tainting nodes that are being scaled down. Take care not to increase QPS to kube-apiserver: both the new and the old taint should be added in a single request (a sketch of this is included under "Additional context" below).
  2. Document the new taints on the k8s website and call out the change in the CA release notes.
  3. Wait 1-2 releases.
  4. Drop support for the old taints.

Describe any alternative solutions you've considered.:

Additional context.:

See kubernetes/website#45074
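
To illustrate step 1, here is a rough sketch of what double-tainting in a single kube-apiserver request could look like. The new taint key and the helper names are placeholders, not a final API.

```go
// Rough sketch only: the new taint key below is a placeholder for whatever
// naming we settle on; the existing key is ToBeDeletedByClusterAutoscaler.
package taints

import (
	"context"

	apiv1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kube_client "k8s.io/client-go/kubernetes"
)

const (
	oldToBeDeletedTaint = "ToBeDeletedByClusterAutoscaler"                 // existing key
	newToBeDeletedTaint = "cluster-autoscaler.kubernetes.io/to-be-deleted" // placeholder new key
)

// markToBeDeleted adds both the old and the new taint to the node and writes
// them back with a single Update call, so kube-apiserver QPS does not grow.
func markToBeDeleted(ctx context.Context, client kube_client.Interface, node *apiv1.Node) error {
	updated := node.DeepCopy()
	for _, key := range []string{oldToBeDeletedTaint, newToBeDeletedTaint} {
		if !hasTaint(updated, key) {
			updated.Spec.Taints = append(updated.Spec.Taints, apiv1.Taint{
				Key:    key,
				Effect: apiv1.TaintEffectNoSchedule,
			})
		}
	}
	// Single write covering both taints.
	_, err := client.CoreV1().Nodes().Update(ctx, updated, metav1.UpdateOptions{})
	return err
}

func hasTaint(node *apiv1.Node, key string) bool {
	for _, t := range node.Spec.Taints {
		if t.Key == key {
			return true
		}
	}
	return false
}
```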

x13n added the kind/feature label on Jan 19, 2023
x13n (Member, Author) commented Jan 19, 2023

/help

k8s-ci-robot (Contributor) commented

@x13n:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added the help wanted label on Jan 19, 2023
gregth (Contributor) commented Feb 6, 2023

@x13n I'd be glad to work on this issue. What is the timeline? How can we coordinate this?

x13n (Member, Author) commented Feb 7, 2023

I think implementation can start right away. Suggested timeline:

  1. CA 1.27 is released with double-tainting. A flag can be used to restore the previous behavior (which might be useful if someone, somewhere has a toleration for the existing taints).
  2. CA 1.28 is released using only the new taint. A flag can be used to restore the previous behavior (either old-only or double-tainting).
  3. CA 1.29+: flag and old logic removed.

/cc @MaciekPytel @gjtempleton @mwielgus

Any concerns with this approach?

sftim (Contributor) commented Feb 9, 2023

In case people are tolerating only the existing taints, do we need to communicate the change a bit more than usual? E.g. an article on https://blog.k8s.io/

x13n (Member, Author) commented Feb 10, 2023

I don't have a strong opinion on this. On one hand, I'm not sure what the use case would be for scheduling pods on a node that is being deleted. On the other hand, I guess it is better to over-communicate. How do we make a release note appear on the blog?

sftim (Contributor) commented Feb 10, 2023

It'd be a new blog article - see https://kubernetes.io/docs/contribute/new-content/blogs-case-studies/ (which is a bit wrong but is mostly correct).

It looks like none of the CA taints are the NoExecute type. If that's right, I don't think a blog article is needed.

gregth (Contributor) commented Feb 11, 2023

Some thoughts on implementing double-tainting:

  1. We will need to support the case where a user wants to keep the previous behavior (i.e., use only the old taint). To do so, we have to pass a boolean option from the autoscaling options down to the taint-related functions. Currently, no autoscaling options are passed to these functions, so some restructuring might be needed.

  2. We will modify our taint methods to update the node object in a single request, i.e., no extra requests to the API server.

  3. In various places we use the HasToBeDeletedTaint() function to check whether a node is marked for deletion. With two taints, when do we consider a node marked for deletion? I can think of two options:

    1. Require the node to have either of the taints (logical OR)
    2. Require the node to have both taints (logical AND)

    The autoscaler will add both the old and the new taint in one request, but what if a user somehow removes one of them? In that context, it seems safer to check whether at least one of them exists rather than require both. Please share your opinion on this.

Some more questions about the plan

I was revisiting our plan and have some questions:

  1. What is the purpose/use case of double-tainting?
  2. Can't we simply introduce the new taints in the next release, with a flag to switch back to the old taints, and then gradually remove the flag in upcoming releases? In the meantime, we would notify users about the upcoming changes in our changelog (so that they have time to prepare).

The scenario I can think of is that users' Pods tolerate the existing taint, so they will need to tolerate the new taint too. In that case, if we want to keep backward compatibility, they will need to pass a flag and ask the autoscaler to use the old taints. If we use double-tainting, how does the user benefit from it? Please provide any context I'm missing.

Cc: @x13n

x13n (Member, Author) commented Feb 15, 2023

> Some thoughts on implementing double-tainting:
>
> 1. We will need to support the case where a user wants to keep the previous behavior (i.e., use only the old taint). To do so, we have to pass a boolean option from the autoscaling options down to the taint-related functions. Currently, no autoscaling options are passed to these functions, so some restructuring might be needed.

I think we actually need not a bool but an enum (in practice, a string) flag allowing one of three values: old / new / both.

> 2. We will modify our taint methods to update the node object in a single request, i.e., no extra requests to the API server.
>
> 3. In various places we use the HasToBeDeletedTaint() function to check whether a node is marked for deletion. With two taints, when do we consider a node marked for deletion? I can think of two options:
>
>     1. Require the node to have either of the taints (logical OR)
>     2. Require the node to have both taints (logical AND)
>
>     The autoscaler will add both the old and the new taint in one request, but what if a user somehow removes one of them? In that context, it seems safer to check whether at least one of them exists rather than require both. Please share your opinion on this.

I agree, OR seems to be the right check.
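
A minimal sketch of what that check could look like (key names are placeholders again, not a final API):

```go
// Sketch of the OR semantics: a node counts as marked for deletion if it
// carries either the old or the new taint key. Key names are placeholders.
package taints

import apiv1 "k8s.io/api/core/v1"

const (
	oldToBeDeletedTaint = "ToBeDeletedByClusterAutoscaler"                 // existing key
	newToBeDeletedTaint = "cluster-autoscaler.kubernetes.io/to-be-deleted" // placeholder new key
)

// HasToBeDeletedTaint returns true if the node carries either taint key.
func HasToBeDeletedTaint(node *apiv1.Node) bool {
	for _, taint := range node.Spec.Taints {
		if taint.Key == oldToBeDeletedTaint || taint.Key == newToBeDeletedTaint {
			return true
		}
	}
	return false
}
```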

> Some more questions about the plan
>
> I was revisiting our plan and have some questions:
>
> 1. What is the purpose/use case of double-tainting?
> 2. Can't we simply introduce the new taints in the next release, with a flag to switch back to the old taints, and then gradually remove the flag in upcoming releases? In the meantime, we would notify users about the upcoming changes in our changelog (so that they have time to prepare).
>
> The scenario I can think of is that users' Pods tolerate the existing taint, so they will need to tolerate the new taint too. In that case, if we want to keep backward compatibility, they will need to pass a flag and ask the autoscaler to use the old taints. If we use double-tainting, how does the user benefit from it? Please provide any context I'm missing.
>
> Cc: @x13n

I think in the case of tolerations you're right: double-tainting doesn't provide much benefit there, since workloads could carry two tolerations to allow a gradual rollout. The use case I had in mind was actually a different kind of dependency: some other controller (like this one) using the taint as a signal that CA will be deleting the node. To let such use cases migrate, we give them two signals that coexist for at least one release cycle.

Heads up @thockin @alexanderConstantinescu

fmuyassarov (Member) commented

Hi @x13n @gregth, I see from the discussion here that the implementation approach is not finalized yet. I'm new to this project, but I would be happy to give it a try if you think this issue is still relevant.

x13n (Member, Author) commented May 24, 2023

Thanks for volunteering! There hasn't been any discussion for a while now, so I'd proceed with the first step from my original description: double-tainting (flag-gated and implemented within a single API call, as @gregth suggested). The flag should default to double-tainting, but also allow "only new" and "only old".
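
Roughly something like this on the flag side; the flag name and accepted values below are placeholders, not a final interface:

```go
// Sketch of the flag wiring only; the flag name and accepted values are
// placeholders, not a final interface.
package main

import (
	"flag"
	"fmt"
	"os"
)

const (
	taintModeOld  = "old"  // only the existing taint keys
	taintModeNew  = "new"  // only the new, k8s-style taint keys
	taintModeBoth = "both" // double-tainting (proposed default)
)

var taintNaming = flag.String("taint-naming", taintModeBoth,
	"Which taint keys Cluster Autoscaler applies to nodes: old, new or both.")

func main() {
	flag.Parse()
	switch *taintNaming {
	case taintModeOld, taintModeNew, taintModeBoth:
		// Valid value; pass it down to the taint helpers.
	default:
		fmt.Fprintf(os.Stderr, "invalid --taint-naming value %q\n", *taintNaming)
		os.Exit(1)
	}
}
```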

x13n (Member, Author) commented Jun 23, 2023

In light of kubernetes/org#4258, we should probably align the prefix between components to avoid migrating twice. @MaciekPytel @mwielgus @gjtempleton - thoughts?

fmuyassarov (Member) commented Feb 15, 2024

Sorry for such a late response on this, but I'm back now and starting today. As agreed above, I will implement double-tainting with a flag to opt in to "only new" or "only old", with "both" as the default.
/assign

x13n (Member, Author) commented Feb 15, 2024 via email

fmuyassarov (Member) commented

@x13n I'm almost done with the change and want to check something regarding the taint format before submitting the changes. The new format looks like this:

ToBeDeletedTaint = "to-be-deleted.cluster-autoscaler.kubernetes.io/"
DeletionCandidateTaint = "deletion-candidate.cluster-autoscaler.kubernetes.io/"

Basically, instead of using camelCase I have used hyphens and added the cluster-autoscaler.kubernetes.io/ suffix. Does the format look correct / as you expected?
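
For reference, this is roughly how the proposal would sit next to the current constants (the current values are quoted from memory, and the New* names are just for illustration, not what the constants would actually be called):

```go
// Sketch only: the "new" values are the proposal above, not a final decision,
// and the current values are quoted from memory; please double-check them.
package taints

const (
	// Current keys (from cluster-autoscaler/utils/taints).
	ToBeDeletedTaint       = "ToBeDeletedByClusterAutoscaler"
	DeletionCandidateTaint = "DeletionCandidateOfClusterAutoscaler"

	// Proposed replacements following the cluster-autoscaler.kubernetes.io naming.
	NewToBeDeletedTaint       = "to-be-deleted.cluster-autoscaler.kubernetes.io/"
	NewDeletionCandidateTaint = "deletion-candidate.cluster-autoscaler.kubernetes.io/"
)
```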

towca (Collaborator) commented Feb 23, 2024

Hey @fmuyassarov! As @x13n mentioned above, we're currently in discussions with Karpenter about unifying the common parts of our APIs. These taints are one such common part, and we still need to align on the exact API group and naming for them. Could you hold off your change until that happens? I can reach out when we have the naming ready. Unfortunately right now it's hard for me to predict when exactly that'll be (probably a couple of months), but I'd really like to avoid changing the API twice for this.

fmuyassarov (Member) commented

> we're currently in discussions with Karpenter about unifying the common parts of our APIs.

Hey @towca. Yes, it totally makes sense to keep it on hold for now and thanks for the update. Is the discussion with the Karpenter community being tracked somewhere on GitHub where I can follow it too?

fmuyassarov (Member) commented

/hold

Bryce-Soghigian (Member) commented

cc: @jonathan-innis, @ellistarn, @njtran

fmuyassarov (Member) commented

> we're currently in discussions with Karpenter about unifying the common parts of our APIs.
>
> Hey @towca. Yes, it totally makes sense to keep it on hold for now and thanks for the update. Is the discussion with the Karpenter community being tracked somewhere on GitHub where I can follow it too?

I think I found it; the discussion is probably happening in this Google doc: https://docs.google.com/document/d/1_KCCr5CzxmurFX_6TGLav6iMwCxPOj0P/edit

jonathan-innis commented

> I think I found it; the discussion is probably happening in this Google doc

Hey 👋🏼, chiming in here from the Karpenter maintainer team. I don't think we have really gone back to that doc for active discussion in a while. I've been talking with @towca a bit about this, but I agree that it's going to take some time for both projects to come into alignment. I'd be open to starting a discussion over in the K8s Slack with everyone who wants to be involved. We could just do it in the #sig-autoscaling channel, though I expect it might get lost in conversation. It would be really nice if we could call this a "working group discussion" and get a "wg" channel started in the K8s Slack, so we can get all of the relevant parties involved and have an active discussion there specifically on this topic.

towca (Collaborator) commented Feb 27, 2024

@jonathan-innis I'll look into setting up the proper communication channels to discuss this.

towca added the area/core-autoscaler label on Mar 21, 2024