
Align Cluster Autoscaler taints with k8s naming #5433

Open · 4 tasks
x13n opened this issue Jan 19, 2023 · 22 comments
Labels: area/cluster-autoscaler, area/core-autoscaler, help wanted, kind/feature

x13n (Member) commented Jan 19, 2023

Which component are you using?:

Cluster Autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

Existing CA taints don't follow the k8s naming pattern (cluster-autoscaler.kubernetes.io/...) that our annotations already use.

Describe the solution you'd like.:

We should migrate to new taints and document them on https://kubernetes.io/docs/reference/labels-annotations-taints/

Steps:

  1. Introduce new taints in addition to the existing ones and start double-tainting nodes that are being scaled down. Take care not to increase QPS to kube-apiserver: both the new and the old taint should be added in a single request (a sketch of this is included under "Additional context" below).
  2. Document the new taints on the k8s website and call out the change in the CA release notes.
  3. Wait 1-2 releases.
  4. Drop support for the old taints.

Describe any alternative solutions you've considered.:

Additional context.:

See kubernetes/website#45074
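
To illustrate step 1, here is a rough sketch of what double-tainting in a single kube-apiserver request could look like. The new taint key and the helper names are placeholders, not a final API.

```go
// Rough sketch only: the new taint key below is a placeholder for whatever
// naming we settle on; the existing key is ToBeDeletedByClusterAutoscaler.
package taints

import (
	"context"

	apiv1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kube_client "k8s.io/client-go/kubernetes"
)

const (
	oldToBeDeletedTaint = "ToBeDeletedByClusterAutoscaler"                 // existing key
	newToBeDeletedTaint = "cluster-autoscaler.kubernetes.io/to-be-deleted" // placeholder new key
)

// markToBeDeleted adds both the old and the new taint to the node and writes
// them back with a single Update call, so kube-apiserver QPS does not grow.
func markToBeDeleted(ctx context.Context, client kube_client.Interface, node *apiv1.Node) error {
	updated := node.DeepCopy()
	for _, key := range []string{oldToBeDeletedTaint, newToBeDeletedTaint} {
		if !hasTaint(updated, key) {
			updated.Spec.Taints = append(updated.Spec.Taints, apiv1.Taint{
				Key:    key,
				Effect: apiv1.TaintEffectNoSchedule,
			})
		}
	}
	// Single write covering both taints.
	_, err := client.CoreV1().Nodes().Update(ctx, updated, metav1.UpdateOptions{})
	return err
}

func hasTaint(node *apiv1.Node, key string) bool {
	for _, t := range node.Spec.Taints {
		if t.Key == key {
			return true
		}
	}
	return false
}
```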

x13n added the kind/feature label on Jan 19, 2023
x13n (Member, Author) commented Jan 19, 2023

/help

k8s-ci-robot (Contributor) commented

@x13n:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added the help wanted label on Jan 19, 2023
gregth (Contributor) commented Feb 6, 2023

@x13n I'd be glad to work on this issue. What is the timeline? How can we coordinate this?

x13n (Member, Author) commented Feb 7, 2023

I think implementation can start right away. Suggested timeline:

  1. CA 1.27 is released with double-tainting. A flag can be used to restore the previous behavior (which might be useful if someone, somewhere has a toleration for the existing taints).
  2. CA 1.28 is released using only the new taint. A flag can be used to restore the previous behavior (either old-only or double-tainting).
  3. CA 1.29+: flag and old logic removed.

/cc @MaciekPytel @gjtempleton @mwielgus

Any concerns with this approach?

sftim (Contributor) commented Feb 9, 2023

In case people are tolerating only the existing taints, do we need to communicate the change a bit more than usual? E.g. an article on https://blog.k8s.io/

x13n (Member, Author) commented Feb 10, 2023

I don't have a strong opinion on this. On one hand, I'm not sure what the use case would be for scheduling pods on a node that is being deleted. On the other hand, I guess it is better to over-communicate. How do we make a release note appear on the blog?

sftim (Contributor) commented Feb 10, 2023

It'd be a new blog article - see https://kubernetes.io/docs/contribute/new-content/blogs-case-studies/ (which is a bit wrong but is mostly correct).

It looks like none of the CA taints are the NoExecute type. If that's right, I don't think a blog article is needed.

gregth (Contributor) commented Feb 11, 2023

Some thoughts on implementing double-tainting:

  1. We will need to support the case where a user wants to keep the previous behavior (i.e., use only the old taint). To do so, we have to pass a boolean option from the autoscaling options down to the taint-related functions. Currently, no autoscaling options are passed to these functions, so some restructuring might be needed.

  2. We will modify our taint methods to update the node object in a single request, i.e., no extra requests to the API server.

  3. In various places we use the HasToBeDeletedTaint() function to check whether a node is marked for deletion. With two taints, when do we consider a node marked for deletion? I can think of two options:

    1. Require the node to have either of the taints (logical OR)
    2. Require the node to have both taints (logical AND)

    The autoscaler will add both the old and the new taint in one request, but what if a user somehow removes one of them? In that context, it seems safer to check whether at least one of them exists rather than require both. Please share your opinion on this.

Some more questions about the plan

I was revisiting our plan and have some questions:

  1. What is the purpose/use case of double-tainting?
  2. Can't we simply introduce the new taints in the next release, with a flag to switch back to the old taints, and then gradually remove the flag in upcoming releases? In the meantime, we would notify users about the upcoming changes in our changelog (so that they have time to prepare).

The scenario I can think of is that users' Pods tolerate the existing taint, so they will need to tolerate the new taint too. In that case, if we want to keep backward compatibility, they will need to pass a flag and ask the autoscaler to use the old taints. If we use double-tainting, how does the user benefit from it? Please provide any context I'm missing.

Cc: @x13n

x13n (Member, Author) commented Feb 15, 2023

> Some thoughts on implementing double-tainting:
>
> 1. We will need to support the case where a user wants to keep the previous behavior (i.e., use only the old taint). To do so, we have to pass a boolean option from the autoscaling options down to the taint-related functions. Currently, no autoscaling options are passed to these functions, so some restructuring might be needed.

I think we actually need not a bool but an enum (in practice, a string) flag allowing one of three values: old / new / both.

> 2. We will modify our taint methods to update the node object in a single request, i.e., no extra requests to the API server.
>
> 3. In various places we use the HasToBeDeletedTaint() function to check whether a node is marked for deletion. With two taints, when do we consider a node marked for deletion? I can think of two options:
>
>     1. Require the node to have either of the taints (logical OR)
>     2. Require the node to have both taints (logical AND)
>
>     The autoscaler will add both the old and the new taint in one request, but what if a user somehow removes one of them? In that context, it seems safer to check whether at least one of them exists rather than require both. Please share your opinion on this.

I agree, OR seems to be the right check.
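
A minimal sketch of what that check could look like (key names are placeholders again, not a final API):

```go
// Sketch of the OR semantics: a node counts as marked for deletion if it
// carries either the old or the new taint key. Key names are placeholders.
package taints

import apiv1 "k8s.io/api/core/v1"

const (
	oldToBeDeletedTaint = "ToBeDeletedByClusterAutoscaler"                 // existing key
	newToBeDeletedTaint = "cluster-autoscaler.kubernetes.io/to-be-deleted" // placeholder new key
)

// HasToBeDeletedTaint returns true if the node carries either taint key.
func HasToBeDeletedTaint(node *apiv1.Node) bool {
	for _, taint := range node.Spec.Taints {
		if taint.Key == oldToBeDeletedTaint || taint.Key == newToBeDeletedTaint {
			return true
		}
	}
	return false
}
```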

> Some more questions about the plan
>
> I was revisiting our plan and have some questions:
>
> 1. What is the purpose/use case of double-tainting?
> 2. Can't we simply introduce the new taints in the next release, with a flag to switch back to the old taints, and then gradually remove the flag in upcoming releases? In the meantime, we would notify users about the upcoming changes in our changelog (so that they have time to prepare).
>
> The scenario I can think of is that users' Pods tolerate the existing taint, so they will need to tolerate the new taint too. In that case, if we want to keep backward compatibility, they will need to pass a flag and ask the autoscaler to use the old taints. If we use double-tainting, how does the user benefit from it? Please provide any context I'm missing.
>
> Cc: @x13n

I think in the case of tolerations you're right: double-tainting doesn't provide much benefit there, since workloads could carry two tolerations to allow a gradual rollout. The use case I had in mind was actually a different kind of dependency: some other controller (like this one) using the taint as a signal that CA will be deleting the node. To let such use cases migrate, we give them two signals that coexist for at least one release cycle.

Heads up @thockin @alexanderConstantinescu

fmuyassarov (Member) commented

Hi @x13n @gregth, I see from the discussion here that the implementation approach is not finalized yet. I'm new to this project, but I would be happy to give it a try if you think this issue is still relevant.

x13n (Member, Author) commented May 24, 2023

Thanks for volunteering! There hasn't been any discussion for a while now, so I'd proceed with the first step from my original description: double-tainting (flag-gated and implemented within a single API call, as @gregth suggested). The flag should default to double-tainting, but also allow "only new" and "only old".
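
Roughly something like this on the flag side; the flag name and accepted values below are placeholders, not a final interface:

```go
// Sketch of the flag wiring only; the flag name and accepted values are
// placeholders, not a final interface.
package main

import (
	"flag"
	"fmt"
	"os"
)

const (
	taintModeOld  = "old"  // only the existing taint keys
	taintModeNew  = "new"  // only the new, k8s-style taint keys
	taintModeBoth = "both" // double-tainting (proposed default)
)

var taintNaming = flag.String("taint-naming", taintModeBoth,
	"Which taint keys Cluster Autoscaler applies to nodes: old, new or both.")

func main() {
	flag.Parse()
	switch *taintNaming {
	case taintModeOld, taintModeNew, taintModeBoth:
		// Valid value; pass it down to the taint helpers.
	default:
		fmt.Fprintf(os.Stderr, "invalid --taint-naming value %q\n", *taintNaming)
		os.Exit(1)
	}
}
```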

x13n (Member, Author) commented Jun 23, 2023

In light of kubernetes/org#4258, we should probably align the prefix between components to avoid migrating twice. @MaciekPytel @mwielgus @gjtempleton - thoughts?

fmuyassarov (Member) commented Feb 15, 2024

Sorry for such a late response on this, but I'm back now and starting today. As agreed above, I will implement double-tainting with a flag to opt in to "only new" or "only old", with "both" as the default.
/assign

x13n (Member, Author) commented Feb 15, 2024 via email

fmuyassarov (Member) commented

@x13n I'm almost done with the change and want to check something regarding the taint format before submitting the changes. The new format looks like this:

ToBeDeletedTaint = "to-be-deleted.cluster-autoscaler.kubernetes.io/"
DeletionCandidateTaint = "deletion-candidate.cluster-autoscaler.kubernetes.io/"

Basically, instead of using camelCase I have used hyphens and added the cluster-autoscaler.kubernetes.io/ suffix. Does the format look correct / as you expected?
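
For reference, this is roughly how the proposal would sit next to the current constants (the current values are quoted from memory, and the New* names are just for illustration, not what the constants would actually be called):

```go
// Sketch only: the "new" values are the proposal above, not a final decision,
// and the current values are quoted from memory; please double-check them.
package taints

const (
	// Current keys (from cluster-autoscaler/utils/taints).
	ToBeDeletedTaint       = "ToBeDeletedByClusterAutoscaler"
	DeletionCandidateTaint = "DeletionCandidateOfClusterAutoscaler"

	// Proposed replacements following the cluster-autoscaler.kubernetes.io naming.
	NewToBeDeletedTaint       = "to-be-deleted.cluster-autoscaler.kubernetes.io/"
	NewDeletionCandidateTaint = "deletion-candidate.cluster-autoscaler.kubernetes.io/"
)
```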

towca (Collaborator) commented Feb 23, 2024

Hey @fmuyassarov! As @x13n mentioned above, we're currently in discussions with Karpenter about unifying the common parts of our APIs. These taints are one such common part, and we still need to align on the exact API group and naming for them. Could you hold off your change until that happens? I can reach out when we have the naming ready. Unfortunately right now it's hard for me to predict when exactly that'll be (probably a couple of months), but I'd really like to avoid changing the API twice for this.

fmuyassarov (Member) commented

> we're currently in discussions with Karpenter about unifying the common parts of our APIs.

Hey @towca. Yes, it totally makes sense to keep it on hold for now and thanks for the update. Is the discussion with the Karpenter community being tracked somewhere on GitHub where I can follow it too?

fmuyassarov (Member) commented

/hold

Bryce-Soghigian (Member) commented

cc: @jonathan-innis, @ellistarn, @njtran

fmuyassarov (Member) commented

> we're currently in discussions with Karpenter about unifying the common parts of our APIs.
>
> Hey @towca. Yes, it totally makes sense to keep it on hold for now and thanks for the update. Is the discussion with the Karpenter community being tracked somewhere on GitHub where I can follow it too?

I think I found it; the discussion is probably happening in this Google doc: https://docs.google.com/document/d/1_KCCr5CzxmurFX_6TGLav6iMwCxPOj0P/edit

jonathan-innis commented

> I think I found it; the discussion is probably happening in this Google doc

Hey 👋🏼, chiming in here from the Karpenter maintainer team. I don't think we have really gone back to that doc for active discussion in a while. I've been talking with @towca a bit about this, but I agree that it's going to take some time for both projects to come into alignment. I'd be open to starting a discussion over in the K8s Slack with everyone who wants to be involved. We could just do it in the #sig-autoscaling channel, though I expect it might get lost in conversation. It would be really nice if we could call this a "working group discussion" and get a "wg" channel started in the K8s Slack, so we can get all of the relevant parties involved and have an active discussion there specifically on this topic.

towca (Collaborator) commented Feb 27, 2024

@jonathan-innis I'll look into setting up the proper communication channels to discuss this.

towca added the area/core-autoscaler label on Mar 21, 2024