Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1: Bring back a "soft expiration" mechanism #1750

Open
mikkoc opened this issue Oct 14, 2024 · 9 comments
Open

v1: Bring back a "soft expiration" mechanism #1750

mikkoc opened this issue Oct 14, 2024 · 9 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@mikkoc
Copy link

mikkoc commented Oct 14, 2024

Description

What problem are you trying to solve?

expireAfter should respect disruption budgets in v1, like it was in 0.37

We do want to keep our nodes "fresh" for security reasons, but we only want these rotations to happen during working hours, in order to minimise the chance (even if tiny) of something going wrong and getting paged during nights/weekends.

How important is this feature to you?

6 out of 10.
See: aws/karpenter-provider-aws#7122

@mikkoc mikkoc added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 14, 2024
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If Karpenter contributors determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Oct 14, 2024
@njtran
Copy link
Contributor

njtran commented Oct 14, 2024

Can you add more details about why this doesn't work for you? Why do you view expiration as a graceful mechanism rather than a forceful? What do you use it for? Is it better to use a different mechanism?

@mikkoc
Copy link
Author

mikkoc commented Oct 15, 2024

Can you add more details about why this doesn't work for you? Why do you view expiration as a graceful mechanism rather than a forceful? What do you use it for? Is it better to use a different mechanism?

Added more context. Maybe giving the user a choice between soft and forceful could be good?

@kristofferahl
Copy link

In our case it's simply a question of being able to control when expiration happens. So essentially having a maintenance window to make sure expirations don't happen during certain times of the day or only during weekends. But perhaps there are other mechanisms for achieving this?

@ulrichwinter
Copy link

We consider this feature as important for providing runtime environment, which is both secure and zero downtime.
What alternative do you see to enforce regular rotation of nodes to keep them up to date but without forcefully shutting down deployments?
Please consider this important feature request.

@dnmgns
Copy link

dnmgns commented Oct 15, 2024

We consider this feature as important for providing runtime environment, which is both secure and zero downtime.

What alternative do you see to enforce regular rotation of nodes to keep them up to date but without forcefully shutting down deployments?

Please consider this important feature request.

This matches our use case as well, where we would like to rotate some nodes every X hour and still have graceful shutdowns. We are getting interruptions for non-HA-compatible workload because of the current forceful mechanism. I don't see any good alternatives for rotating the nodes gracefully (except some solutions which would involve adding code and complexity on our end).

Maybe a good way forward would be to let the user specify graceful/forceful for expireAfter with an optional graceful termination timeout?

I guess the reason/motivation behind the current behavior is that karpenter wants to ensure that the node is really expired right away once the max lifetime is reached.

I also guess that one could argue that do-not-disrupt could be used for this. But the issue there is that it will block Karpenter from voluntarily choosing to disrupt certain nodes. While in the case where it would voluntarily choose to do so, it wouldn't cause forceful terminations. Right? So that doesn't seem like a good alternative either, as it would disturb the great workload/instance rebalancing actions of karpenter.

@jukie
Copy link

jukie commented Oct 17, 2024

I was unaware the expireAfter no longer respects PDBs, I thought nodes would be immediately marked for disruption but it would still perform terminations in a graceful manner. Is that not true? Also does terminationGracePeriod help with that scenario?

@dnmgns would adding a schedule option for do-not-disrupt help? I opened #1719 which might be related to your use case.

@sidewinder12s
Copy link

Yes, this is all about controlling when we introduce churn, even if that reason is to enforce policy.

@nonoswz
Copy link

nonoswz commented Nov 8, 2024

We have a similar use case and would also like to get back the previous expireAfter mechanisms:

  • to have either a sequential expiration (1 node after the other)
  • or a disruption budget specific to expiration (e.g. being able to add Expiry as a reason into existing budgets).

We would also like to get back a way to disable the expiry on existing nodes that are set to expire (e.g. If we need to disable node rotation during incidents). We used to be able to set expireAfter: Never and all existing nodes would automatically get their expiration set to Never. Since v1, existing nodes will still expire, except if we add the do-not-disrupt annotation temporarily (which is not as nice as you need to manage at the node level and not the nodepool) ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.
Projects
None yet
Development

No branches or pull requests

9 participants