v1: Bring back a "soft expiration" mechanism #1750

mikkoc · 2024-10-14T07:30:27Z

Description

What problem are you trying to solve?

expireAfter should respect disruption budgets in v1, as it did in 0.37.

We do want to keep our nodes "fresh" for security reasons, but we only want these rotations to happen during working hours, in order to minimise the chance (even if tiny) of something going wrong and getting paged during nights/weekends.

How important is this feature to you?

6 out of 10.
See: aws/karpenter-provider-aws#7122


k8s-ci-robot · 2024-10-14T07:30:35Z

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

njtran · 2024-10-14T20:58:01Z

Can you add more details about why this doesn't work for you? Why do you view expiration as a graceful mechanism rather than a forceful? What do you use it for? Is it better to use a different mechanism?

mikkoc · 2024-10-15T08:00:36Z

> Can you add more details about why this doesn't work for you? Why do you view expiration as a graceful mechanism rather than a forceful? What do you use it for? Is it better to use a different mechanism?

Added more context. Maybe giving the user a choice between soft and forceful could be good?

kristofferahl · 2024-10-15T08:04:21Z

In our case it's simply a question of being able to control when expiration happens. So essentially having a maintenance window to make sure expirations don't happen during certain times of the day or only during weekends. But perhaps there are other mechanisms for achieving this?
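The maintenance-window idea described above can be sketched as a simple time check. This is a hypothetical illustration in Python, not Karpenter code: the function name `expiration_allowed`, the 09:00 to 17:00 default window, and the `weekdays_only` flag are all assumptions made for the example.

```python
from datetime import datetime, time


def expiration_allowed(now: datetime,
                       start: time = time(9, 0),
                       end: time = time(17, 0),
                       weekdays_only: bool = True) -> bool:
    """Return True if `now` falls inside the (hypothetical) maintenance window."""
    # datetime.weekday(): Monday is 0, Sunday is 6. Skip weekends when requested.
    if weekdays_only and now.weekday() >= 5:
        return False
    # Half-open interval: `start` is included, `end` is not.
    return start <= now.time() < end
```

A controller following this sketch would postpone expiring a node until the check returns True, which is the "only during working hours" behaviour requested in this issue.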

ulrichwinter · 2024-10-15T08:51:57Z

We consider this feature important for providing a runtime environment that is both secure and zero-downtime.
What alternative do you see to enforce regular rotation of nodes to keep them up to date but without forcefully shutting down deployments?
Please consider this important feature request.

dnmgns · 2024-10-15T21:05:15Z

> We consider this feature as important for providing runtime environment, which is both secure and zero downtime.
>
> What alternative do you see to enforce regular rotation of nodes to keep them up to date but without forcefully shutting down deployments?
>
> Please consider this important feature request.

This matches our use case as well: we would like to rotate some nodes every X hours and still have graceful shutdowns. We are getting interruptions for non-HA-compatible workloads because of the current forceful mechanism. I don't see any good alternatives for rotating the nodes gracefully (except some solutions that would involve adding code and complexity on our end).

Maybe a good way forward would be to let the user specify graceful/forceful for expireAfter with an optional graceful termination timeout?
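A rough sketch of what such a choice could look like follows. It is purely illustrative and does not reflect any existing Karpenter API: `ExpirationPolicy`, `graceful`, `grace_timeout`, `next_action`, and the action names are hypothetical, invented for this example.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class ExpirationPolicy:
    """Hypothetical policy: none of these fields exist in Karpenter."""
    expire_after: timedelta
    graceful: bool = True                       # soft (graceful) vs. forceful
    grace_timeout: Optional[timedelta] = None   # None means wait indefinitely


def next_action(age: timedelta, policy: ExpirationPolicy, blocked_by_pdb: bool) -> str:
    """Decide what to do with a node of the given age."""
    if age < policy.expire_after:
        return "keep"
    if not policy.graceful:
        # Forceful mode: terminate as soon as the lifetime is exceeded.
        return "force-terminate"
    overdue = age - policy.expire_after
    if policy.grace_timeout is not None and overdue >= policy.grace_timeout:
        # The graceful period has run out, so fall back to forceful termination.
        return "force-terminate"
    # Graceful mode: drain only if PDBs (or similar) allow it, otherwise wait.
    return "wait" if blocked_by_pdb else "drain"
```

With `graceful=True` and a finite `grace_timeout`, the node lifetime is still bounded by `expire_after + grace_timeout`, which would keep a hard upper limit while allowing graceful drains in the common case.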

I guess the motivation behind the current behavior is that Karpenter wants to ensure that the node is really expired right away once the max lifetime is reached.

One could also argue that do-not-disrupt could be used for this. The problem is that it blocks Karpenter from voluntarily disrupting those nodes, even though voluntary disruption wouldn't cause forceful terminations. Right? So that doesn't seem like a good alternative either, since it would interfere with Karpenter's workload and instance rebalancing.

jukie · 2024-10-17T22:19:05Z

I was unaware that expireAfter no longer respects PDBs. I thought nodes would be immediately marked for disruption but terminations would still be performed gracefully. Is that not true? Also, does terminationGracePeriod help in that scenario?

@dnmgns would adding a schedule option for do-not-disrupt help? I opened #1719 which might be related to your use case.

sidewinder12s · 2024-10-30T01:37:11Z

Yes, this is all about controlling when we introduce churn, even if that reason is to enforce policy.

nonoswz · 2024-11-08T20:02:08Z

We have a similar use case and would also like to get back the previous expireAfter mechanisms:

- either a sequential expiration (one node after the other),
- or a disruption budget specific to expiration (e.g. being able to add Expiry as a reason to existing budgets).

We would also like to get back a way to disable expiry on existing nodes that are set to expire (e.g. if we need to disable node rotation during incidents). We used to be able to set expireAfter: Never and all existing nodes would automatically get their expiration set to Never. Since v1, existing nodes will still expire unless we temporarily add the do-not-disrupt annotation, which is not as nice because it has to be managed at the node level rather than at the NodePool level.
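The two options requested above (sequential expiration, or an expiration-specific budget) can both be modelled as a per-reason budget. The sketch below is hypothetical: the `Budget` class, the `allowed_disruptions` function, and the `"Expired"` reason name are assumptions for illustration, and it assumes the most restrictive applicable budget wins.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Budget:
    """Hypothetical budget: at most `max_nodes` disrupting at once."""
    max_nodes: int
    reasons: Optional[Sequence[str]] = None  # None means "applies to every reason"


def allowed_disruptions(budgets: Sequence[Budget], reason: str,
                        disrupting: int, total_nodes: int) -> int:
    """How many more nodes may be disrupted for `reason` right now."""
    # Collect the limits of every budget that applies to this reason.
    limits = [b.max_nodes for b in budgets
              if b.reasons is None or reason in b.reasons]
    # Assumption: the most restrictive applicable budget wins;
    # with no applicable budget, every node may be disrupted.
    limit = min(limits, default=total_nodes)
    return max(0, limit - disrupting)
```

Under these assumptions, a budget such as `Budget(1, ["Expired"])` gives sequential expiration (one node at a time), and `Budget(0, ["Expired"])` would pause expiry for the whole pool, for example during an incident.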

mikkoc added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 14, 2024

k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Oct 14, 2024
