
Make CP node deletion param configurable #6571

Open

gandhipr wants to merge 2 commits into master from prachigandhi/make-cp-node-deletion-configurable

Conversation

@gandhipr (Contributor) commented Feb 27, 2024

What type of PR is this?

feature

What this PR does / why we need it:

Make the cloud-provider node deletion timeout configurable so that each cloud provider can set it according to its own needs.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

maxCloudProviderNodeDeletionTime (max-cloud-provider-node-deletion-time) can now be configured externally by users.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 27, 2024
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 16, 2024
@tallaxes (Contributor) left a comment


I don't see maxCloudProviderNodeDeletionTime propagated into / captured in AutoscalingOptions?
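For context, propagation along the lines being asked about might look like the following sketch; the AutoscalingOptions field name is assumed from the flag name and is not taken from this PR's diff:

    // In cluster-autoscaler/config/autoscaling_options.go (assumed field):
    type AutoscalingOptions struct {
        // ...existing fields elided...

        // MaxCloudProviderNodeDeletionTime is the maximum time the cloud
        // provider is expected to take to actually delete a node.
        MaxCloudProviderNodeDeletionTime time.Duration
    }

    // In cluster-autoscaler/main.go, createAutoscalingOptions() would copy
    // the parsed flag value alongside the existing fields:
    //
    //     MaxCloudProviderNodeDeletionTime: *maxCloudProviderNodeDeletionTime,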

Review thread on cluster-autoscaler/main.go (outdated, resolved)
@tallaxes (Contributor) commented

@gandhipr, please also update the user-facing section in the description. I am pretty sure it becomes part of the release notes, so it needs to document the new flag (even if it has a default value that preserves current behavior).

@gandhipr force-pushed the prachigandhi/make-cp-node-deletion-configurable branch from cc5b031 to dabc007 on Apr 2, 2024 00:52
linux-foundation-easycla bot commented Apr 2, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added area/provider/azure Issues or PRs related to azure provider cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 2, 2024
@gandhipr force-pushed the prachigandhi/make-cp-node-deletion-configurable branch from dabc007 to 4c0950e on Apr 2, 2024 00:54
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 2, 2024
Review thread on cluster-autoscaler/main.go (resolved)
@jackfrancis (Contributor) commented

/test ls

@k8s-ci-robot (Contributor) commented

@jackfrancis: The specified target(s) for /test were not found.
The following commands are available to trigger optional jobs:

  • /test pull-cluster-autoscaler-e2e-azure

In response to this:

/test ls

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jackfrancis (Contributor) commented

/test pull-cluster-autoscaler-e2e-azure

@jackfrancis (Contributor) left a comment


/lgtm

This seems sensible to me.

cc @MaciekPytel @gjtempleton @elmiko

I was going to request a helm chart update to include this new config option, but I noticed that there are a bunch of options not covered by the chart. I can do a follow-up PR to do all of that work inclusive of this new change.

Also, is there some additional documentation reinforcement we can do to ensure that the configurable surface area is fully documented? I see gaps there as well.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 1, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gandhipr, jackfrancis, rakechill, tallaxes
Once this PR has been reviewed and has the lgtm label, please assign towca for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jackfrancis (Contributor) commented

ping @MaciekPytel @gjtempleton @towca for approval as this touches the general cloudprovider code (not just Azure)

@gjtempleton (Member) left a comment


I feel like I'm maybe missing something here. The PR description describes allowing cloud providers to set this as necessary, but it's being exposed as a user-configurable flag (and the mention of following up with helm chart changes backs this up).

What's the motivation here? Users with nodes which may require extended times for safe node shutdowns?

Referenced flag definitions in cluster-autoscaler/main.go:

    frequentLoopsEnabled = flag.Bool("frequent-loops-enabled", false, "Whether clusterautoscaler triggers new iterations more frequently when it's needed")
    provisioningRequestsEnabled = flag.Bool("enable-provisioning-requests", false, "Whether the clusterautoscaler will be handling the ProvisioningRequest CRs.")
    maxCloudProviderNodeDeletionTime = flag.Duration("max-cloud-provider-node-deletion-time", 5*time.Minute, "Maximum time needed by cloud provider to delete a node.")
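As a side note, one plausible consumption site for the configured value, assuming it replaces a previously hard-coded timeout when deciding whether a deleting node is stuck; the helper below is illustrative only and not part of this PR:

    // Illustrative only (assumes "time" is imported): a node whose
    // cloud-provider deletion has been in flight longer than the configured
    // maximum is treated as stuck rather than in progress.
    func cloudProviderDeletionTimedOut(deleteStarted, now time.Time, maxDeletionTime time.Duration) bool {
        return now.Sub(deleteStarted) > maxDeletionTime
    }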
Member commented

If we're adding this in as a new flag, we should update the ridiculously long table of options in the FAQ as well.

(We should really be auto-generating that; I have raised an issue in the hope someone picks it up.)

Contributor commented

+1

@elmiko (Contributor) left a comment

this makes sense to me, but i agree with @gjtempleton's comments about the description and the updates to the FAQ.

Referenced flag definitions in cluster-autoscaler/main.go:

    frequentLoopsEnabled = flag.Bool("frequent-loops-enabled", false, "Whether clusterautoscaler triggers new iterations more frequently when it's needed")
    provisioningRequestsEnabled = flag.Bool("enable-provisioning-requests", false, "Whether the clusterautoscaler will be handling the ProvisioningRequest CRs.")
    maxCloudProviderNodeDeletionTime = flag.Duration("max-cloud-provider-node-deletion-time", 5*time.Minute, "Maximum time needed by cloud provider to delete a node.")
Contributor commented

+1

@jackfrancis (Contributor) commented

thx @gjtempleton @elmiko

What's the motivation here? Users with nodes which may require extended times for safe node shutdowns?

Right, I think the triggers that fire after the node-deletion timeout don't cover the rare edge cases where infra plus node deletion takes longer than 5 minutes, so making this configurable lets providers tweak it for those use cases.
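As a usage illustration (the 15m value is hypothetical): an operator whose infrastructure routinely needs longer than the 5-minute default could start the autoscaler with

    cluster-autoscaler --max-cloud-provider-node-deletion-time=15m

or, once the helm chart work lands, pass the flag through the chart (assuming it exposes a flag passthrough such as extraArgs).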

Sounds like the action items are:

  1. updated description
  2. helm chart integration
  3. docs update

I'm not sure whether @gandhipr is able to move this forward; if not, we'll move this to a new PR with the above feedback incorporated. Stay tuned!

@k8s-ci-robot (Contributor) commented

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 22, 2024
Labels: area/cluster-autoscaler, area/provider/azure, cncf-cla: yes, lgtm, needs-rebase, size/M

Projects: None yet

7 participants