✨feat(awsmachinepool): custom lifecyclehooks for machinepools #4875

Open
wants to merge 8 commits into
base: main
from

Conversation

sebltm

@sebltm sebltm commented Mar 18, 2024

What type of PR is this?
/kind feature

What this PR does / why we need it:

This PR extends the v1beta2 definitions of AWSMachinePool and AWSManagedMachinePool with a new field, lifecycleHooks, which is a list of entries with the following fields:

name: <the name of the lifecycle hook>
notificationTargetARN: <ARN of resource where to send the lifecycle event; optional>
roleARN: <ARN of role to be used when sending notifications; optional>
lifecycleTransition: <autoscaling:EC2_INSTANCE_LAUNCHING/EC2_INSTANCE_TERMINATING>
heartbeatTimeout: <duration of the heartbeat timeout; optional>
defaultResult: <CONTINUE/ABANDON; optional>
notificationMetadata: <some metadata to add to the notification; optional>
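
As a rough illustration of the field list above, here is a minimal Go sketch of what one list entry and its webhook-style validation could look like. The type name `LifecycleHook`, the function `Validate`, and the specific checks (non-empty name, positive timeout) are assumptions for illustration only; the actual type and webhook code in this PR may differ.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Transition values named in the PR description.
const (
	TransitionLaunching   = "autoscaling:EC2_INSTANCE_LAUNCHING"
	TransitionTerminating = "autoscaling:EC2_INSTANCE_TERMINATING"
)

// LifecycleHook is a hypothetical stand-in for one lifecycleHooks entry.
// Optional fields are modelled as pointers; the real API type may differ.
type LifecycleHook struct {
	Name                  string
	NotificationTargetARN *string        // optional
	RoleARN               *string        // optional
	LifecycleTransition   string         // one of the Transition* constants
	HeartbeatTimeout      *time.Duration // optional
	DefaultResult         *string        // optional: CONTINUE or ABANDON
	NotificationMetadata  *string        // optional
}

// Validate applies assumed checks similar to what a validating webhook
// might enforce. It is not the validation logic from the PR.
func Validate(h LifecycleHook) error {
	if h.Name == "" {
		return errors.New("name must not be empty")
	}
	switch h.LifecycleTransition {
	case TransitionLaunching, TransitionTerminating:
	default:
		return fmt.Errorf("unsupported lifecycleTransition %q", h.LifecycleTransition)
	}
	if h.DefaultResult != nil {
		switch *h.DefaultResult {
		case "CONTINUE", "ABANDON":
		default:
			return fmt.Errorf("unsupported defaultResult %q", *h.DefaultResult)
		}
	}
	if h.HeartbeatTimeout != nil && *h.HeartbeatTimeout <= 0 {
		return errors.New("heartbeatTimeout must be positive")
	}
	return nil
}

func main() {
	timeout := 5 * time.Minute
	result := "CONTINUE"
	hook := LifecycleHook{
		Name:                "drain-before-terminate", // hypothetical hook name
		LifecycleTransition: TransitionTerminating,
		HeartbeatTimeout:    &timeout,
		DefaultResult:       &result,
	}
	fmt.Println("valid hook error:", Validate(hook))

	hook.LifecycleTransition = "autoscaling:UNKNOWN"
	fmt.Println("invalid hook error:", Validate(hook))
}
```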

The matching webhooks are updated to validate lifecycle hooks as they are added to the Custom Resource.
The matching reconcilers are updated to reconcile those lifecycle hooks: a hook that is declared in the Custom Resource but missing in the cloud is created, and a hook that exists in the cloud but is not declared in the Custom Resource is removed.
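
The reconciliation rule described above amounts to a set difference keyed on the hook name. The following is a minimal sketch under that assumption; `Hook`, `Plan`, and `Diff` are hypothetical names, and the real reconciler talks to the AWS Auto Scaling API rather than working on in-memory slices.

```go
package main

import (
	"fmt"
	"sort"
)

// Hook is a hypothetical, trimmed-down lifecycle hook identified by name.
type Hook struct {
	Name string
}

// Plan lists the hook names that would be created or deleted.
type Plan struct {
	Create []string
	Delete []string
}

// Diff compares the hooks declared in the Custom Resource (desired) with the
// hooks found on the Auto Scaling group (actual).
func Diff(desired, actual []Hook) Plan {
	want := make(map[string]struct{}, len(desired))
	for _, h := range desired {
		want[h.Name] = struct{}{}
	}
	have := make(map[string]struct{}, len(actual))
	for _, h := range actual {
		have[h.Name] = struct{}{}
	}

	var p Plan
	for name := range want {
		if _, ok := have[name]; !ok {
			p.Create = append(p.Create, name) // declared but missing in the cloud
		}
	}
	for name := range have {
		if _, ok := want[name]; !ok {
			p.Delete = append(p.Delete, name) // in the cloud but not declared
		}
	}
	// Map iteration order is random, so sort for stable output.
	sort.Strings(p.Create)
	sort.Strings(p.Delete)
	return p
}

func main() {
	desired := []Hook{{Name: "launch-hook"}, {Name: "terminate-hook"}}
	actual := []Hook{{Name: "terminate-hook"}, {Name: "stale-hook"}}
	fmt.Printf("%+v\n", Diff(desired, actual))
}
```

This sketch only covers creation and removal, as in the description; detecting hooks whose settings have drifted would need a field-by-field comparison on top.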

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4020

AWS supports Lifecycle Hooks that run before or after certain actions on an ASG. For example, before scaling in (removing) a node, the ASG can publish an event to an SQS queue, which can then be consumed by the node-termination-handler to ensure the node's proper removal from Kubernetes (it cordons and drains the node, then waits for a period of time for applications to be removed before allowing the Auto Scaling group to terminate the instance).

This allows Kubernetes or other components to be aware of the node's lifecycle and take appropriate actions.

Special notes for your reviewer:

Checklist:

  • squashed commits
  • includes documentation
  • includes emojis
  • adds unit tests
  • adds or updates e2e tests

Release note:

Add support for custom Lifecycle Hooks in AWSMachinePools for external hooks (e.g., support for the aws-node-termination-handler with SQS)

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Mar 18, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority labels Mar 18, 2024
@k8s-ci-robot
Contributor

Welcome @sebltm!

It looks like this is your first PR to kubernetes-sigs/cluster-api-provider-aws 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api-provider-aws has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 18, 2024
@k8s-ci-robot
Contributor

Hi @sebltm. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sebltm sebltm changed the title feat(awsmachinepool): add the ability to add lifecycle hooks ✨feat(awsmachinepool): add the ability to add lifecycle hooks Mar 18, 2024
@sebltm sebltm marked this pull request as ready for review April 16, 2024 11:41
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 16, 2024
@sebltm sebltm changed the title ✨feat(awsmachinepool): add the ability to add lifecycle hooks ✨feat(awsmachinepool): custom lifecyclehooks for machinepools May 10, 2024
@AndiDog
Contributor

AndiDog commented Jul 3, 2024

I have two requests before getting to the review:

  • Neither the title nor the PR description describes the change. Lifecycle hooks and reacting to node shutdown is great – but what is this PR doing and achieving? Also, the release note entry in the PR template must be filled in.
  • You're moving lots of code. Please revert those changes as much as possible so the PR becomes reviewable. Refactoring and file renames can be done separately.

@AndiDog
Contributor

AndiDog commented Jul 3, 2024

/assign

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jul 4, 2024
@sebltm
Author

sebltm commented Jul 4, 2024

@AndiDog sorry, I hadn't cleaned up the PR; I didn't know if it would get any traction :)
I've cleaned up the PR and updated the description. Let me know if it looks good, and I'll write some docs and add release notes.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jul 13, 2024
@sebltm
Author

sebltm commented Jul 13, 2024

@AndiDog let me know if this looks good or if there's anything else I should take a look at :)

@AndiDog
Contributor

AndiDog commented Jul 15, 2024

The PR is definitely reviewable now. I don't have much experience with lifecycle hooks and aws-node-termination-handler (is that your actual use case?). Maybe MachinePool machines (#4527) give us a good way to detect node shutdown and have CAPI/CAPA take care of it? In other words: I'm not fully confident reviewing this with my current knowledge, but maybe others know more – please feel free to ping or discuss in Slack (#cluster-api-aws) so we can find someone to check this feature request.

@AndiDog
Contributor

AndiDog commented Jul 15, 2024

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 15, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andidog. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sebltm
Author

sebltm commented Sep 21, 2024

@AndiDog could you help me find someone to review this PR? I've posted a couple of times in Slack.
The point of the PR isn't just to cater to node termination scenarios, but to enable the flexibility to run actions tied to the node lifecycle (e.g., run a Lambda to create or clean up certain resources on node scale-in/scale-out, or when nodes enter or leave the warm pool).

Contributor

@AndiDog AndiDog left a comment

Within my company, we managed to talk today about lifecycle hooks and how they could help with several CAPA features, including aws-node-termination-handler, which you mentioned. So I'm feeling up to reviewing it in some detail.

exp/api/v1beta2/awsmachinepool_webhook.go (outdated, resolved)
exp/api/v1beta2/awsmachinepool_types.go (outdated, resolved)
@@ -298,6 +298,21 @@ func (r *AWSMachinePoolReconciler) reconcileNormal(ctx context.Context, machineP
return nil
}

lifecycleHookScope, err := scope.NewLifecycleHookScope(scope.LifecycleHookScopeParams{
Contributor

machinePoolScope should instead provide an interface function like func (*FooScope) LifecycleHooks() []AWSLifecycleHook – instead of introducing a new type that covers both EC2 and EKS based clusters in the same "class"
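
A minimal sketch of the interface-based shape suggested in this comment, assuming simplified stand-in types. `LifecycleHookProvider` is a hypothetical interface name, and the two scope structs here are placeholders, not the real CAPA scope types.

```go
package main

import "fmt"

// AWSLifecycleHook is a trimmed-down placeholder for the API type.
type AWSLifecycleHook struct {
	Name string
}

// LifecycleHookProvider is a hypothetical interface: each existing scope
// exposes its own hooks instead of a new shared scope type being introduced.
type LifecycleHookProvider interface {
	LifecycleHooks() []AWSLifecycleHook
}

// MachinePoolScope stands in for the EC2-based machine pool scope.
type MachinePoolScope struct {
	hooks []AWSLifecycleHook
}

func (s *MachinePoolScope) LifecycleHooks() []AWSLifecycleHook { return s.hooks }

// ManagedMachinePoolScope stands in for the EKS-based machine pool scope.
type ManagedMachinePoolScope struct {
	hooks []AWSLifecycleHook
}

func (s *ManagedMachinePoolScope) LifecycleHooks() []AWSLifecycleHook { return s.hooks }

// hookNames works with either scope because it depends only on the interface.
func hookNames(p LifecycleHookProvider) []string {
	names := make([]string, 0, len(p.LifecycleHooks()))
	for _, h := range p.LifecycleHooks() {
		names = append(names, h.Name)
	}
	return names
}

func main() {
	ec2Scope := &MachinePoolScope{hooks: []AWSLifecycleHook{{Name: "terminate-hook"}}}
	eksScope := &ManagedMachinePoolScope{hooks: []AWSLifecycleHook{{Name: "launch-hook"}}}
	fmt.Println(hookNames(ec2Scope), hookNames(eksScope))
}
```

With this shape, shared reconcile code can accept the interface, and each scope keeps its own EC2-specific or EKS-specific details.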

exp/controllers/awsmachinepool_controller.go (resolved)
@@ -163,13 +163,14 @@ func TestAWSMachinePoolReconciler(t *testing.T) {
recorder = record.NewFakeRecorder(2)
Contributor

Right now, there's no test covering the new functionality.

We need a non-mocked test, see

reconciler.reconcileServiceFactory = nil // use real implementation, but keep EC2 calls mocked (`ec2ServiceFactory`)

below, where the actual EC2 calls are tested. The test should cover different situations, such as: no hooks exist, all hooks exist, some hooks need an update, there's one hook too many which should be removed, ...

}
for _, hook := range hooks {
found := false
for _, definedHook := range scope.GetLifecycleHooks() {
Contributor

Suggested change
- for _, definedHook := range scope.GetLifecycleHooks() {
+ for _, definedHook := range lifecyleHooks {

}
}
if !found {
scope.Info("Deleting lifecycle hook", "hook", hook.Name)
Contributor

Suggested change
- scope.Info("Deleting lifecycle hook", "hook", hook.Name)
+ scope.Info("Deleting extraneous lifecycle hook", "hook", hook.Name)

}
}

conditions.MarkTrue(scope.GetMachinePool(), expinfrav1.LifecycleHookExistsCondition)
Contributor

Also LifecycleHookReadyCondition? It's never marked as true (or false).

var sSGs = []string{}
sSGs := []string{}
Contributor

Until here, there were quite a few minor, unneeded changes, plus some small improvements, both out of scope for the PR. If you can put the relevant ones into a separate PR and ping me, I'll get them in. Let's please avoid making this already-large PR review slower by this extra content.

Author

Sorry, these were auto-linted and I missed removing them from the PR. I'll clean those up.

@@ -50,6 +50,12 @@ type ASGInterface interface {
SuspendProcesses(name string, processes []string) error
ResumeProcesses(name string, processes []string) error
SubnetIDs(scope *scope.MachinePoolScope) ([]string, error)
GetLifecycleHooks(scope scope.LifecycleHookScope) ([]*expinfrav1.AWSLifecycleHook, error)
GetLifecycleHook(scope scope.LifecycleHookScope, hook *expinfrav1.AWSLifecycleHook) (*expinfrav1.AWSLifecycleHook, error)
Contributor

Suggested change
- GetLifecycleHook(scope scope.LifecycleHookScope, hook *expinfrav1.AWSLifecycleHook) (*expinfrav1.AWSLifecycleHook, error)
+ GetLifecycleHook(scope scope.LifecycleHookScope, hookName string) (*expinfrav1.AWSLifecycleHook, error)

(minor)

@k8s-ci-robot
Contributor

Adding label do-not-merge/contains-merge-commits because PR contains merge commits, which are not allowed in this repository.
Use git rebase to reapply your commits on top of the target branch. Detailed instructions for doing so can be found here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 14, 2024
@k8s-ci-robot
Contributor

PR needs rebase.

@k8s-ci-robot
Contributor

@sebltm: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cluster-api-provider-aws-verify | 2421ec3 | link | true | /test pull-cluster-api-provider-aws-verify |
| pull-cluster-api-provider-aws-build | 2421ec3 | link | true | /test pull-cluster-api-provider-aws-build |
| pull-cluster-api-provider-aws-build-docker | 2421ec3 | link | true | /test pull-cluster-api-provider-aws-build-docker |
| pull-cluster-api-provider-aws-apidiff-main | 2421ec3 | link | false | /test pull-cluster-api-provider-aws-apidiff-main |
| pull-cluster-api-provider-aws-test | 2421ec3 | link | true | /test pull-cluster-api-provider-aws-test |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

@sebltm
Author

sebltm commented Oct 15, 2024

@AndiDog sorry for merging into the PR, but let me know if this approach looks better to you and I’ll clean this up with a rebase

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/contains-merge-commits needs-priority needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Lifecycle Hooks for MachinePool/ASG
4 participants