Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add out of service taint remediation #86

Merged
merged 2 commits into from
Mar 24, 2023

Conversation

k-keiichi-rh
Copy link
Contributor

@k-keiichi-rh k-keiichi-rh commented Feb 2, 2023

This PR is adding a new remediation strategy based on kubernetes/enhancements#1116
In k8s 1.26, kubernetes/enhancements#1116 is a Beta feature. This feature will be graduated to GA in k8s 1.27.

This PR is implemented based on the discussion in #17.
The following is the new remediation strategy for the out-of-service taint:

  1. When detecting node failure by NHC, SNR sets NoExecute taint on the failed Node
  2. The failed node is rebooted by watchdog
  3. SNR waits for a fixed time to complete rebooting the failed node
  4. SNR sets the out-of-service taint
    => This taint expects that the node is in shutdown or power off state (not in the middle of restarting) according to the enhancement
    However, SNR works well with this taint as we discussed in [RFC] Support for out-of-service taint #17 (comment).
  5. If the failed node is rebooted and there is no terminating pods and volumeattachements, SNR deletes the out-of-service taint.
    => SNR waits for deleting workloads on the failed node with a timer. If the timer is expired, the reconcile method is triggered.
  6. The fencing complete is set
  7. After setting the fencing complete, it starts cleaning up and then the NoExecute taint is deleted and the failed node becomes schedulable again
  8. The remediation is completed.

ECOPROJECT-1228

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 2, 2023

Hi @k-keiichi-rh. Thanks for your PR.

I'm waiting for a medik8s member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k-keiichi-rh k-keiichi-rh force-pushed the kep2268 branch 2 times, most recently from 385e0b7 to bb5e5e4 Compare February 16, 2023 01:52
Copy link
Member

@slintes slintes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a hard time to follow the remdiation flow. I'd suggest to reorder methods, and maybe rename some of them in order to get rid of too many methods with similar names (remediate/remove/deleteResources)?

e.g.:

Reconcile

remediateWithResourceDeletion
removeNodeResourcesByResourceDeletionWrapper -> deleteResourcesWrapper (not nice that we need this wrapper, but I have no quick better idea)
removeNodeResourcesByResourceDeletion -> deleteResources (much easier and better to understand IMHO)

remediateByAddingOutOfServiceTaint -> remediateWithOutOfServiceTaint (align name with remediateWithResourceDeletion)
removeNodeResourcesByAddingOutOfServiceTaint -> useOutOfServiceTaint (much easier and better to understand IMHO)

remediateWithResourceRemoval

@k-keiichi-rh k-keiichi-rh changed the title [WIP] Add out of service taint remediation Add out of service taint remediation Mar 16, 2023
@slintes
Copy link
Member

slintes commented Mar 16, 2023

thanks a lot for renaming / reordering, much better readable now IMHO 🙂

/ok-to-test

@k-keiichi-rh
Copy link
Contributor Author

/retest-required

1 similar comment
@k-keiichi-rh
Copy link
Contributor Author

/retest-required

@k-keiichi-rh
Copy link
Contributor Author

/test 4.11-openshift-e2e

Copy link
Member

@slintes slintes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit inside, otherwise lgtm. Needs rebase though.

Signed-off-by: Keiichi Kii <k-keiichi@nec.com>
Signed-off-by: Keiichi Kii <kkii@nec.com>
Signed-off-by: Keiichi Kii <k-keiichi@nec.com>
Signed-off-by: Keiichi Kii <kkii@nec.com>
Copy link
Member

@slintes slintes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, nice work!
Leaving the final review to @mshitrit

@openshift-ci openshift-ci bot added the lgtm label Mar 22, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 22, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: k-keiichi-rh, slintes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@slintes
Copy link
Member

slintes commented Mar 22, 2023

couldn't get rid of the "change requested" state in another way 🤷🏼‍♂️

/lgtm cancel

@openshift-ci openshift-ci bot removed the lgtm label Mar 22, 2023
@k-keiichi-rh
Copy link
Contributor Author

/test 4.13-openshift-e2e

@mshitrit
Copy link
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Mar 24, 2023
@openshift-merge-robot openshift-merge-robot merged commit 4b6a30a into medik8s:main Mar 24, 2023
Shai1-Levi pushed a commit to Shai1-Levi/self-node-remediation that referenced this pull request Feb 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants