Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨Add an option to recover panic for controller reconcile #1627

Conversation

FillZpp
Copy link
Contributor

@FillZpp FillZpp commented Aug 9, 2021

Signed-off-by: FillZpp FillZpp.pub@gmail.com

Add an option to recover panic for controller reconcile. It ensures that any panics raised by controller reconciler are converted to normal errors.

fixes #797

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 9, 2021
@k8s-ci-robot
Copy link
Contributor

Hi @FillZpp. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 9, 2021
@FillZpp
Copy link
Contributor Author

FillZpp commented Aug 9, 2021

/cc @alvaroaleman @joelanford

@varshaprasad96
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 9, 2021
@@ -321,6 +322,15 @@ func (c *Controller) reconcileHandler(ctx context.Context, obj interface{}) {
}
}

func (c *Controller) doReconcile(ctx context.Context, req reconcile.Request) (_ reconcile.Result, err error) {
defer func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be optional and opt-in, similiar to what upstream does: https://github.com/kubernetes/apimachinery/blob/f916759cb6b8547418dc7708876ecab5c1961448/pkg/util/runtime/runtime.go#L29

Also, please add tests for this.

Copy link
Contributor Author

@FillZpp FillZpp Aug 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. I have added tests and changed all recover() to utilruntime.HandleCrash(), so that developers can set it via utilruntime.ReallyCrash.

@FillZpp FillZpp force-pushed the handle-crash-for-controller-and-webhook branch from 3cffa52 to 76cdea9 Compare August 16, 2021 06:29
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 16, 2021
@@ -95,7 +95,10 @@ type watchDescription struct {
}

// Reconcile implements reconcile.Reconciler.
func (c *Controller) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
func (c *Controller) Reconcile(ctx context.Context, req reconcile.Request) (_ reconcile.Result, err error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, what I meant by the other comment was:

  1. Please add an option to Controller to make this configurable (i.E. RecoverPanic bool)
  2. Make the handling here conditional (i.E. if c.RecoverPanic { defer recover() })
  3. Expose the option in pkg/controller.Options

The reason for favoring that over the current approach is that with the current approach we require ppl to read internal packages to understand how they can get panics recovered, which is not good UX.

Sorry for being unclear about this before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understand, I will fix it. So what do you think of webhook handler? Should its panic default to be recovered or another option field to configure it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should also have an option to set this up and by default panic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. It is hard to add the option for webhook. Maybe we can firstly add an option for controller reconcile.

@FillZpp FillZpp force-pushed the handle-crash-for-controller-and-webhook branch from 76cdea9 to 8e0707e Compare August 18, 2021 02:51
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 18, 2021
@FillZpp FillZpp changed the title ✨Recover panic for controller reconciler and webhook handler ✨Add an option to recover panic for controller reconcile Aug 18, 2021
for _, fn := range utilruntime.PanicHandlers {
fn(r)
}
err = fmt.Errorf("panic: %v [recovered]", r)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add the stack here, otherwise this is almost impossible to debug? From runtime/debug.Stack()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the utilruntime.PanicHandlers above can print the stack of panic more properly
https://github.com/kubernetes/apimachinery/blob/f916759cb6b8547418dc7708876ecab5c1961448/pkg/util/runtime/runtime.go#L61

})
_, _ = ctrl.Reconcile(ctx,
reconcile.Request{NamespacedName: types.NamespacedName{Namespace: "foo", Name: "bar"}})
By("this should not be called")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Is this needed? The defer should make the test fail if it doesn't work or not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have removed this.

Signed-off-by: FillZpp <FillZpp.pub@gmail.com>
@FillZpp FillZpp force-pushed the handle-crash-for-controller-and-webhook branch from 8e0707e to 9a2a972 Compare August 19, 2021 02:47
Copy link
Member

@alvaroaleman alvaroaleman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 19, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, FillZpp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 19, 2021
@k8s-ci-robot k8s-ci-robot merged commit 5de246b into kubernetes-sigs:master Aug 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reconciler panics should not crash the manager
4 participants