
More than one defaulting webhook controller won't work in the same binary with HA enabled #1979

Closed
yanweiguo opened this issue Jan 6, 2021 · 9 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@yanweiguo
Contributor

Expected Behavior

I can set up more than one defaulting webhook controller in the same binary with leader election (HA) enabled. The same applies to the configmaps and validation webhook controllers.

Actual Behavior

If more than one defaulting webhook controller is set up in the same binary with leader election (HA) enabled, only one of the controllers reconciles as expected.

Steps to Reproduce the Problem

  1. In the knative serving webhook main.go, set up another defaulting.NewAdmissionController for the Istio MutatingWebhookConfiguration
  2. Deploy knative serving and the Istio MutatingWebhookConfiguration.
  3. Check the Istio MutatingWebhookConfiguration and the serving MutatingWebhookConfiguration; you will find that only one of them gets reconciled.

Additional Info

This is because all defaulting webhook controllers share the same lease. The lease name is defaultingwebhook: https://github.com/knative/pkg/blob/master/webhook/resourcesemantics/defaulting/controller.go#L77. However, when setting up leader election, each controller uses a different resource lock ID: https://github.com/knative/pkg/blob/master/leaderelection/context.go#L114. So only one of them will actually get set up successfully.

In the Knative serving autoscaler, the KPA controller and the metrics controller share the same lease as well, but they use the same resource lock ID: https://github.com/knative/serving/blob/master/cmd/autoscaler/main.go#L269. The callbacks of both controllers are triggered when they start leading the lease.
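
To make the failure mode concrete, here is a minimal sketch using plain client-go leader election (knative.dev/pkg wires leader election differently, and the namespace, identities, and timings below are illustrative assumptions): two candidates with distinct identities campaign for the one shared lease, and only the winner's OnStartedLeading callback ever fires, so only that controller reconciles.

```go
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// contendForSharedLease mimics what each defaulting webhook controller ends up
// doing: campaigning for the SAME lease ("defaultingwebhook") under its OWN
// identity. Whichever candidate wins runs its callback; the other just waits.
func contendForSharedLease(ctx context.Context, client kubernetes.Interface, identity string) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Namespace: "knative-serving", // assumed namespace
			Name:      "defaultingwebhook",
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: identity},
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// In the real binary this is where the webhook controller
				// would start reconciling its MutatingWebhookConfiguration.
				log.Printf("%s is now leading and reconciling", identity)
			},
			OnStoppedLeading: func() { log.Printf("%s lost the lease", identity) },
		},
	})
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Two "controllers" with different lock identities but one lease name:
	// only one OnStartedLeading ever fires, matching the behavior described above.
	go contendForSharedLease(ctx, client, "serving-defaulting-webhook")
	contendForSharedLease(ctx, client, "istio-defaulting-webhook")
}
```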

@yanweiguo
Contributor Author

Maybe the simplest fix is to also use the name of the MutatingWebhookConfiguration, which is a parameter of defaulting.NewAdmissionController, for the names of the reconcile queue and the lease. Same for the validation and configmaps webhook controllers.
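
A minimal sketch of the idea (the helper name and the prefix are made up here, and the real constructor body is elided): derive the work-queue name, and therefore the lease name, from the webhook configuration name that defaulting.NewAdmissionController already receives.

```go
// queueNameFor is a hypothetical helper showing the shape of the fix: each
// defaulting webhook controller derives its work-queue (and hence lease) name
// from the MutatingWebhookConfiguration it reconciles, so two controllers in
// one binary no longer contend for the shared "defaultingwebhook" lease.
func queueNameFor(webhookConfigName string) string {
	// e.g. "webhook.serving.knative.dev" -> "defaulting.webhook.serving.knative.dev"
	return "defaulting." + webhookConfigName
}
```

Note that, as discussed further down in this thread, changing the lease name means the old lease object is leaked when an existing cluster upgrades.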

cc @mattmoor

@mattmoor
Member

mattmoor commented Jan 6, 2021

set up another defaulting.NewAdmissionController for Istio

Don't do that. 😜

In mink I'm running a defaulting webhook for all of serving, eventing, and tekton as a single endpoint (I used to run a lot more!). Pass all the types to a single defaulting webhook.

Is there a legitimate reason to be running multiple defaulting webhooks vs. a single with all the types...?
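
For illustration, a minimal sketch of that approach (hedged: it assumes the defaulting.NewAdmissionController signature from knative.dev/pkg and the sharedmain webhook entry point; the istioDeployment type, the abbreviated serving type list, and the names are illustrative, and the webhook.Options setup on the context is elided): one admission controller, one MutatingWebhookConfiguration, one callback path, one lease.

```go
package main

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"

	"knative.dev/pkg/apis"
	"knative.dev/pkg/configmap"
	"knative.dev/pkg/controller"
	"knative.dev/pkg/injection/sharedmain"
	"knative.dev/pkg/signals"
	"knative.dev/pkg/webhook/resourcesemantics"
	"knative.dev/pkg/webhook/resourcesemantics/defaulting"
	servingv1 "knative.dev/serving/pkg/apis/serving/v1"
)

// istioDeployment is a stand-in (an assumption, not net-istio's real type) for
// the Deployment-defaulting logic the second webhook would otherwise own; it
// only needs to satisfy resourcesemantics.GenericCRD.
type istioDeployment struct {
	appsv1.Deployment
}

func (d *istioDeployment) SetDefaults(ctx context.Context) {
	// Istio-specific Deployment defaults would be applied here.
}

func (d *istioDeployment) Validate(ctx context.Context) *apis.FieldError {
	return nil
}

// newDefaultingAdmissionController registers every defaultable type with one
// defaulting webhook: one MutatingWebhookConfiguration, one callback path, and
// therefore one lease. The webhook name and the type list are illustrative.
func newDefaultingAdmissionController(ctx context.Context, cmw configmap.Watcher) *controller.Impl {
	types := map[schema.GroupVersionKind]resourcesemantics.GenericCRD{
		// The serving types the webhook already handles (abbreviated here)...
		servingv1.SchemeGroupVersion.WithKind("Service"):  &servingv1.Service{},
		servingv1.SchemeGroupVersion.WithKind("Revision"): &servingv1.Revision{},
		// ...plus the type that would otherwise need its own webhook.
		appsv1.SchemeGroupVersion.WithKind("Deployment"): &istioDeployment{},
	}
	return defaulting.NewAdmissionController(ctx,
		"webhook.serving.knative.dev", // the single MutatingWebhookConfiguration
		"/defaulting",                 // the single callback path
		types,
		func(ctx context.Context) context.Context { return ctx },
		true, // disallow unknown fields
	)
}

func main() {
	// webhook.Options setup on the context is elided for brevity.
	sharedmain.WebhookMainWithContext(signals.NewContext(), "webhook",
		// certificate, validation, and configmaps controllers elided...
		newDefaultingAdmissionController,
	)
}
```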

@yanweiguo
Contributor Author

Is there a legitimate reason to be running multiple defaulting webhooks vs. a single with all the types...?

I don't see one. Currently in our downstream branch we just copy both the MutatingWebhookConfigurations from the serving repo and the net-istio repo. I guess it is for easy syncing with upstream.

If using a single defaulting webhook is the right way to go, I'll do that. Thanks.

@mattmoor
Member

mattmoor commented Jan 6, 2021

Yeah, that's the way things were intended to work: https://github.com/mattmoor/mink/blob/f6f4dc092ce14818642cacf87ce26ac892d5a392/cmd/webhook/defaulting.go#L57

LMK if you need any pointers / help.

@yanweiguo
Contributor Author

@Cynocracy
Contributor

Cynocracy commented Jan 6, 2021

Is 'the webhook may not be running on cluster but needs to negotiate with other clusters for ownership' a good reason?

In an advanced case where the user does not want their on-cluster control plane availability to be tied to their own data plane, I could see having multiple webhooks (some with different failure domains) that would all need to negotiate ~independently.

I could see this popping up if a binary was responsible for multiple clusters and for different sets of resources in each of them.

@yanweiguo
Contributor Author

I talked with @mattmoor in Slack; there are a few possible solutions:

  1. Make multiple defaulting webhooks work with HA. The defaulting webhook is not designed to support multiple instances. However, I'm not sure why (we didn't talk about it). @mattmoor could you explain?
  2. Add the filtering logic in https://github.com/knative-sandbox/net-istio/blob/master/pkg/defaults/deployment_defaults.go#L60 instead of depending on matchExpressions. But it will become the control plane for all deployments, including Knative deployments (see the sketch after this list).
  3. Implement a new webhook controller like the defaulting webhook and use it for those MutatingWebhookConfigurations that have the same matchExpressions.
  4. Change the defaulting webhooks to support multiple admissionReviewVersions. Is it doable?
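
To make option 2 a bit more concrete, a hedged sketch of filtering inside SetDefaults instead of relying on the webhook's matchExpressions (the type name and the label used for filtering are assumptions, not the actual net-istio code):

```go
// istioGatewayDeployment stands in (hypothetically) for the Deployment
// defaulting type in net-istio's pkg/defaults.
type istioGatewayDeployment struct {
	appsv1.Deployment // k8s.io/api/apps/v1
}

// SetDefaults scopes the mutation in code rather than via the webhook's
// matchExpressions: Deployments that don't carry the (assumed) Istio gateway
// label, including Knative's own Deployments, are returned untouched.
func (d *istioGatewayDeployment) SetDefaults(ctx context.Context) {
	if d.Labels["app"] != "istio-ingressgateway" { // assumed selector label
		return
	}
	// ...apply the Istio-specific Deployment defaults here...
}
```

The trade-off is the one called out in the list above: the MutatingWebhookConfiguration then has to match every Deployment, so this webhook sits in the admission path for all of them.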

@dprotaso
Member

dprotaso commented Mar 13, 2021

General Observations/Context

  1. Kubernetes lets you have multiple webhook entries in the validating/mutating resource
    i. Our webhook tooling only supports a single webhook entry per resource
    ii. There's no explanation of why i. is that way - it's been like this forever (initial serving commit)
  2. @markusthoemmes's recent PR Add smart handling of selectors in webhooks #1949 preserves webhook properties that we don't set (ie. object selectors)
  3. Two webhook controllers of the same class (ie. defaulting, validation, config, cert, psbinding) can't co-exist because they compete for the same lease.
    i. Controllers don't have this problem because they generally have unique names (component + work queue); see the sketch after this list
  4. We also have a bug where, after the CRDs are created but before the webhook computes/applies its rules, invalid resources can be created
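
As a rough illustration of point 3.i (a simplification based on the lease naming described earlier in this issue, not the actual knative.dev/pkg derivation): an ordinary controller's lease name is effectively its own lowercased work-queue name, while every defaulting webhook controller reuses one fixed queue name and therefore the one shared defaultingwebhook lease.

```go
import "strings"

// electionName is a simplified illustration, not the real knative.dev/pkg code:
// ordinary controllers pass their own unique work-queue names, so the derived
// lease names differ; every defaulting webhook controller passes the same fixed
// queue name, so they all resolve to the shared "defaultingwebhook" lease.
func electionName(workQueueName string) string {
	return strings.ToLower(workQueueName)
}
```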

Comments on the Solutions:

Make multiple defaulting webhooks work with HA. The defaulting webhook is not designed to support multiple instances. However, I'm not sure why (we didn't talk about it). @mattmoor could you explain?

I interpret this as there should only be a single registered webhook that uses apis.Defaultable to default a type.

Add the filtering logic in https://github.com/knative-sandbox/net-istio/blob/master/pkg/defaults/deployment_defaults.go#L60 instead of depending on matchExpressions. But it will become the control plane for all deployments, including Knative deployments.

I'm not a fan of this - is the goal to avoid deploying two webhooks?

Implement a new webhook controller like the defaulting webhook and use it for those MutatingWebhookConfigurations that have the same matchExpressions.

I talked about roughly this here: #1140 (comment)

This could work with multiple webhook registrations per resource. With Markus's change, certain selectors won't be reconciled away.

Though, to my third point, should we continue to have this inconsistency between composing webhooks and controllers? Right now we compose webhooks like Matt mentioned here: #1979 (comment), while controllers are composed by passing their constructors.

There are some differences though:

  1. webhooks require a callback path - right now it's unique to the webhook controller and duplicate path registrations fail
  2. conversion webhooks - there should only ever be one for a type
    i. I'm unsure if some of the other webhooks have such restrictions (ie. psbinding)
  3. if the webhook controllers map to N resources, we need the ability to leader-elect each of them, and thus would require unique lease names in the process

For 3) from: #1989 (comment)

It would work. FWIW, this is not only used for the queue name but also the lease name for leader election. By changing this, a lease object will be leaked after upgrading an existing cluster. Are we fine with this?

This is for #1979. @n3wscott @dprotaso are you fine with the solution to modify upstream code? Matt prefers solution 3 mentioned in #1979 (comment).

We've accidentally changed lease names before and leaked them - I think some thrashing of controllers would occur while the rollout happens. I haven't tested it.
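
For what it's worth, the leaked object itself is easy to clean up out of band; a small hedged sketch using plain client-go (the namespace and the old lease name are assumptions based on the discussion above):

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteStaleLease removes the lease object that would be left behind once the
// controllers start using a new lease name.
func deleteStaleLease(ctx context.Context, client kubernetes.Interface) error {
	return client.CoordinationV1().Leases("knative-serving").
		Delete(ctx, "defaultingwebhook", metav1.DeleteOptions{})
}
```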

Change the defaulting webhooks to support multiple admissionReviewVersions. Is it doable?

Not sure what you mean - these admission versions are just the webhook envelope between the k8s API server and the webhook. They're currently v1 & v1beta1.

@github-actions
Contributor

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions bot added the lifecycle/stale label on Jun 12, 2021