Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GCE] fix unsafe webhook vpa-webhook-config #6428

Closed
wants to merge 4 commits into from

Conversation

britdm
Copy link
Contributor

@britdm britdm commented Jan 8, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR cleans up the unsafe webhook vpa-webhook-config deployed on GCE. The namespaces kube-system and kube-node-lease are considered system-managed. See unsafe webhooks.

Which issue(s) this PR fixes:

Fixes cowboysysop/charts#578

Special notes for your reviewer:

To keep the namespaceSelector optionally empty, the variable definition can be wrapped in an if statement that verifies that the provider is gce.

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 8, 2024
Copy link

linux-foundation-easycla bot commented Jan 8, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Jan 8, 2024
@k8s-ci-robot
Copy link
Contributor

Welcome @britdm!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 8, 2024
@k8s-ci-robot k8s-ci-robot added area/vertical-pod-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jan 8, 2024
@kwiesmueller
Copy link
Member

/assign @sophieliu15

@sophieliu15
Copy link

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 24, 2024
@sophieliu15
Copy link

@kwiesmueller Could you take a look to decide whether to approve or not?

@@ -65,6 +65,15 @@ func selfRegistration(clientset *kubernetes.Clientset, caCert []byte, namespace,
sideEffects := admissionregistration.SideEffectClassNone
failurePolicy := admissionregistration.Ignore
RegisterClientConfig.CABundle = caCert
namespaceSelector := metav1.LabelSelector{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this really matter, considering it runs with failurePolicy: ignore?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even with failurePolicy: ignore, GKE will still create an error for you and display it in your cluster, so I guess it makes sense to ignore those namespaces. This is a breaking change, though, if for some reason people would auto-scale stuff in kube-system, it would stop working for them. Therefore, I think we should only do this conditionally and hide it behind some configuration parameter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah but if GKE raises error when it shouldn't that is a problem with GKE, not with the components that happen to trigger that behavior. I think we should have better reasoning than "Cloudprovider UI displays an error".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with both of you.
It's breaking and should be flag gated.

Not sure if it's about the UI, but agree that if it's really a cloud provider specific UI thing that shouldn't force us to fix it. I thought webhooks against system namespaces are considered "dangerous" either way.
Honestly I'm surprised to see this selfRegistration thing, was not aware that it's something people do. @alvaroaleman do you know if that's common?

I think a flag that lets you specify namespaces that should be ignored could be one option.
I'd like to make the default safer, prevent people from accidentally locking their clusters, but as the failurePolicy is ignore. that should be fine, right?

Copy link
Member

@alvaroaleman alvaroaleman Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I'm surprised to see this selfRegistration thing, was not aware that it's something people do. @alvaroaleman do you know if that's common?

Its uncommon for projects to provide this IME. If it is provided, it is probably at least semi-common to use, because webhooks require you to generate certs and not everyone has some other mechanism in their cluster available to do that or knows how to do it through helm and friends.

Copy link
Contributor

@voelzmo voelzmo Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alvaroaleman

I think we should have better reasoning than "Cloudprovider UI displays an error".

Absolutely agreed! I guess in this specific case, it is hard to say "the UI is just wrong, so if you want to get rid of this error, go talk to your cloud provider". As @kwiesmueller pointed out, even with failurePolicy: ignore, user deployed webhooks in system namespaces should be avoided. In our internal k8s platform we do a very similar thing and flag these things for users.

@kwiesmueller

I think a flag that lets you specify namespaces that should be ignored could be one option.

This is a double-edged feature. Users can fix validation errors like mentioned above themselves. However, at the same time it makes it easier/possible to configure something that leads to an endless eviction loop: If you only configure the webhook to ignore a certain namespace but happen to have objects under VPA control in those namespaces, Updater will permanently evict your workload.
Ideally, we wouldn't make it easy or even possible for people to create these situations and shoot themselves in the foot. It reminds me a bit of the discussions we had around making the labelSelector configurable in #5660.

I imagine the best compromise here is probable a namespace denylist, similar to how the VPA components currently have the concept of an allowlist for namespaces they work in (parameter vpa-object-namespace). The configuration parameter is the the same on all 3 components and you could argue that a user setting this parameter on one component should take care of setting it on the other two components with the same values. Denylist and allowlist should be made exclusive to each other, and hopefully we found a good solution for everyone. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

failurePolicy: ignore, user deployed webhooks in system namespaces should be avoided

Where did they point that out? And could you elaborate on the rationale behind that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a quote from the GKE docs

If a webhook is intercepting any resources in system-managed namespaces, or certain types of resources, GKE considers this unsafe and recommends that you update the webhooks to avoid intercepting these resources.

The same is mentioned in the official k8s docs, but maybe with a bit softer wording

If your admission webhooks don't intend to modify the behavior of the Kubernetes control plane, exclude the kube-system namespace from being intercepted using a namespaceSelector.

I guess that failurePolicy: ignore will help in most cases, but there are still scenarios where a failing webhook can stop k8s from working properly, so that's probably why even with this failurePolicy webhooks are still flagged as problematic. However, I see that you were involved in the discussion around this, so you probably have a better understanding of this than I do.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the webhook returning a valid http response that is not parseable or such would be an issue. The other issue I am aware of as it caused an outage for us is if you validate anything related to leader election, because the client-side timeouts there are way below the 30 seconds, so all of these requests can fail before the webhook times out, which can brick a cluster.

However, we are only validating create pod requests, i think that should be fine.

@@ -65,6 +65,15 @@ func selfRegistration(clientset *kubernetes.Clientset, caCert []byte, namespace,
sideEffects := admissionregistration.SideEffectClassNone
failurePolicy := admissionregistration.Ignore
RegisterClientConfig.CABundle = caCert
namespaceSelector := metav1.LabelSelector{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with both of you.
It's breaking and should be flag gated.

Not sure if it's about the UI, but agree that if it's really a cloud provider specific UI thing that shouldn't force us to fix it. I thought webhooks against system namespaces are considered "dangerous" either way.
Honestly I'm surprised to see this selfRegistration thing, was not aware that it's something people do. @alvaroaleman do you know if that's common?

I think a flag that lets you specify namespaces that should be ignored could be one option.
I'd like to make the default safer, prevent people from accidentally locking their clusters, but as the failurePolicy is ignore. that should be fine, right?

@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 20, 2024
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 20, 2024
@britdm
Copy link
Contributor Author

britdm commented Mar 20, 2024

I have added the additional flag webhookIgnoredNamespaces that could be used to grab a list of comma delimited namespaces passed using --webhook-ignore-namespaces.

For this case, passing --webhook-ignore-namespaces kube-system,kube-node-lease should add those two namespaces to the namespaceSelector. The flag will allow the default empty string and also the option to specify namespaces to ignore.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 20, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alvaroaleman, britdm
Once this PR has been reviewed and has the lgtm label, please assign jbartosik for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@voelzmo
Copy link
Contributor

voelzmo commented Mar 21, 2024

Hey @britdm thanks for the new change! I think, however, we need to have a paramater on all three VPA components, otherwise it can easily happen that users end up in an endless eviction loop: If you have VPA objects in kube-system, recommender will provide a recommendation, updater will evict and the admission-controller ignores this namespace and won't adapt the requests.

I tried to sketch this out in an earlier, longer comment

I imagine the best compromise here is probable a namespace denylist, similar to how the VPA components currently have the concept of an allowlist for namespaces they work in (parameter vpa-object-namespace). The configuration parameter is the the same on all 3 components and you could argue that a user setting this parameter on one component should take care of setting it on the other two components with the same values. Denylist and allowlist should be made exclusive to each other.

So I think the missing changes here are

  • rename parameter to something like ignored-vpa-object-namespaces, such that it doesn't reference the webhook and has some resemblence to the already existing vpa-object-namespace
  • throw an error, if both parameters are provided. It should only be possible to provide one of ignored-vpa-object-namespaces and vpa-object-namespace
  • move documentation for this parameter to the top README
  • Bonus: add a test that makes sure this flag works, similar to what we already have for recommenders with different names

@adrianmoisey
Copy link
Member

Hey @britdm, are you planning to continue this PR?
If not, do you mind if I try work on it?

@britdm
Copy link
Contributor Author

britdm commented Apr 25, 2024

Hey @britdm, are you planning to continue this PR? If not, do you mind if I try work on it?

Sure, I have not yet had time to clean up my local copy with changes for this PR, and would appreciate the help to move it forward. One caveat I found with updating the other VPA components is that the conditional statement used to check if both variables are used in the admission controller would need slight modifications to work for the other components.

@adrianmoisey
Copy link
Member

I've continued this work over here: #6788

@voelzmo
Copy link
Contributor

voelzmo commented May 15, 2024

/close
In favor of #6788

@k8s-ci-robot
Copy link
Contributor

@voelzmo: Closed this PR.

In response to this:

/close
In favor of #6788

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/vertical-pod-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Exclude namespaces kube-system and kube-node-lease in mutatingwebhookconfiguration vpa-webhook-config
7 participants