KEP-4559 Redesigning Kubelet probes #4558
Conversation
> Regardless of what else we do, we will deprecate the `TCPSocketAction.Host` and `HTTPGetAction.Host` fields, and allow them to be blocked by Pod Security admission.
I wonder why this is not the case today, cc @tallclair @liggitt
related kubernetes/kubernetes#81129
Yeah, IMO we should just make this change. I think it is stretching the scope of what PSA covers a bit, but it's not that far.
> ### API Changes to Fix the SSRF
>
> Regardless of what else we do, we will deprecate the
can we really deprecate them? I know users that depend on this functionality
Do you have examples?
If we know what people are using it for, we can try to come up with an API that allows the "sane" use cases without allowing it to be used for an attack.
https://docs.tilt.dev/api.html#api.http_get_action
https://github.com/tilt-dev/tilt/blob/3359b56077d1bff23de1d179cb99431eefdd3c35/internal/tiltfile/api/__init__.py#L1385
https://stackoverflow.com/questions/63082309/liveness-probe-of-one-pod-via-another
https://stackoverflow.com/questions/75811572/executing-readiness-httpget-readiness-probe-failed-no-such-host
I remember some user in kubernetes/kubernetes explaining that they used the probe to query an external HTTP service, but I cannot find it...
> https://docs.tilt.dev/api.html#api.http_get_action https://github.com/tilt-dev/tilt/blob/3359b56077d1bff23de1d179cb99431eefdd3c35/internal/tiltfile/api/__init__.py#L1385
That just shows that they're re-exporting the (full) API, not that anyone is actually using it.
> https://stackoverflow.com/questions/63082309/liveness-probe-of-one-pod-via-another
"pod B depends on pod A working so I want pod B's liveness probe to check pod A (via a service externalIP)".
That's a little goofy... Clearly it would be architecturally nicer (if more complicated) to make pod B implement a proper health check itself.
Also, this health check, as written, depends on the existence of the NetworkPolicy hole (since it eventually results in a kubelet-to-pod-A connection) so even if we allowed it, it wouldn't work with AdminNetworkPolicy...
That's a little more plausible since they're using a service IP that resolves to the pod they're probing, rather than an entirely unrelated pod. This still requires the NetworkPolicy hole though, and this goes back to the "are probes about server readiness or network connectivity?" question
> I remember some user in kubernetes/kubernetes explaining they used the probe to query an external http service but I can not find it ...
That seems even more dubious than the someone-else's-externalIP example.
At any rate, all of these could be implemented via exec probes if it was no longer possible to do them via http probes.
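To make that concrete, here is roughly what one of those checks could look like as an exec probe instead of an `httpGet` probe with a `.host` field. This is only a sketch: the service URL, timing values, and the assumption that `curl` exists in the container image are all hypothetical.

```yaml
# Hypothetical replacement for an httpGet probe that used .host:
# run the cross-service check from inside the pod via an exec probe.
livenessProbe:
  exec:
    command:
    - sh
    - -c
    - curl --fail --max-time 2 http://other-service.example.svc/healthz
  periodSeconds: 10
  timeoutSeconds: 3
```

Note that this runs in the pod's own network namespace, so (unlike a kubelet-side probe) it is subject to the pod's NetworkPolicy, which is arguably the more correct semantic anyway.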
Just to be clear, I'm in favor of this. I also opened an issue some time ago to move the TCP and HTTP probes inside the pods, and I think these are different ways of doing that while keeping the functionality... it's just that, realistically speaking, the API and the semantics are v1, and changing them would be a backwards-incompatible change :(
For this to work, k8s will need knowledge of the network namespace. It sounds like we want to execute into the netns and run the health checks?
and there are other tech details suggested in the doc, such as reusing the existing CRI exec RPC
> For this to work, k8s will need knowledge of the network namespace. It sounds like we want to execute into the netns and run the health checks?
There are multiple ideas suggested. And as noted in a comment below, although this KEP was just filed this week, I actually wrote all of it several months ago, so the existing text doesn't consider KNI at all.
> 2. Kubelet could perform probes inside pods without needing any changes to the runtime, by attaching an ephemeral container to each pod, containing some `kubelet-probe` binary that would run the probes and report the results back to kubelet (by some means). Most
I think that this "by some means" is the most important detail here ...
> deprecate the existing probe types and create new ones with the new semantics.
>
> ## Design Details
I assume that this is just socializing the problem and opening the discussion, without knowing the design details will be hard to evaluate this KEP
Yes. In particular, we need input from SIG Node on which of the ideas here do and don't make sense for kubelet and CRI.
yeah that's on me, let me start opening up some discussions here with sig-node
You need to open an issue to be the "home" for the KEP, and then the number of that issue becomes the KEP number (replacing the "NNNN" in the current branch)
> report back as part of the pod status whether the pod is live, ready, etc, according to its probes. This might allow for better optimization of probes on the CRI side.
Since I originally wrote this, we have started talking about KNI, and the ideas here that mention CRI could alternatively be done in terms of KNI. (Assuming that KNI is going to happen in a useful time frame, this would actually make a lot of sense, since it's another case of putting more of the details of the pod network into one place rather than having them be scattered across multiple places.)
It was in the original NetworkPolicy design spec but AFAIK never made it into any official documentation.
Ah I see, thanks Dan; opened #4559, will link to this KEP
/retitle KEP-4559 Redesigning Kubelet probes
Great overview. Might be good to prioritize the problem statements. E.g. Maybe SSRF is P0, Performance/timeouts is P1, Dual-stack is P2, NetPol is P4
> <<[/UNRESOLVED]>>
>
> - Continue to backward-compatibly support all existing probes by
There's almost certainly no path to a default behavior that breaks compat. Whatever we do here HAS TO be additive and optional, or the breaking change needs to be controlled by admins.
This use case with NodePort services is interesting, kubernetes/kubernetes#89898 (comment), as it has a lot of network implications
> - Continue to backward-compatibly support all existing probes by default (at least for a while).
>
> - Deprecate the [`TCPSocketAction.Host`] and [`HTTPGetAction.Host`]
This could be a KEP of its own, but I see the use in an umbrella KEP to sort out all the issues
@tallclair also agreed that we should add Pod Security Admission support for blocking these. Does this even need a KEP or can we just call it "fixing kubernetes/kubernetes#99425"?
I would do this with a gate, and that usually means a KEP, to keep release-team and docs and comms in the loop.
Should I open a new KEP issue to track this? Or do we first need to provide a way for users to cover the "External services" use case, i.e. finish this KEP before opening the door for the deprecation KEP?
kubernetes/kubernetes#125271 is the PR I am starting out with, which will mark these fields as part of gated policy.
So maybe the whole deprecation needs a new KEP after all, because I think this KEP might take more time to land. @thockin / @danwinship: correct me if I'm wrong about the paperwork.
> any probing pod would then be able to connect to any pod in the cluster.
>
> - Alternatively, the "probing pod" could be a static pod with a
If every pod had an additional node- or link-local address, a whole lot of things would be easier.
Yeah, we've been talking about "pods attached only to a secondary network, not the cluster default pod network" in OpenShift, and one tricky bit is: how does DNS work for them? The current thinking is that pods that "aren't attached to the cluster default pod network" will still actually be attached to the cluster default pod network, just with routing/policy set up to only allow it to be used in very limited ways (like DNS).
But having a special "infra network" attached to every pod would be another way of doing that same thing...
The trick is finding a CIDR range to use; we can't use 169.254.0.0/16, because many existing clusters expect pods to be able to connect to 169.254.169.254 outside the node. IPv6 link-local addresses would work, but we can't make the infra network be IPv6-only, because then we couldn't do health checks for IPv4-only pods over it.
It's still an interesting idea...
Also, link-local addresses are commonly used by different things to provide this sort of mechanism; if we follow that path, we should make clear what range is reserved for kubernetes probes...
IPv6 has the concept of address scopes (https://datatracker.ietf.org/doc/html/rfc4007#section-6) that solves this problem, as we could create a probe address scope range... if we could just ensure IPv6 is enabled on all clusters, adding this IPv6 capability to pods would certainly solve this problem with very low effort
> appropriate to the runtime. If this is not simply a wrapper around exec probes, then this implies that CRI will need to understand each type of probe supported by Kubernetes (i.e., it will need to support every feature of `HTTPGetAction`, `TCPSocketAction` and
I'm not super worried about this - probes have had about 1 major change in 10 years, gRPC.
> a UDP probe) would require adding support to both `Pod`/kubelet and to CRI.
>
> - If CRI is going to need to understand that much of the
I'm not sure I buy this - Adding breadth to the API surface is one thing, but requiring the CRI impl to keep state and run periodic work seems like maybe too far?
> Moving pod probes into the pod network namespace will run into the standard difficulties with using kernel namespaces from golang code.
>
> The `setns()` syscall in Linux changes the namespace of a thread; this
I think that, if we do this, it should be CRI that takes responsibility. Calling `setns` from kubelet seems like a layering violation.
Sure, but containerd and cri-o are both written in Go too, so we're just handing the footguns to them.
(This section is kind of vague but the point is mostly "don't forget about this when deciding which solution is best".)
Compartmentalizing it is still useful. Kubelet should not know about netns directly; CRIs do. They can do things like open a UNIX socket and fork/exec() a pool of helper processes which can jump between netns'es to do probes. It's a little multi-process system of its own, but it's not THAT complicated :)
> what these "sufficiently weird things" would be. One example would be having the pod recognize the node IP as a source IP so that it can respond differently to kubelet probes than it would to a "real" request.) It would be safer (though more annoying) to instead
"Annoying for us" is better than "broken for users". :(
So should we say we definitely want to go with:
- Add new probe types (with new semantics and no `.Host` support)
- Deprecate the old probe types
- Allow using PSA to block Pods that use the old probe types

rather than:
- Change the semantics of the existing probe types
- Deprecate the `.Host` fields of the existing probe types
- Allow using PSA to block Pods that use probes with non-empty `.Host` fields

?
(Maybe the first case should also say "block Pods that use the old probes with non-empty `.Host` fields" rather than "block Pods that use the old probes"? I guess I was sort of abusing PSA as a way to help admins force users into the new probe semantics, but that's not really security, so maybe it's wrong to use PSA for that. I guess the admin could write their own admission controller if they wanted to fully deprecate the old probes.)
> Allow using PSA to block Pods that use the old probe types

PSA, or simply ValidatingAdmissionPolicy - it should solve pretty much all such "don't use this field" use-cases.

If we are going to do probes from the in-pod network, then I think yes. Depending on timeline, we might still want to deprecate `.Host` and recommend VAP for admins who want to disallow `.Host` right away.
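As a sketch of what that VAP route might look like for admins. The policy name is made up, and for brevity the CEL only guards `livenessProbe` on regular containers; readiness/startup probes and initContainers would need the same pattern.

```yaml
# Hypothetical admin-side policy rejecting probe .host fields.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: forbid-probe-host
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["pods"]
  validations:
  - expression: >-
      object.spec.containers.all(c,
        !has(c.livenessProbe) ||
        ((!has(c.livenessProbe.httpGet) || !has(c.livenessProbe.httpGet.host)) &&
         (!has(c.livenessProbe.tcpSocket) || !has(c.livenessProbe.tcpSocket.host))))
    message: "probe .host fields are not allowed"
```

A matching ValidatingAdmissionPolicyBinding would still be needed to put the policy into effect.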
I have started kubernetes/kubernetes#125271 to use PSA for the "don't use this field" part.
We should also clarify and document the behavior in HTTP wrt redirects.
Following redirects in general reopens the SSRF problem, so I guess the options are:
I think we follow local redirects today but not remote. I recall this being a CVE or something a long time ago, but I forget the exact resolution. We should document the behavior better.
> - **Exec Probe Performance (?)**: [Exec probes are much, much slower than network probes]. Apparently this slowness is a somewhat inevitable
I was unaware of this slowness problem.
Is there more context for me to go read up on it? An issue perhaps?
Oh, I see below that "Figure out details of exec probe performance problem" is listed
The brackets in the text here indicate that that's a link, with the actual URL done "footnote-style" a bit further down in the text. In this case, the link is to a blog post, "Exec Probes: A Story about Experiment and Relevant Problems". I don't have much more context than that, other than my guesses in this paragraph. (Though also, I don't think it was ever confirmed that this is a problem with exec probes in all CRI runtimes, or if it's basically just a containerd bug.)
I can add a note to that effect: we have not had any specific issues opened in the past around this, but are listing the problem here based on the post.
We need someone from Node to evaluate this, or we risk operating on biased information. @samuelkarp can you help us better understand this performance problem with exec probes? Is it a general and unsolvable problem?
Exec with a new task is not as expensive as running a container, but still, it's not an inexpensive operation either. And the usual problem is that there is a plethora of exec probes that are not cycle-optimized, because of the default timeout and repeat-duration values. And when the system gets stressed, the probe failures just make it worse.
We could probably figure out more optimizations... sure, that's worth trying even if just for non-probe exec. It would be interesting to see if there is a difference between cri-o/containerd, or any other improvements since the last time we checked.
Though I held out hope for a different probe API that did not lean on/overload exec and the streaming API.
> already more constrained than the TCP and HTTP probes anyway (and I don't have a good suggestion for renaming them).
>
> ### Kubelet/Runtime Changes to Fix the NetworkPolicy Hole
> 2. Kubelet could perform probes inside pods without needing any changes to the runtime, by attaching an ephemeral container to each pod, containing some `kubelet-probe` binary that would run the probes and report the results back to kubelet (by some means). Most
The "by some means" here is the key
> container to the pod at construction time, via CRI calls, without exposing the container to the kubernetes API, like it does with the pause / sandbox / infrastructure container. (In fact, the probes could even be done as part of the existing pause / sandbox /
This is going to be removed in containerd 2.0; the pause sandbox is an implementation detail and we cannot depend on it
> than being a new ephemeral container attached each time kubelet needed to perform a probe. Kubelet could perhaps attach such a container to the pod at construction time, via CRI calls, without exposing the container to the kubernetes API, like it does with the
There can also be a mismatch in resource consumption, and the whole lifecycle of the pod could be compromised: if something happens to the probe container, do we block the pod? How can a user debug something that is hidden? It has a lot of implications we need to consider carefully.
> could even be done as part of the existing pause / sandbox / infrastructure container.)
>
> 3. If we can improve the performance of exec probes, then kubelet
This topic needs some assessment from Node people https://github.com/kubernetes/enhancements/pull/4558/files#r1620722068
> - Figure out details of exec probe performance problem.
>
> - Figure out if network probe performance is still a problem.
> - Allow dual-stack / wrong-single-stack probing, either based on information in the probe definition (e.g. `ipFamily: IPv6` or `ipFamilyPolicy: RequireDualStack`) or by always doing "[Happy
I would not say that Happy Eyeballs is a successful protocol; it also had to go through a new version (https://datatracker.ietf.org/doc/html/rfc8305), and it is very unreliable and causes a lot of issues in production environments.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: tssurya.
The recent push addresses comments till
Running up against the clock - how much of this is slated for 1.31?
I would say only the PSA bits, where we want to restrict users, can land for 1.31. I have added an item to ask sig-auth whether that needs a separate KEP.
/lifecycle stale
/remove-lifecycle stale
Redesigning Kubelet Probes
PS: I am running with what @danwinship did (danwinship@72333728bb), so all credit to him. I am willing to drive this KEP and learn stuff along the way (that is my only role here).
So the main idea here is that kubelet probes to pods are always expected to work in Kubernetes. However, when we add NetworkPolicies (v1), the implicit deny kicks in, which means the probes won't work. We went with "NetworkPolicy doesn't apply to kubelet probes" - this point was originally covered in the NetworkPolicy design proposal (https://github.com/kubernetes/design-proposals-archive/blob/main/network/network-policy.md#behavior), but it never made its way into official documentation and remains something the ecosystem takes as the standard. So all NetworkPolicy implementations currently add allow rules for the probes to work, creating that hole. The SIG Network policy-api group would like to tackle this problem by working with SIG Node.
This is also my first PR in the enhancements repo; please be patient as I try to get everything done correctly.