KEP-3015: Node-level topology #3293
Thanks for writing this up @danwinship! Added some initial thoughts after a quick run through.
This KEP adds a new topology hint, to tell kube-proxy that a Service is expected to have an endpoint on every node most of the time, and so
What kind of hint is this? Do we need a per-endpoint hint or can kube-proxy just derive that from the Service or some other higher level concept?
I was vague here because it's not fully figured out below yet.
I talk below about kube-controller-manager deriving it, but not kube-proxy, because kube-proxy doesn't watch Pods, so it has less information. I guess kube-proxy could still validate "has endpoints on every node / almost every node", but it can't validate "was created by a DaemonSet". (But maybe we don't care about that? The criteria for deciding when to use this are also undecided.)
I guess one nice thing about having the EndpointSlice controller write to an actual EndpointSlice field is that it makes it very clear when the feature is being used, whereas if kube-proxy decided by itself, you wouldn't be able to tell from looking at the Service/Endpoints.
I agree that a central controller in kube-controller-manager is the best place to determine if this feature should be enabled. I'm just hoping we can find a way to indicate that as centrally as possible for a Service so we're not updating every individual endpoint whenever this changes.
I remember with zone-based hints, one of the biggest concerns was to avoid flapping between an enabled and disabled state (requiring updates to all relevant EPS). I think whatever we do here should have some kind of detail written about how we'd avoid that.
yeah, some hysteresis, like: if < 90% of nodes have an endpoint, disable; if >= 95%, enable; if >= 90% && < 95%, leave it as it is set.
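A minimal sketch of that hysteresis, assuming a hypothetical helper in the EndpointSlice controller (the thresholds and the function name are illustrative, not part of the proposal):

```go
// Sketch of the hysteresis described above: enable only at >= 95% node
// coverage, disable only below 90%, and otherwise keep whatever was
// already decided.
func nodeLevelTopologyEnabled(nodesWithEndpoint, totalNodes int, currentlyEnabled bool) bool {
	if totalNodes == 0 {
		return false
	}
	coverage := float64(nodesWithEndpoint) / float64(totalNodes)
	switch {
	case coverage >= 0.95:
		return true
	case coverage < 0.90:
		return false
	default:
		// Between 90% and 95%: leave the setting as it is.
		return currentlyEnabled
	}
}
```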
### Non-Goals

N/A
Not written here, but my interpretation of this KEP is that we're not trying to prevent a local endpoint from being overloaded. This is essentially just an extension of ExternalTrafficPolicy=Local, except when a local endpoint is lacking, there's a fallback to Topology Hints or default routing logic. I think it's important to call out somewhere that it's very possible for individual endpoints to be overloaded with this approach and that this KEP does not aim to solve that problem.
This is essentially just an extension of ExternalTrafficPolicy=Local
Well, no. The discussion in #3016 seemed to me to be saying that local traffic policy implies a semantic distinction between local and remote delivery, and the fact that it is more efficient is really just a side effect. Whereas this is solely about efficiency.
Although that's true, I think some/many users choose to use ExternalTrafficPolicy for the efficiency; the implementation details are secondary. My guess is that the usage of this feature would end up being significantly higher than ExternalTrafficPolicy=local.
As currently written, this can't serve as a replacement for `externalTrafficPolicy: Local`, because it doesn't coordinate with the LoadBalancer parts, so you'd have no way to force incoming connections to end up on the right node.
We could adjust things so LoadBalancer services do try to take advantage of node-level topology, but if we were going to do that, it seems like really we should just go back to `externalTrafficPolicy: PreferLocal`.
I guess you could make the argument that for external traffic policy, this is a semantic thing, not just a topology thing; you don't just want to change which endpoints get picked, you also want to change how the cloud load balancers work.
OTOH, for internal traffic, the desired behavior for the primary use case (DNS) really is just "better topology". OTOOH, I am still totally unconvinced that there is any real use case for `internalTrafficPolicy: Local` as opposed to node-level topology.
So maybe the real answer is:
- add `externalTrafficPolicy: PreferLocal`
- add node-level topology for services like DNS
- drop `internalTrafficPolicy`
(although also, it was argued that ProxyTerminatingEndpoints mostly gets rid of the need for `externalTrafficPolicy: PreferLocal` anyway, by fixing the awkward bits of `externalTrafficPolicy: Local` so you can use it just as an optimization if you want.)
As currently written, this can't serve as a replacement for `externalTrafficPolicy: Local`

(...which is obviously a bug since one of the User Stories talks about doing that)
(This would imply that a service could not use both node-level topology and zone-level topology. Another possibility would be to have a new annotation for node-level topology.)
Based on my comment above, I think the best approach is to integrate this more closely with topology hints because distributions may need to change depending on how endpoints are distributed. Having separate paths in EndpointSlice controller for either approach could get rather complicated.
- The pods that make up the service endpoints all have an `OwnerReference` pointing to the same `DaemonSet`.
This seems pretty restrictive but could be a reasonable starting point. May have messy transition states if a DaemonSet is replaced.
It's not really that messy; when the first few nodes lose their endpoints for the old service, they would automatically fall back to using Cluster semantics. After enough nodes lost endpoints, the EndpointSlice controller would turn off node-level topology for the service and all the remaining nodes would flip to Cluster semantics. There would be less total churn than there would be without node-level topology, since the first few endpoint deletions would be ignored by most of the nodes.
Why do we need this criterion at all?
See this comment in a collapsed thread. (TL;DR: "service that is intentionally deployed to every node" is semantically different from "service which coincidentally happens to be deployed to every node" in relevant ways.)
Here's something I have seen:
Cluster has 3 node-pools: small, medium, large, with different sized machines. They run 3 different daemonsets with node selectors for each pool, and different resource requests on each.
"Prefer local" would be ideal but fails this criterion.
(force-pushed from 0594c9e to 3a6ca47)
I have not fully read this revised proposal yet, but considering this and kubernetes/kubernetes#110714 at the same time, maybe I am just wrong about this not being an ITP value. It certainly is the easiest API. E.g., that handles the Service-side API (the service producer expressing how they want the service to be consumed), which still feels icky but maybe OK for special cases. I guess kube-proxy would consider that before even looking at hints?
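For illustration, the "Service-side API" version being weighed here would presumably look something like the following (note that `PreferLocal` is not an existing `internalTrafficPolicy` value; it is the hypothetical value under discussion):

```yaml
# Hypothetical only: "PreferLocal" is not an actual internalTrafficPolicy
# value; this is just the shape the Service-side API under discussion would take.
apiVersion: v1
kind: Service
metadata:
  name: node-local-dns
  namespace: kube-system
spec:
  selector:
    k8s-app: node-local-dns
  internalTrafficPolicy: PreferLocal   # prefer a same-node endpoint, fall back otherwise
  ports:
    - name: dns
      port: 53
      protocol: UDP
```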
Having written the KEP both ways now, it feels more topology-like than traffic-policy-like to me. Especially, you can't use the feature properly just by enabling it on the
I do not hate this. The goal with topology was always to build new heuristics and get better at optimizing, so this feels in-line with that.
Instead of setting it to `Auto`, the user could set it to `Node` to indicate node-level topology rather than zone-level.

(This would imply that a service could not use both node-level
Or set it to "NodeZone" ?
Actually, isn't it important that both be possible, so version-skewed (old) proxies which don't know this new hint would fall back on regular topology, and REALLY old ones would fall back on all slices?
~3.5-4 years ago we talked about this being an array `["node", "zone", "*"]` or whatever, enabling the user to state their preferences. We abandoned that because we decided fully automatic is much better; I guess we're circling back around? Or I guess I see automatic mode below.
Thanks for the work on this @danwinship! Sorry I'd lost track of it!
- Allow configuring a service so that connections will be delivered to a local endpoint when possible, and a remote endpoint if not.
Will there be any mechanism to prevent a local endpoint from getting overloaded? With zonal hints that was already a significant risk, but with a single node the risk seems to increase exponentially. If there isn't a good way to mitigate this risk, I think this would need to be opt-in until we had some kind of feedback loop that enabled us to fail over to other endpoints when the local one had reached capacity.
The design is really all based on the DNS user story (since that's the only user story we currently have).
In the case of a CoreDNS pod on every node, the assumption is that no endpoint will ever get overloaded; no one ever does that much DNS (unless they're doing some sort of DoS attack in which case all bets are off). Although kube-proxy is normally responsible for trying to prevent endpoints from getting overloaded, we are saying that for this kind of service, it does not have that responsibility.
I guess I can add some hedging to the description to allow for the possibility that in the future, node-level topology might send traffic to other nodes even when there is an endpoint on this node, if it thinks that that would result in a faster reply. But I didn't have any intention of trying to implement that behavior now.
The main difference I see between "policy" and "topology" is that policy is a user-provided expression of intent and "topology" is a collection of heuristics that we apply (eventually without the user saying anything). If we're ever going to auto-enable this heuristic, how can we build in more safety? If this is not going to be auto-enabled ever, is it topology or policy?
It feels like there must be more use cases for a PreferLocal approach here, but I don't know what those are. If we're focusing on DNS, it seems like something that is going to be configured once per cluster and therefore a perfect candidate for opt-in functionality and not trying to automatically detect this scenario.
I think a good dividing line here is that topology has tried to include some form of logic to determine if it's safe to enable. We're trying to guess what would be a reasonable approach for a user. If we believe that topology in Kubernetes will ever have a feedback loop that allows us to understand when an endpoint has reached capacity, same-node routing would also be in scope. Unfortunately that seems to require very significant architectural changes.
What is proposed here does not include anything like that. Instead it is simply "route to local endpoints if they exist, otherwise revert to default behavior". That seems like it's closer in nature to xTP=Local and it seems to fully solve the DNS use case. This all leads me to think that it may be worth having the following options available:
- xTP=Local
- iTP=PreferLocal
Since xTP doesn't have any kind of fallback mechanism for local traffic, we're really just interested in what would happen for internal traffic. My theory is that it would be reasonable for iTP to fallback to topology hints if it was enabled, or "cluster" if not.
To summarize, I think it's probably simplest to consider this policy and not try to build any advanced logic into this. That means we'd be agreeing that:
- The simple implementation of routing to local endpoints if present is sufficient for the use cases we are aware of
- This is not something we'll ever want to enable by default
- This will generally only be enabled on one Service per cluster
- It is unlikely that we'll have a feedback loop in Kubernetes that allows us to safely spillover when a local endpoint reaches capacity
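A rough sketch of the simple fallback behavior described above (same-node endpoints first, then topology hints if in use, then the whole cluster); all names are illustrative and this is not actual kube-proxy code:

```go
// Sketch of the fallback chain for a hypothetical iTP "PreferLocal":
// same-node endpoints if any, else topology-hinted endpoints, else the
// whole cluster. Illustrative types/names only.
type Endpoint struct {
	IP       string
	NodeName string
	Zones    []string // zone hints attached by the EndpointSlice controller, if any
}

func selectEndpoints(all []Endpoint, localNode, localZone string) []Endpoint {
	var local, hinted []Endpoint
	for _, ep := range all {
		if ep.NodeName == localNode {
			local = append(local, ep)
		}
		for _, z := range ep.Zones {
			if z == localZone {
				hinted = append(hinted, ep)
				break
			}
		}
	}
	if len(local) > 0 {
		return local // a local endpoint exists: use only local endpoints
	}
	if len(hinted) > 0 {
		return hinted // fall back to topology-hinted endpoints
	}
	return all // fall back to default cluster-wide routing
}
```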
One more thought here, I've gotten at least one feature request to just have an equivalent to the "PreferNode" concept here but for zone. Essentially "if any endpoints are in my local zone, route to them". Based on the criteria above, that also does not seem to fit into topology aware hints but would be more similar to whatever this "PreferNode" concept becomes, if we want to support it. In both cases, the users are saying they can ensure that endpoints are distributed appropriately, they just want a way to ensure traffic stays within the bounds they want.
FTR: I was previously against "PreferLocal" as a policy, but the discussion has softened me on that.
- Deprecate `internalTrafficPolicy`? It's clear that the DNS use case given in the Internal Traffic Policy KEP is not actually a good use case for Internal Traffic Policy, because no one wants the behavior of "I'd rather have DNS requests get dropped than have them go to another node". But without the DNS use case, it's not clear that there's really a strong argument for Internal Traffic Policy at all.
What are the main use-cases for this KEP? Is this primarily beneficial when the cost to reach another node within the same zone is too high?
"I want DNS to be fast"
on nodes with only a single endpoint would presumably get overloaded. But this KEP does not attempt to figure out any new behavior there.
Finding some ways to mitigate risk of overloading endpoints feels like it should be part of graduation criteria for this KEP.
For now, the goal was to only change the service behavior in cases where "overloading endpoints" is not a concern.
LoadBalancer services. Consensus is that [Proxy Terminating Endpoints] should solve the problems that made `externalTrafficPolicy: Local` unreliable for some cases.
Would the same apply for `internalTrafficPolicy: Local`?
No. (It's important to remember that iTP:Local and eTP:Local are really totally different things; they have the same name because the very-low-level kube-proxy implementation details are the same, but at a high level they are used in completely different ways.)
The problem with eTP:Local is a race condition where the cloud LB is technically violating the contract of the eTP:Local service by sending traffic to a node that reports that it has no live endpoints. We can't easily fix the LB to not have that race condition, so PTE fixes the problem by making endpoints continue working even after the node claims to have no endpoints.
The problem with iTP:Local is not a race condition, and doesn't involve anyone doing anything illegal; the problem is that even with PTE (or `terminationGracePeriodSeconds: 0`), there is still a gap between when the old endpoint exits and the new one starts serving, and the iTP:Local service is not available during that gap. (Assuming only 1 endpoint per node. But since we were already assuming above that the service will be consistently under-loaded with 1 endpoint per node, it would be very wasteful to have 2 endpoints per node...)
- The service has an endpoint on every node, or at least "almost every" node. (eg, no more than N or N% of nodes are missing an endpoint).
On some occasions I've seen clusters where there are multiple fairly isolated node pools, where a daemonset or set of Pods runs exclusively on a specific node pool. That kind of situation seems like it could benefit from this PreferLocal kind of approach, but this automatic approach does not appear to be compatible. Is there any merit in an approach that simply routes traffic to a local endpoint if it exists and otherwise falls back to default routing?
That's actually already how zonal topology hints work (https://github.com/kubernetes/kubernetes/blob/3af1e5fdf6f3d3203283950c1c501739c21a53e2/pkg/proxy/topology.go#L178).
We can't automatically recognize that node-level topology would be useful in cases like that, though, because just from looking at the Service/DaemonSet, you can't tell how the clients of that service are going to be distributed, so you can't tell if node-level topology would help or hurt.
So that's an argument in favor of the manual approach.
What if the criteria is more like:
For those nodes which DO have at least 1 endpoint, if they all have approximately the same number of endpoints (maybe some %age threshold so we don't end up saying "1 is approximately 2") then prefer-local is allowed.
or even:
Any node with at least the mean number of endpoints can use a prefer-local strategy, and any node with less than the mean must fall back on something else.
Does that work? Does it make this mode more amenable to being automatic?
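One possible reading of that second criterion, sketched under the assumption that the mean is taken over nodes that have at least one endpoint (names are illustrative):

```go
// Sketch of the criterion above: allow a prefer-local strategy only on
// nodes with at least the mean number of endpoints; nodes below the mean
// must fall back on something else.
func nodesAllowedPreferLocal(endpointsPerNode map[string]int) map[string]bool {
	if len(endpointsPerNode) == 0 {
		return nil
	}
	total := 0
	for _, count := range endpointsPerNode {
		total += count
	}
	mean := float64(total) / float64(len(endpointsPerNode))

	allowed := make(map[string]bool, len(endpointsPerNode))
	for node, count := range endpointsPerNode {
		allowed[node] = float64(count) >= mean
	}
	return allowed
}
```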
It might add a new field to EndpointSliceHints (which would be unset in most EndpointSlices).
The more I think about this, the more I think there's no real value in adding a new hints field since we already have the nodeName for each endpoint. If we were going to try to be clever and assign endpoints proportionally to nodes (a la zonal hints), maybe we should add a field.
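For reference, each endpoint in the existing EndpointSlice API already carries a `nodeName`, so a proxy could in principle do same-node filtering without any new hints field; a made-up example:

```yaml
# Existing EndpointSlice shape (names and addresses invented for illustration):
# each endpoint already carries the nodeName a proxy would need for
# same-node filtering, without any additional hints field.
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: kube-dns-abc12
  namespace: kube-system
  labels:
    kubernetes.io/service-name: kube-dns
addressType: IPv4
ports:
  - name: dns
    port: 53
    protocol: UDP
endpoints:
  - addresses: ["10.1.0.5"]
    nodeName: node-a
    conditions:
      ready: true
  - addresses: ["10.2.0.7"]
    nodeName: node-b
    conditions:
      ready: true
```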
If we have "automatic" node-level topology then the new Hint is useful because it lets you see when we decided to use node-level topology for a service. But if we stick with "manual" then you don't need that.
As with Topology Aware Hints, Node-Level Topology would only apply to connections with `Cluster` traffic policy, because `internalTrafficPolicy: Local` semantically _requires_ local delivery
I know we've discussed this before, but I can't remember the background. What was the rationale for this approach over iTP == `PreferLocal`?
See this thread on the PreferLocal PR. We came to a sort of consensus that "traffic policy" describes semantic changes to the service ("source IP must be preserved", "traffic must not be delivered to a remote endpoint"), while PreferLocal was really more like an optimization hint, which made it feel more like topology than traffic policy. This was a somewhat vague argument though, and you could make a case that we're overfitting the definition to match the existing iTP/eTP values...
After writing out the new KEP, I decided that it also feels more like topology than traffic policy for another reason: namely, the fact that it requires the deployer to make sure that endpoints end up in the right places. With `externalTrafficPolicy: Local`, the endpoints can end up anywhere and the service still functions exactly the same way; the location of the endpoints matters at a low level, but it doesn't matter to the deployer or the user of the service. But with `internalTrafficPolicy: PreferLocal` / node-local-topology, the location/distribution of the endpoints does matter; you can't put 10 DNS endpoints on 1 node and none anywhere else and expect it to work well as a PreferLocal service. So that also feels more like topology (where, eg, you have to make sure you have endpoints in every zone if you want to be able to use zone-level topology usefully).

OTOH, `internalTrafficPolicy: Local` also requires you to explicitly think about endpoint location/distribution, so there's another counterargument...
you can't put 10 DNS endpoints on 1 node and none anywhere else and expect it to work well as a PreferLocal service.
Well, akshually... why not? If you assume 10 DNS replicas is sufficient for the whole cluster, you get 1 node which gets fast local access (which it would anyway) and N-1 nodes that get slower remote access (which they would anyway). PreferLocal == Cluster in that case.
I think the only really problematic scenario is when a node has at least 1 replica, but those local replicas are insufficient to satisfy the node's pods. E.g. a 10 node cluster has 30 replicas of some service. 1 node gets 1 replica, 2 nodes get 4 replicas, and 7 nodes get 3 replicas. ON AVERAGE every node has 3, but that node with 1 replica could easily get swamped, while never considering its neighbors who have 3 or 4 replicas. So there IS still a problem, but I think it's different than you characterized it.
OK, trying to summarize the discussion:
So... (spitballing...) what if we added (Alternate: have a And maybe Relative to current Topology-Aware Hints, this loses some of the trying-to-balance-things stuff, so we'd need to incorporate that too...

For what it's worth, the very initial design of topology aware routing was like this, except the field was called
The "prefer node local" case was one of the primary drivers of this since you can do something like:
But we figured the "prefer node local" case was more of a special case that could be codified separately, which is why we created the

So for the node-level case it's easy to ensure that your endpoints are distributed correctly because you just use a DaemonSet. Maybe we need some way to easily configure other means of DaemonSet / Deployment distribution. (I think @thockin was talking about this somewhere.) Eg, a way to say "this Deployment must always have at least one endpoint in every zone". At that point, the Deployment/DaemonSet configuration could also be the opt-in for the Service-level topology; if you have a Deployment with the "deploy at least one endpoint to every zone" hint, then the endpoints controller can mark the EndpointSlices as having zone-level topology...
I'd agree with this approach. From a purely practical perspective, setting hints only makes sense when the value of the hint could be different than where the endpoint is. For example, with Topology Aware Hints, if there are 3 endpoints all in one zone, but nodes are equally distributed across 3 zones, each of those endpoints will be assigned to a different zone with a hint. We could theoretically take a similar approach for preferNode. In an example where there are 3 endpoints on 1 node and 2 nodes without any endpoints, we could assign 1 endpoint to each node with hints. I don't think that approach really has any value though, and it certainly is not solving the DNS use case @danwinship identified here. That means the only reasonable approach appears to be "if endpoint(s) exist on the same node, forward there, otherwise fall back to default routing across cluster." To me, that approach does not seem to gain any value from being tied to the hints name or architecture. Similarly, we could have the same approach for zone, but populating hints again feels unnecessary since they're not required for a proxy implementation to make an endpoint filtering decision.
I'm not sure of the specifics here, but agree that we need to look into what's possible on the scheduling side. I know for topology aware hints, I need to spend some time talking with sig-scheduling and sig-autoscaling to see if we can try to provision new Pods with zone distribution taken into account.

(force-pushed from 3a6ca47 to 1bd3801)

Hey everyone, I've added this to the sig-network agenda for tomorrow, hopefully that can help us reach some kind of consensus on the path forward.

Based on discussion yesterday at the sig-network meeting, it sounds like this KEP may not be as necessary as we originally thought. Combining that with the idea that we likely want to expose additional heuristics for topology aware hints, I've created #3765 to explore what that would look like.
To clarify that: I said that I don't need it for the reasons I originally needed it (and if it's going to be more about topology than about "prefer-local"-ness then I'm not sure I'm the right person to be pushing it forward because I don't know a lot about how people use topology). The reason I wanted KEP-3015 originally was to get rid of two local patches in OpenShift's kube-proxy:
However, (1)

@danwinship what hack did you use for CoreDNS and why did you have to use one at all?

So there's the "preferlocal" thing, but also, it improves performance to disable conntrack for DNS packets (since there are lots of them, and you don't actually need them to be conntracked), and... hm, pretty sure there's one more but I forget what it is right now...

@danwinship how did you disable conntrack for DNS packets?

iptables
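Presumably rules along these lines (illustrative only, not the actual OpenShift patch): NOTRACK rules in the raw table, the same general trick NodeLocal DNSCache uses, so DNS packets to an assumed node-local DNS address skip conntrack:

```sh
# Illustrative only (not the actual patch): skip conntrack for DNS traffic
# to an assumed node-local DNS address. NOTRACK can't be used for traffic
# that still needs the ClusterIP DNAT, which is why a node-local address
# (169.254.20.10 here) is assumed.
iptables -t raw -A PREROUTING -p udp -d 169.254.20.10 --dport 53 -j NOTRACK
iptables -t raw -A OUTPUT     -p udp -d 169.254.20.10 --dport 53 -j NOTRACK
```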
I see 3 paths here.
I don't like ignoring problems, so (1) is disfavored. I was originally against (2) on an argument of "preference is not policy", but I have reconsidered that, and I think I was wrong. I am less excited by (3) the more we dig into it. I suspect we will want the ability to make a different choice for internal and external traffic, which leads me back to (2). I wrote some thoughts in #3765 (review) but here's a little more.

I see topology mostly as an implementation-centric optimization (which should be by default, but that's a different argument). Our in-the-box implementation of endpoint-selection is a controller which produces EndpointSlices and includes some optimization hints for the data plane. Is that the only possible implementation? No. We know our implementation has limits, and until/unless we fix those, it's not really that hard to make a better implementation - and I won't get in the way of projects that want to do that.

Let's consider a hypothetical alternate endpoint-selection implementation which does dynamic feedback and automatic topology optimization. It will consume our Service definitions. What do we think a user would expect / accept:

So, given point 1 (must respect
If we do option 2 (
If we do option 3 (

So, to the people who want this feature: Which makes more sense to you? @LAMRobinson @ialidzhikov ^^^^^^^^^^^^^^^^^^^^^

TL;DR: If we add "Prefer..." to xTP, it's a rule that implementations must follow. If we implement it through topology hints, then it is a suggestion, rather than a rule.

Edit: followup question. If we defined "PreferLocal" as "Always use an endpoint on this node, if possible" and leave room in "if possible" to include "I know this endpoint is overloaded", would that be passable? For the built-in impl, it would choose the same-node endpoint 100% of the time, but it leaves room for other impls to be smarter. I'm mostly trying this on for size - trying to figure out how to give alternate implementations as much freedom as possible.

As an aside: EPSlice sort of serves as both input (user-provided) and as output (controller-provided),
Current policies were already difficult to define, and I think that there are still some gaps, but if Kubernetes networking keeps growing and adding more complex network scenarios I feel that we are going to hit contradictions or "hard to explain" things soon with the traffic policies.
It seems that the biggest demand here is from people that require a certain level of control; however, most people want "make my Services cheap by default, don't use inter-zone traffic if you can, but give me high availability". The current implementation solves the latter, but fails at small scale because of the guardrails of the heuristics.

I would like to experiment with an out-of-tree topology implementation first, to gather feedback before adding anything to Service. For this, we have the signaling/control-plane problem and the data-plane/forwarding problem. I agree topology hints are kind of "this is my recommendation", but they are what we have now, so we can start experimenting with them: #3765 (comment)
Spent some time chatting with @thockin about this; I think we're trying to find a really tough balance here:
I think what most people actually want is to "keep traffic as close as possible, as long as I'm not overloading endpoints." Unfortunately we don't have any visibility into the overloading part right now, and it would be very complex to add that. With that background, I want to ensure we're leaving as much flexibility as possible with any approach here so we're not locking ourselves into supporting hints or a strict interpretation of "PreferLocal|Zone".
I think @thockin's point above is important, but it's hard to draw the right line here. I'd argue that rule names that begin with "Prefer" should be sufficiently flexible to enable different implementations. For example:
Today only 1 is possible, but I wouldn't want the spec to be so tight that it rules out 2.

Finally, I've been trying to fit hints into this view of the world. In my opinion, it should be a way to mitigate the risk involved in the first PreferZone approach described above. So the documentation would say something like this:

Setting TrafficPolicy to "PreferZone" will ensure that traffic is routed to endpoints within the same zone if they're available. Note that this may increase the risk of overloading local endpoints. Enabling Topology Aware Hints can help mitigate that risk...

This would mean that instead of offering
You're talking like selectorless services still exist, but we basically killed them to close a CVE.

With either
So I'd say that traffic policy doesn't conflict with topology; it's just that it filters the set of endpoints before topology does, such that if you have both traffic policy and topology on a service, the topology algorithm may end up seeing only a single endpoint, or a set of endpoints that are all topologically equal.
In some cases, topology is money, and it seems like the user ought to be able to express non-implementation-dependent preferences for keeping traffic within fiscally-relevant topological borders, even if doing otherwise would give slightly better performance. (And other users might want to explicitly express the opposite, because time is money too, and maybe a fast inter-zone request will cost the company less in the end than a slower intra-zone request.)
I think that depends on why the user did "PreferLocal". Maybe they are worried about network saturation. In that case, they might prefer that the request to the PreferLocal service be answered more slowly by an overloaded local endpoint, rather than being passed off to another node and potentially slowing the network down for everybody.

Tim and Rob both seem to be assuming that everyone always wants/needs/expects kube-proxy to prevent endpoint overload, which I think is equivalent to arguing that load distribution is the only reason anyone ever has a service with multiple endpoints. But that's not true. Eg, in the case of CoreDNS on every node, no one does that because they think their cluster needs N CoreDNS pods where N happens to equal the number of nodes. They do it because they think the advantage of having a local DNS pod on every node outweighs the disadvantage of running more CoreDNS pods than they really "need". Likewise, people may have a service that they know a single replica can handle the load for, but they want redundancy or topological spread.

Maybe we need a way to explicitly rank the traffic policies / endpoint selection heuristics that a service wants. (Warning: half-baked thoughts ahead.) Then people could say either
What most people want is to not have to think about this -- for the default to be good enough that the risk of further manual optimization outweighs the benefits.

If I follow your premise, you're saying that xTP="PreferZone" should be allowed to violate the "always choose same zone if possible" rule, which turns it into a (possibly stronger) form of what we have with topology. Do you think that xTP=Cluster would be barred from doing the same? If not, what's the difference between them? I have a deep-running concern that "do the right thing" doesn't end up behind an opt-in (non-default) setting.

Default RBAC prevents USERS from writing slices - it doesn't prevent other systems (think multi-cluster). It's important that MCS and topology work together :)
I like @danwinship's mental model here. It feels like there's a nice combination of options to give users here if we offer both Traffic Policy and Topology Hints.
I like this as well; one of my current bugbears with TP is the lack of "fail open" from local. TP options:
So we could do
I'd love this, though I admit it's just topology keys, isn't it? I guess fixed to a set of known/cluster attributes. It does feel like all these features complement each other really well, and we can set the defaults as they are today such that most people just use hints and don't think, but those that want more control can have it, whilst still being able to also use the hints algo if they want to try and avoid overload within their defined service scopes.
Yes, it is :) Also, we need to keep in mind that we are encumbered by past decisions, so converting the existing policies back into a list is not really feasible. I'll refer back to #3293 (comment) which wants user opinions on how a hypothetical alternate control-plane would work in the presence of these design options.

I think Option 2? I'm happy with Topology (SameNode/SameZone/SameRegion*) being secondary to xTP in priority and it feels like that makes more sense given that xTP is a "fail shut" behaviour. Essentially I don't mind at all if this is "opt in/existing behaviour takes priority" - I just want a way to enforce the desired topology behavior over the default (xTP=cluster) without the black box safety checks of the current Hint auto algorithm implementation.

This is the thing I struggle with. We're considering adding a policy which we KNOW is dangerous because the current heuristic isn't strong enough. Let me ask another loaded question: If the hints logic was retooled so that it was active for you, and resulted in staying in-zone 50% of the time, would that be enough? How about 85%? How about 98%? I'm trying to get a sense of whether there's a functional-correctness aspect of this, or just a performance optimization.
These are good questions. I think it’s naturally hard to give a % answer because the nature of the cause of the non-functional state matters more to me than the strict availability SLA concerns. For example these behaviours currently result in 0%:
Similarly, there are situations where a topology outage for a specific service is significantly worse than just overloading the service; for example, I would rather a client receive a slow response than send that traffic out of zone (cloud zone costs / WAN link limitations).

As I've mentioned somewhere else, in most cases I don't want fail-shut here (as in, I want to fall back eventually to "any instance anywhere") - which means presumably there is some perfect heuristic that'll meet the above situations and result in perfect behaviour. As such it feels like iterating on this solution is the right thing to do, though I'd like to see an extension of the xTP=PreferLocal to include Zonal and Region.

That said, the lack of control around the choices the algorithm makes remains a key frustration about this feature. I want to be able to define what overload means on a per-service basis rather than have the algorithm incorrectly guess for me.

It's unlikely that any heuristic we build will do exactly what you think you want. I do not think that we want to expose all the knobs and parameters that go into the logic we've got. At least not now, not yet. I'm tentatively OK with adding other heuristics and letting people choose among them, but I emphasize that these are intentionally a bit vague, because we reserve the right to keep trying to do better by default.

Related to the "proxy trying to holistically determine endpoint load" thing: persistent clients + periodically-updated servers = massive imbalance: kubernetes/kubernetes#37932 (comment)

Which now asks: should we close this KEP in favor of heuristics, until such time as that proves unworkable?
One-line PR description: new version of KEP-3015, replacing the old "`PreferLocal` traffic policy" idea with "node-level topology"

Issue link: PreferSameNode Traffic Distribution (formerly PreferLocal traffic policy / Node-level topology) #3015

Other comments: See previous discussion of the original `PreferLocal` idea in KEP-3015: PreferLocal traffic policy #3016. We agreed there that this would make more sense as topology than as traffic policy, hence this PR.

/sig network
/cc @robscott @andrewsykim @thockin