KEP-3015: Node-level topology #3293
Thanks for writing this up @danwinship! Added some initial thoughts after a quick run through.
This KEP adds a new topology hint, to tell kube-proxy that a Service is expected to have an endpoint on every node most of the time, and so
What kind of hint is this? Do we need a per-endpoint hint or can kube-proxy just derive that from the Service or some other higher level concept?
I was vague here because it's not fully figured out below yet.
I talk below about kube-controller-manager deriving it, but not kube-proxy, because kube-proxy doesn't watch Pods, so it has less information. I guess kube-proxy could still validate "has endpoints on every node / almost every node", but it can't validate "was created by a DaemonSet". (But maybe we don't care about that? The criteria for deciding when to use this are also undecided.)
I guess one nice thing about having the EndpointSlice controller write to an actual EndpointSlice field is that it makes it very clear when the feature is being used, whereas if kube-proxy decided by itself, you wouldn't be able to tell from looking at the Service/Endpoints.
I agree that a central controller in kube-controller-manager is the best place to determine if this feature should be enabled. I'm just hoping we can find a way to indicate that as centrally as possible for a Service so we're not updating every individual endpoint whenever this changes.
I remember with zone-based hints, one of the biggest concerns was to avoid flapping between an enabled and disabled state (requiring updates to all relevant EPS). I think whatever we do here should have some kind of detail written about how we'd avoid that.
yeah, some hysteresis, like: if < 90% of nodes have an endpoint, disable; if >= 95%, enable; if >= 90% && < 95%, leave it as it is set.
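A minimal sketch of that hysteresis, assuming a hypothetical helper in the EndpointSlice controller (the thresholds and the function name are illustrative, not part of the proposal):

```go
// Sketch of the hysteresis described above: enable only at >= 95% node
// coverage, disable only below 90%, and otherwise keep whatever was
// already decided.
func nodeLevelTopologyEnabled(nodesWithEndpoint, totalNodes int, currentlyEnabled bool) bool {
	if totalNodes == 0 {
		return false
	}
	coverage := float64(nodesWithEndpoint) / float64(totalNodes)
	switch {
	case coverage >= 0.95:
		return true
	case coverage < 0.90:
		return false
	default:
		// Between 90% and 95%: leave the setting as it is.
		return currentlyEnabled
	}
}
```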
### Non-Goals

N/A
Not written here, but my interpretation of this KEP is that we're not trying to prevent a local endpoint from being overloaded. This is essentially just an extension of ExternalTrafficPolicy=Local, except when a local endpoint is lacking, there's a fallback to Topology Hints or default routing logic. I think it's important to call out somewhere that it's very possible for individual endpoints to be overloaded with this approach and that this KEP does not aim to solve that problem.
This is essentially just an extension of ExternalTrafficPolicy=Local
Well, no. The discussion in #3016 seemed to me to be saying that local traffic policy implies a semantic distinction between local and remote delivery, and the fact that it is more efficient is really just a side effect. Whereas this is solely about efficiency.
Although that's true, I think some/many users choose to use ExternalTrafficPolicy for the efficiency; the implementation details are secondary. My guess is that the usage of this feature would end up being significantly higher than ExternalTrafficPolicy=local.
As currently written, this can't serve as a replacement for `externalTrafficPolicy: Local`, because it doesn't coordinate with the LoadBalancer parts, so you'd have no way to force incoming connections to end up on the right node.
We could adjust things so LoadBalancer services do try to take advantage of node-level topology, but if we were going to do that, it seems like really we should just go back to `externalTrafficPolicy: PreferLocal`.
I guess you could make the argument that for external traffic policy, this is a semantic thing, not just a topology thing; you don't just want to change which endpoints get picked, you also want to change how the cloud load balancers work.
OTOH, for internal traffic, the desired behavior for the primary use case (DNS) really is just "better topology". OTOOH, I am still totally unconvinced that there is any real use case for `internalTrafficPolicy: Local` as opposed to node-level topology.
So maybe the real answer is:
- add `externalTrafficPolicy: PreferLocal`
- add node-level topology for services like DNS
- drop `internalTrafficPolicy`
(although also, it was argued that ProxyTerminatingEndpoints mostly gets rid of the need for `externalTrafficPolicy: PreferLocal` anyway, by fixing the awkward bits of `externalTrafficPolicy: Local` so you can use it just as an optimization if you want.)
As currently written, this can't serve as a replacement for `externalTrafficPolicy: Local`

(...which is obviously a bug since one of the User Stories talks about doing that)
(This would imply that a service could not use both node-level topology and zone-level topology. Another possibility would be to have a new annotation for node-level topology.)
Based on my comment above, I think the best approach is to integrate this more closely with topology hints because distributions may need to change depending on how endpoints are distributed. Having separate paths in EndpointSlice controller for either approach could get rather complicated.
- The pods that make up the service endpoints all have an `OwnerReference` pointing to the same `DaemonSet`.
This seems pretty restrictive but could be a reasonable starting point. May have messy transition states if a DaemonSet is replaced.
It's not really that messy; when the first few nodes lose their endpoints for the old service, they would automatically fall back to using Cluster semantics. After enough nodes lost endpoints, the EndpointSlice controller would turn off node-level topology for the service and all the remaining nodes would flip to Cluster semantics. There would be less total churn than there would be without node-level topology, since the first few endpoint deletions would be ignored by most of the nodes.
Why do we need this criterion at all?
See this comment in a collapsed thread. (TL;DR: "service that is intentionally deployed to every node" is semantically different from "service which coincidentally happens to be deployed to every node" in relevant ways.)
Here's something I have seen:
Cluster has 3 node-pools: small, medium, large, with different sized machines. They run 3 different daemonsets with node selectors for each pool, and different resource requests on each.
"Prefer local" would be ideal but fails this criterion.
(force-pushed from 0594c9e to 3a6ca47)
I have not fully read this revised proposal yet, but considering this and kubernetes/kubernetes#110714 at the same time, maybe I am just wrong about this not being an ITP value. It certainly is the easiest API. E.g., that handles the Service-side API (the service producer expressing how they want the service to be consumed), which still feels icky but maybe OK for special cases. I guess kube-proxy would consider that before even looking at hints?
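For illustration, the "Service-side API" version being weighed here would presumably look something like the following (note that `PreferLocal` is not an existing `internalTrafficPolicy` value; it is the hypothetical value under discussion):

```yaml
# Hypothetical only: "PreferLocal" is not an actual internalTrafficPolicy
# value; this is just the shape the Service-side API under discussion would take.
apiVersion: v1
kind: Service
metadata:
  name: node-local-dns
  namespace: kube-system
spec:
  selector:
    k8s-app: node-local-dns
  internalTrafficPolicy: PreferLocal   # prefer a same-node endpoint, fall back otherwise
  ports:
    - name: dns
      port: 53
      protocol: UDP
```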
Having written the KEP both ways now, it feels more topology-like than traffic-policy-like to me. Especially, you can't use the feature properly just by enabling it on the
I do not hate this. The goal with topology was always to build new heuristics and get better at optimizing, so this feels in-line with that.
Instead of setting it to `Auto`, the user could set it to `Node` to indicate node-level topology rather than zone-level.

(This would imply that a service could not use both node-level
Or set it to "NodeZone" ?
Actually, isn't it important that both be possible, so version-skewed (old) proxies which don't know this new hint would fall back on regular topology, and REALLY old ones would fall back on all slices?
~3.5-4 years ago we talked about this being an array `["node", "zone", "*"]` or whatever, enabling the user to state their preferences. We abandoned that because we decided fully automatic is much better; I guess we're circling back around? Or I guess I see automatic mode below.
Thanks for the work on this @danwinship! Sorry I'd lost track of it!
- Allow configuring a service so that connections will be delivered to a local endpoint when possible, and a remote endpoint if not.
Will there be any mechanism to prevent a local endpoint from getting overloaded? With zonal hints that was already a significant risk, but with a single node the risk seems to increase exponentially. If there isn't a good way to mitigate this risk, I think this would need to be opt-in until we had some kind of feedback loop that enabled us to fail over to other endpoints when the local one had reached capacity.
The design is really all based on the DNS user story (since that's the only user story we currently have).
In the case of a CoreDNS pod on every node, the assumption is that no endpoint will ever get overloaded; no one ever does that much DNS (unless they're doing some sort of DoS attack in which case all bets are off). Although kube-proxy is normally responsible for trying to prevent endpoints from getting overloaded, we are saying that for this kind of service, it does not have that responsibility.
I guess I can add some hedging to the description to allow for the possibility that in the future, node-level topology might send traffic to other nodes even when there is an endpoint on this node, if it thinks that that would result in a faster reply. But I didn't have any intention of trying to implement that behavior now.
The main difference I see between "policy" and "topology" is that policy is a user-provided expression of intent and "topology" is a collection of heuristics that we apply (eventually without the user saying anything). If we're ever going to auto-enable this heuristic, how can we build in more safety? If this is not going to be auto-enabled ever, is it topology or policy?
It feels like there must be more use cases for a PreferLocal approach here, but I don't know what those are. If we're focusing on DNS, it seems like something that is going to be configured once per cluster and therefore a perfect candidate for opt-in functionality and not trying to automatically detect this scenario.
I think a good dividing line here is that topology has tried to include some form of logic to determine if it's safe to enable. We're trying to guess what would be a reasonable approach for a user. If we believe that topology in Kubernetes will ever have a feedback loop that allows us to understand when an endpoint has reached capacity, same-node routing would also be in scope. Unfortunately that seems to require very significant architectural changes.
What is proposed here does not include anything like that. Instead it is simply "route to local endpoints if they exist, otherwise revert to default behavior". That seems like it's closer in nature to xTP=Local and it seems to fully solve the DNS use case. This all leads me to think that it may be worth having the following options available:
- xTP=Local
- iTP=PreferLocal
Since xTP doesn't have any kind of fallback mechanism for local traffic, we're really just interested in what would happen for internal traffic. My theory is that it would be reasonable for iTP to fallback to topology hints if it was enabled, or "cluster" if not.
To summarize, I think it's probably simplest to consider this policy and not try to build any advanced logic into this. That means we'd be agreeing that:
- The simple implementation of routing to local endpoints if present is sufficient for the use cases we are aware of
- This is not something we'll ever want to enable by default
- This will generally only be enabled on one Service per cluster
- It is unlikely that we'll have a feedback loop in Kubernetes that allows us to safely spillover when a local endpoint reaches capacity
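A rough sketch of the simple fallback behavior described above (same-node endpoints first, then topology hints if in use, then the whole cluster); all names are illustrative and this is not actual kube-proxy code:

```go
// Sketch of the fallback chain for a hypothetical iTP "PreferLocal":
// same-node endpoints if any, else topology-hinted endpoints, else the
// whole cluster. Illustrative types/names only.
type Endpoint struct {
	IP       string
	NodeName string
	Zones    []string // zone hints attached by the EndpointSlice controller, if any
}

func selectEndpoints(all []Endpoint, localNode, localZone string) []Endpoint {
	var local, hinted []Endpoint
	for _, ep := range all {
		if ep.NodeName == localNode {
			local = append(local, ep)
		}
		for _, z := range ep.Zones {
			if z == localZone {
				hinted = append(hinted, ep)
				break
			}
		}
	}
	if len(local) > 0 {
		return local // a local endpoint exists: use only local endpoints
	}
	if len(hinted) > 0 {
		return hinted // fall back to topology-hinted endpoints
	}
	return all // fall back to default cluster-wide routing
}
```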
One more thought here, I've gotten at least one feature request to just have an equivalent to the "PreferNode" concept here but for zone. Essentially "if any endpoints are in my local zone, route to them". Based on the criteria above, that also does not seem to fit into topology aware hints but would be more similar to whatever this "PreferNode" concept becomes, if we want to support it. In both cases, the users are saying they can ensure that endpoints are distributed appropriately, they just want a way to ensure traffic stays within the bounds they want.
FTR: I was previously against "PreferLocal" as a policy, but the discussion has softened me on that.
- Deprecate `internalTrafficPolicy`? It's clear that the DNS use case given in the Internal Traffic Policy KEP is not actually a good use case for Internal Traffic Policy, because no one wants the behavior of "I'd rather have DNS requests get dropped than have them go to another node". But without the DNS use case, it's not clear that there's really a strong argument for Internal Traffic Policy at all.
What are the main use-cases for this KEP? Is this primarily beneficial when the cost to reach another node within the same zone is too high?
"I want DNS to be fast"
on nodes with only a single endpoint would presumably get overloaded. But this KEP does not attempt to figure out any new behavior there.
Finding some ways to mitigate risk of overloading endpoints feels like it should be part of graduation criteria for this KEP.
For now, the goal was to only change the service behavior in cases where "overloading endpoints" is not a concern.
LoadBalancer services. Consensus is that [Proxy Terminating Endpoints] should solve the problems that made `externalTrafficPolicy: Local` unreliable for some cases.
Would the same apply for `internalTrafficPolicy: Local`?
No. (It's important to remember that iTP:Local and eTP:Local are really totally different things; they have the same name because the very-low-level kube-proxy implementation details are the same, but at a high level they are used in completely different ways.)
The problem with eTP:Local is a race condition where the cloud LB is technically violating the contract of the eTP:Local service by sending traffic to a node that reports that it has no live endpoints. We can't easily fix the LB to not have that race condition, so PTE fixes the problem by making endpoints continue working even after the node claims to have no endpoints.
The problem with iTP:Local is not a race condition, and doesn't involve anyone doing anything illegal; the problem is that even with PTE (or `terminationGracePeriodSeconds: 0`), there is still a gap between when the old endpoint exits and the new one starts serving, and the iTP:Local service is not available during that gap. (Assuming only 1 endpoint per node. But since we were already assuming above that the service will be consistently under-loaded with 1 endpoint per node, it would be very wasteful to have 2 endpoints per node...)
- The service has an endpoint on every node, or at least "almost every" node. (eg, no more than N or N% of nodes are missing an endpoint).
On some occasions I've seen clusters where there are multiple fairly isolated node pools, where a daemonset or set of Pods runs exclusively on a specific node pool. That kind of situation seems like it could benefit from this PreferLocal kind of approach, but this automatic approach does not appear to be compatible. Is there any merit in an approach that simply routes traffic to a local endpoint if it exists and otherwise falls back to default routing?
That's actually already how zonal topology hints work (https://github.com/kubernetes/kubernetes/blob/3af1e5fdf6f3d3203283950c1c501739c21a53e2/pkg/proxy/topology.go#L178).
We can't automatically recognize that node-level topology would be useful in cases like that, though, because just from looking at the Service/DaemonSet, you can't tell how the clients of that service are going to be distributed, so you can't tell if node-level topology would help or hurt.
So that's an argument in favor of the manual approach.
What if the criteria is more like:
For those nodes which DO have at least 1 endpoint, if they all have approximately the same number of endpoints (maybe some %age threshold so we don't end up saying "1 is approximately 2") then prefer-local is allowed.
or even:
Any node with at least the mean number of endpoints can use a prefer-local strategy, and any node with less than the mean must fall back on something else.
Does that work? Does it make this mode more amenable to being automatic?
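One possible reading of that second criterion, sketched under the assumption that the mean is taken over nodes that have at least one endpoint (names are illustrative):

```go
// Sketch of the criterion above: allow a prefer-local strategy only on
// nodes with at least the mean number of endpoints; nodes below the mean
// must fall back on something else.
func nodesAllowedPreferLocal(endpointsPerNode map[string]int) map[string]bool {
	if len(endpointsPerNode) == 0 {
		return nil
	}
	total := 0
	for _, count := range endpointsPerNode {
		total += count
	}
	mean := float64(total) / float64(len(endpointsPerNode))

	allowed := make(map[string]bool, len(endpointsPerNode))
	for node, count := range endpointsPerNode {
		allowed[node] = float64(count) >= mean
	}
	return allowed
}
```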
It might add a new field to EndpointSliceHints (which would be unset in most EndpointSlices).
The more I think about this, the more I think there's no real value in adding a new hints field since we already have the nodeName for each endpoint. If we were going to try to be clever and assign endpoints proportionally to nodes (a la zonal hints), maybe we should add a field.
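For reference, each endpoint in the existing EndpointSlice API already carries a `nodeName`, so a proxy could in principle do same-node filtering without any new hints field; a made-up example:

```yaml
# Existing EndpointSlice shape (names and addresses invented for illustration):
# each endpoint already carries the nodeName a proxy would need for
# same-node filtering, without any additional hints field.
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: kube-dns-abc12
  namespace: kube-system
  labels:
    kubernetes.io/service-name: kube-dns
addressType: IPv4
ports:
  - name: dns
    port: 53
    protocol: UDP
endpoints:
  - addresses: ["10.1.0.5"]
    nodeName: node-a
    conditions:
      ready: true
  - addresses: ["10.2.0.7"]
    nodeName: node-b
    conditions:
      ready: true
```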
If we have "automatic" node-level topology then the new Hint is useful because it lets you see when we decided to use node-level topology for a service. But if we stick with "manual" then you don't need that.
As with Topology Aware Hints, Node-Level Topology would only apply to connections with `Cluster` traffic policy, because `internalTrafficPolicy: Local` semantically _requires_ local delivery
I know we've discussed this before, but I can't remember the background. What was the rationale for this approach over iTP == `PreferLocal`?
See this thread on the PreferLocal PR. We came to a sort of consensus that "traffic policy" describes semantic changes to the service ("source IP must be preserved", "traffic must not be delivered to a remote endpoint"), while PreferLocal was really more like an optimization hint, which made it feel more like topology than traffic policy. This was a somewhat vague argument though, and you could make a case that we're overfitting the definition to match the existing iTP/eTP values...
After writing out the new KEP, I decided that it also feels more like topology than traffic policy for another reason: namely, the fact that it requires the deployer to make sure that endpoints end up in the right places. With `externalTrafficPolicy: Local`, the endpoints can end up anywhere and the service still functions exactly the same way; the location of the endpoints matters at a low level, but it doesn't matter to the deployer or the user of the service. But with `internalTrafficPolicy: PreferLocal` / node-local-topology, the location/distribution of the endpoints does matter; you can't put 10 DNS endpoints on 1 node and none anywhere else and expect it to work well as a PreferLocal service. So that also feels more like topology (where, eg, you have to make sure you have endpoints in every zone if you want to be able to use zone-level topology usefully).

OTOH, `internalTrafficPolicy: Local` also requires you to explicitly think about endpoint location/distribution, so there's another counterargument...
you can't put 10 DNS endpoints on 1 node and none anywhere else and expect it to work well as a PreferLocal service.
Well, akshually... why not? If you assume 10 DNS replicas is sufficient for the whole cluster, you get 1 node which gets fast local access (which it would anyway) and N-1 nodes that get slower remote access (which they would anyway). PreferLocal == Cluster in that case.
I think the only really problematic scenario is when a node has at least 1 replica, but those local replicas are insufficient to satisfy the node's pods. E.g. a 10 node cluster has 30 replicas of some service. 1 node gets 1 replica, 2 nodes get 4 replicas, and 7 nodes get 3 replicas. ON AVERAGE every node has 3, but that node with 1 replica could easily get swamped, while never considering its neighbors who have 3 or 4 replicas. So there IS still a problem, but I think it's different than you characterized it.
OK, trying to summarize the discussion:
So... (spitballing...) what if we added (Alternate: have a And maybe Relative to current Topology-Aware Hints, this loses some of the trying-to-balance-things stuff, so we'd need to incorporate that too...

For what it's worth, the very initial design of topology aware routing was like this, except the field was called
The "prefer node local" case was one of the primary drivers of this since you can do something like:
But we figured the "prefer node local" case was more of a special case that could be codified separately, which is why we created the

So for the node-level case it's easy to ensure that your endpoints are distributed correctly because you just use a DaemonSet. Maybe we need some way to easily configure other means of DaemonSet / Deployment distribution. (I think @thockin was talking about this somewhere.) Eg, a way to say "this Deployment must always have at least one endpoint in every zone". At that point, the Deployment/DaemonSet configuration could also be the opt-in for the Service-level topology; if you have a Deployment with the "deploy at least one endpoint to every zone" hint, then the endpoints controller can mark the EndpointSlices as having zone-level topology...
I'd agree with this approach. From a purely practical perspective, setting hints only makes sense when the value of the hint could be different than where the endpoint is. For example, with Topology Aware Hints, if there are 3 endpoints all in one zone, but nodes are equally distributed across 3 zones, each of those endpoints will be assigned to a different zone with a hint. We could theoretically take a similar approach for preferNode. In an example where there are 3 endpoints on 1 node and 2 nodes without any endpoints, we could assign 1 endpoint to each node with hints. I don't think that approach really has any value though, and it certainly is not solving the DNS use case @danwinship identified here. That means the only reasonable approach appears to be "if endpoint(s) exist on the same node, forward there, otherwise fall back to default routing across cluster." To me, that approach does not seem to gain any value from being tied to the hints name or architecture. Similarly, we could have the same approach for zone, but populating hints again feels unnecessary since they're not required for a proxy implementation to make an endpoint filtering decision.
I'm not sure of the specifics here, but agree that we need to look into what's possible on the scheduling side. I know for topology aware hints, I need to spend some time talking with sig-scheduling and sig-autoscaling to see if we can try to provision new Pods with zone distribution taken into account.

(force-pushed from 3a6ca47 to 1bd3801)

Hey everyone, I've added this to the sig-network agenda for tomorrow, hopefully that can help us reach some kind of consensus on the path forward.

Based on discussion yesterday at the sig-network meeting, it sounds like this KEP may not be as necessary as we originally thought. Combining that with the idea that we likely want to expose additional heuristics for topology aware hints, I've created #3765 to explore what that would look like.
To clarify that: I said that I don't need it for the reasons I originally needed it (and if it's going to be more about topology than about "prefer-local"-ness then I'm not sure I'm the right person to be pushing it forward because I don't know a lot about how people use topology). The reason I wanted KEP-3015 originally was to get rid of two local patches in OpenShift's kube-proxy:
However, (1)

@danwinship what hack did you use for CoreDNS and why did you have to use one at all?

So there's the "preferlocal" thing, but also, it improves performance to disable conntrack for DNS packets (since there are lots of them, and you don't actually need them to be conntracked), and... hm, pretty sure there's one more but I forget what it is right now...

@danwinship how did you disable conntrack for DNS packets?

iptables
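Presumably rules along these lines (illustrative only, not the actual OpenShift patch): NOTRACK rules in the raw table, the same general trick NodeLocal DNSCache uses, so DNS packets to an assumed node-local DNS address skip conntrack:

```sh
# Illustrative only (not the actual patch): skip conntrack for DNS traffic
# to an assumed node-local DNS address. NOTRACK can't be used for traffic
# that still needs the ClusterIP DNAT, which is why a node-local address
# (169.254.20.10 here) is assumed.
iptables -t raw -A PREROUTING -p udp -d 169.254.20.10 --dport 53 -j NOTRACK
iptables -t raw -A OUTPUT     -p udp -d 169.254.20.10 --dport 53 -j NOTRACK
```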
I see 3 paths here.
I don't like ignoring problems, so (1) is disfavored. I was originally against (2) on an argument of "preference is not policy", but I have reconsidered that, and I think I was wrong. I am less excited by (3) the more we dig into it. I suspect we will want the ability to make a different choice for internal and external traffic, which leads me back to (2). I wrote some thoughts in #3765 (review) but here's a little more.

I see topology mostly as an implementation-centric optimization (which should be by default, but that's a different argument). Our in-the-box implementation of endpoint-selection is a controller which produces EndpointSlices and includes some optimization hints for the data plane. Is that the only possible implementation? No. We know our implementation has limits, and until/unless we fix those, it's not really that hard to make a better implementation - and I won't get in the way of projects that want to do that.

Let's consider a hypothetical alternate endpoint-selection implementation which does dynamic feedback and automatic topology optimization. It will consume our Service definitions. What do we think a user would expect / accept:

So, given point 1 (must respect
If we do option 2 (
If we do option 3 (

So, to the people who want this feature: Which makes more sense to you? @LAMRobinson @ialidzhikov ^^^^^^^^^^^^^^^^^^^^^

TL;DR: If we add "Prefer..." to xTP, it's a rule that implementations must follow. If we implement it through topology hints, then it is a suggestion, rather than a rule.

Edit: followup question. If we defined "PreferLocal" as "Always use an endpoint on this node, if possible" and leave room in "if possible" to include "I know this endpoint is overloaded", would that be passable? For the built-in impl, it would choose the same-node endpoint 100% of the time, but it leaves room for other impls to be smarter. I'm mostly trying this on for size - trying to figure out how to give alternate implementations as much freedom as possible.

As an aside: EPSlice sort of serves as both input (user-provided) and as output (controller-provided),
Current policies were already difficult to define, and I think that there are still some gaps, but if Kubernetes networking keeps growing and adding more complex network scenarios I feel that we are going to hit contradictions or "hard to explain" things soon with the traffic policies.
It seems that the biggest demand here is from people that require a certain level of control; however, most people want "make my Services cheap by default, don't use inter-zone traffic if you can, but give me high availability". The current implementation solves the latter, but fails at small scale because of the guardrails of the heuristics.

I would like to experiment with an out-of-tree topology implementation first, to gather feedback before adding anything to Service. For this, we have the signaling/control-plane problem and the data-plane/forwarding problem. I agree topology hints are kind of "this is my recommendation", but they are what we have now, so we can start experimenting with them: #3765 (comment)
Spent some time chatting with @thockin about this; I think we're trying to find a really tough balance here:
I think what most people actually want is to "keep traffic as close as possible, as long as I'm not overloading endpoints." Unfortunately we don't have any visibility into the overloading part right now, and it would be very complex to add that. With that background, I want to ensure we're leaving as much flexibility as possible with any approach here so we're not locking ourselves into supporting hints or a strict interpretation of "PreferLocal|Zone".
I think @thockin's point above is important, but it's hard to draw the right line here. I'd argue that rule names that begin with "Prefer" should be sufficiently flexible to enable different implementations. For example:
Today only 1 is possible, but I wouldn't want the spec to be so tight that it rules out 2.

Finally, I've been trying to fit hints into this view of the world. In my opinion, it should be a way to mitigate the risk involved in the first PreferZone approach described above. So the documentation would say something like this:

Setting TrafficPolicy to "PreferZone" will ensure that traffic is routed to endpoints within the same zone if they're available. Note that this may increase the risk of overloading local endpoints. Enabling Topology Aware Hints can help mitigate that risk...

This would mean that instead of offering
You're talking like selectorless services still exist, but we basically killed them to close a CVE.

With either
So I'd say that traffic policy doesn't conflict with topology; it's just that it filters the set of endpoints before topology does, such that if you have both traffic policy and topology on a service, the topology algorithm may end up seeing only a single endpoint, or a set of endpoints that are all topologically equal.
In some cases, topology is money, and it seems like the user ought to be able to express non-implementation-dependent preferences for keeping traffic within fiscally-relevant topological borders, even if doing otherwise would give slightly better performance. (And other users might want to explicitly express the opposite, because time is money too, and maybe a fast inter-zone request will cost the company less in the end than a slower intra-zone request.)
I think that depends on why the user did "PreferLocal". Maybe they are worried about network saturation. In that case, they might prefer that the request to the PreferLocal service be answered more slowly by an overloaded local endpoint, rather than being passed off to another node and potentially slowing the network down for everybody.

Tim and Rob both seem to be assuming that everyone always wants/needs/expects kube-proxy to prevent endpoint overload, which I think is equivalent to arguing that load distribution is the only reason anyone ever has a service with multiple endpoints. But that's not true. Eg, in the case of CoreDNS on every node, no one does that because they think their cluster needs N CoreDNS pods where N happens to equal the number of nodes. They do it because they think the advantage of having a local DNS pod on every node outweighs the disadvantage of running more CoreDNS pods than they really "need". Likewise, people may have a service that they know a single replica can handle the load for, but they want redundancy or topological spread.

Maybe we need a way to explicitly rank the traffic policies / endpoint selection heuristics that a service wants. (Warning: half-baked thoughts ahead.) Then people could say either
What most people want is to not have to think about this -- for the default to be good enough that the risk of further manual optimization outweighs the benefits.

If I follow your premise, you're saying that xTP="PreferZone" should be allowed to violate the "always choose same zone if possible" rule, which turns it into a (possibly stronger) form of what we have with topology. Do you think that xTP=Cluster would be barred from doing the same? If not, what's the difference between them? I have a deep-running concern that "do the right thing" doesn't end up behind an opt-in (non-default) setting.

Default RBAC prevents USERS from writing slices - it doesn't prevent other systems (think multi-cluster). It's important that MCS and topology work together :)
I like @danwinship's mental model here. It feels like there's a nice combination of options to give users here if we offer both Traffic Policy and Topology Hints.
I like this as well; one of my current bugbears with TP is the lack of "fail open" from local. TP options:
So we could do
I'd love this, though I admit it's just topology keys, isn't it? I guess fixed to a set of known/cluster attributes. It does feel like all these features complement each other really well, and we can set the defaults as they are today such that most people just use hints and don't think, but those that want more control can have it, whilst still being able to also use the hints algo if they want to try and avoid overload within their defined service scopes.
Yes, it is :) Also, we need to keep in mind that we are encumbered by past decisions, so converting the existing policies back into a list is not really feasible. I'll refer back to #3293 (comment) which wants user opinions on how a hypothetical alternate control-plane would work in the presence of these design options.

I think Option 2? I'm happy with Topology (SameNode/SameZone/SameRegion*) being secondary to xTP in priority and it feels like that makes more sense given that xTP is a "fail shut" behaviour. Essentially I don't mind at all if this is "opt in/existing behaviour takes priority" - I just want a way to enforce the desired topology behavior over the default (xTP=cluster) without the black box safety checks of the current Hint auto algorithm implementation.

This is the thing I struggle with. We're considering adding a policy which we KNOW is dangerous because the current heuristic isn't strong enough. Let me ask another loaded question: If the hints logic was retooled so that it was active for you, and resulted in staying in-zone 50% of the time, would that be enough? How about 85%? How about 98%? I'm trying to get a sense of whether there's a functional-correctness aspect of this, or just a performance optimization.
These are good questions. I think it’s naturally hard to give a % answer because the nature of the cause of the non-functional state matters more to me than the strict availability SLA concerns. For example these behaviours currently result in 0%:
Similarly, there are situations where a topology outage for a specific service is significantly worse than just overloading the service; for example, I would rather a client receive a slow response than send that traffic out of zone (cloud zone costs / WAN link limitations).

As I've mentioned somewhere else, in most cases I don't want fail-shut here (as in, I want to fall back eventually to "any instance anywhere") - which means presumably there is some perfect heuristic that'll meet the above situations and result in perfect behaviour. As such it feels like iterating on this solution is the right thing to do, though I'd like to see an extension of the xTP=PreferLocal to include Zonal and Region.

That said, the lack of control around the choices the algorithm makes remains a key frustration about this feature. I want to be able to define what overload means on a per-service basis rather than have the algorithm incorrectly guess for me.

It's unlikely that any heuristic we build will do exactly what you think you want. I do not think that we want to expose all the knobs and parameters that go into the logic we've got. At least not now, not yet. I'm tentatively OK with adding other heuristics and letting people choose among them, but I emphasize that these are intentionally a bit vague, because we reserve the right to keep trying to do better by default.

Related to the "proxy trying to holistically determine endpoint load" thing: persistent clients + periodically-updated servers = massive imbalance: kubernetes/kubernetes#37932 (comment)

Which now asks: should we close this KEP in favor of heuristics, until such time as that proves unworkable?
One-line PR description: new version of KEP-3015, replacing the old "`PreferLocal` traffic policy" idea with "node-level topology"

Issue link: PreferSameNode Traffic Distribution (formerly PreferLocal traffic policy / Node-level topology) #3015

Other comments: See previous discussion of the original `PreferLocal` idea in KEP-3015: PreferLocal traffic policy #3016. We agreed there that this would make more sense as topology than as traffic policy, hence this PR.

/sig network
/cc @robscott @andrewsykim @thockin