
Load balancer extensibility #5598

Closed · htuch opened this issue Jan 14, 2019 · 23 comments · Fixed by #17400

Labels: area/load balancing, design proposal (Needs design doc/proposal before implementation), help wanted (Needs help!)

Comments

@htuch (Member) commented Jan 14, 2019

Load balancers seem to be a natural first-class extension point in Envoy, but we don't support this today. As load balancer behaviors become more complicated (e.g. witness issues such as #4685 and the need for custom locality handling in Istio; CC @rshriram @costinm), it would be great to allow for LB extensions and even CDS delivery of LB behaviors via Lua/WASM.

Complications that make this challenging include the tight integration between the LB and various ClusterManager data structures, such as the host, priority, and locality structures. We would need a tighter, more stable, and better-defined API here. Also, allowing reuse of existing LBs with just minor customization, without having to reimplement the entire LB, would be useful.

I'm opening this issue for discussion and long-term evolution of the LB implementation; help wanted.

CC @mattklein123 @snowp @cpakulski @rshriram @costinm

htuch added the design proposal (Needs design doc/proposal before implementation) and help wanted (Needs help!) labels on Jan 14, 2019
@venilnoronha (Member)

/sub

@mattklein123 (Member)

@htuch heads up that I was discussing this with @HenryYYang today, and I don't think we are going to end up needing this for the Redis cluster work, so Lyft won't be implementing it.

@htuch (Member, Author) commented Mar 1, 2019

Ack; it's still on our backlog but not a super high priority.

@snowp (Contributor) commented Nov 23, 2019

@markdroth I see you landed #7744 a while ago, which seems to tie nicely into this issue. I was going to spend some time thinking about how to do this over the next few months, and the config approach in your PR is very similar to what I was thinking. Is this something you're actively working on?

@markdroth (Contributor)

@snowp My impetus for #7744 was for gRPC clients being configured via xDS, not for Envoy, so I am not personally working on the Envoy-side changes to support this. But I do agree that this should be implemented in Envoy, and we had discussed the possibility of @htuch taking this on at some point. If this is something you need sooner, though, I suspect he would not mind if you took it on.

No matter who does the Envoy-side work, I would be happy to consult on functionality and semantics, to make sure that things stay consistent between Envoy and gRPC.

@htuch (Member, Author) commented Dec 2, 2019

@markdroth yeah, @snowp and I chatted at EnvoyCon. He is the Envoy-side domain expert in this area, so it would be awesome if he can own this for us.

@snowp (Contributor) commented Dec 6, 2019

One thing I'm thinking about is whether it would make sense to move the generic LB configuration to the ClusterLoadAssignment proto, which would make it possible to reconfigure arbitrary LB details through EDS (or CDS, since the proto is inlined). If we were to decouple endpoint details from the LB configuration (e.g. by using the named_endpoints) and provide a way to cross-reference the structure and the endpoints, it seems like we'd be able to provide a very flexible API for statically compiling in custom load balancers.

If this general approach seems reasonable I’d be happy to put together a doc.

@markdroth (Contributor)

@snowp I don't quite understand what you mean about cross-referencing the structure and the endpoints. Can you give a concrete example of how this might work?

In general, it seems like EDS is mainly dynamic data (i.e., it changes to shift load around), whereas CDS is more configuration data (i.e., it changes only when humans modify it), and I would think that the LB policy configuration fits more into the latter category. But if there are good reasons to do it in EDS, I'm not necessarily opposed.

I have actually considered putting the LB config in EDS at least twice, and both times we didn't wind up going with that approach. Let me provide some context to explain when I considered that and why I didn't go with that approach.

gRPC has the ability to independently select the policy for each level of the routing hierarchy -- i.e., we can choose the policy for picking the locality and then separately choose the policy for picking the endpoint within the locality. We do this by having the locality-picking policy create a child policy for each locality. For each request, the parent policy picks the locality and delegates to the child policy for the chosen locality to pick the endpoint within that locality. In effect, we have a tree of LB policies whose structure matches that of the organizational hierarchy of the endpoints.

I specifically wanted to be able to support that structure when I added the new fields in #7744. The approach that I went with was essentially the same one that we use in gRPC: we have the config for the parent policy include a field that tells it what child policy to use and what config to pass to the child policy. So, for example, if we had a locality-picking policy called "closest_locality", its config might be expressed using the following proto message:

message ClosestLocalityLbConfig {
  envoy.api.v2.LoadBalancingPolicy child_policy = 1;
}

We could use this to configure the "closest_locality" LB policy for locality picking and then independently choose a "weighted_round_robin" policy for endpoint picking. Here's how it would look in CDS:

load_balancing_policy: {
  policies: {
    name: "io.grpc.builtin.closest_locality"
    typed_config: {
      type_url: "type.googleapis.com/io.grpc.ClosestLocalityLbConfig"
      value: {
        child_policy: {
          policies: {
            name: "io.grpc.builtin.weighted_round_robin"
          }
        }
      }
    }
  }
}

While working on #7744, I had originally considered saying that we would configure the locality-picking policy in CDS and then the endpoint-picking scheme in the Locality message in EDS, so that the latter could potentially even be overridden on a per-locality basis. That approach would have worked fine for hierarchical policies, but it would not have provided an intuitive way to represent the existing Envoy LB policies that are non-hierarchical. For example, I understand that Envoy's current ROUND_ROBIN policy handles weighting for both localities and endpoints in a single mechanism. Because it's non-hierarchical, it's not clear how it would be represented in a config that wants to configure each level separately. There are ways we could have made this work. One way would have been to just configure the locality-picking policy and then leave the endpoint-picking policy unset. Another way would have been to configure the same policy in both places, and have the two pieces work together to do the right thing. But neither of these seemed as flexible as the approach we wound up using: because the hierarchical structure is encoded only in the per-policy config for policies that support it, the top-level config can deal only with the top-level LB policy, and any delegation to child policies that may happen inside of the top-level policy is hidden from the rest of the system.

The other time that I considered making this configurable in EDS was for the case I described in #7454. We have a use-case where we have endpoints in two different priorities and we need to use a different endpoint-picking LB policy for each one. We had originally thought about addressing that by allowing per-priority endpoint-picking policy overrides in EDS, but @htuch suggested that we use the aggregate cluster design instead.

Stepping back a bit, I have to say that it does seem a little strange that we have two different prioritization mechanisms that both basically do the same thing, one in the aggregate cluster design and another in the priority field for localities. In the long run as part of UDPA, I wonder if it would make sense to try to restructure this such that a cluster defines priorities as a top-level concept, and then sets localities and endpoints for each priority. In other words, instead of this:

Aggregate Cluster -> Prioritized Cluster -> Prioritized Locality -> Endpoint

we would have this:

Cluster -> Priority -> Locality -> Endpoint

Then we could set defaults at the Cluster level but also override them as needed at the priority level.
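
To make that concrete, here is a purely hypothetical sketch of what such a resource could look like (none of these field names exist in xDS today; the point is only that priorities become a top-level concept, with defaults at the cluster level and optional per-priority overrides):

cluster:
  name: backend
  lb_policy: ...              # cluster-level default LB policy
  priorities:
  - priority: 0
    localities:
    - locality: { region: us-east1, zone: a }
      endpoints: [...]
  - priority: 1
    lb_policy: ...            # optional per-priority override
    localities:
    - locality: { region: us-west1, zone: b }
      endpoints: [...]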

Anyway, this may be more detail than you are actually interested in, but I hope the context is useful. Please let me know what you think.

@htuch (Member, Author) commented Dec 6, 2019

@snowp didn't the last discussion around EDS-overriding-CDS end up with the aggregate cluster compromise? Was that missing some use case or does custom LB introduce new options that might need overriding in EDS?

@htuch (Member, Author) commented Dec 6, 2019

@markdroth isn't "Cluster -> Priority -> Locality -> Endpoint" what we had before aggregate cluster? :)

@markdroth (Contributor)

@htuch No, before aggregate cluster, we had "Cluster -> Prioritized Locality -> Endpoint". That structure couldn't handle overriding various things on a per-priority basis, which is why we had to add aggregate cluster. What I'm proposing is combining the two places that we're doing prioritization into a single one.

@htuch (Member, Author) commented Dec 6, 2019

Yeah, that makes sense. I think this is going to be one of the more complex aspects of UDPA; right now I'm getting the sense that there are some pretty non-controversial wins we can make in UDPA, for example transport protocol, routing, and the moral equivalent of EGDS (i.e. discovering endpoint groups without any of this implied priority and policy). We're going to have to look at a few other proxies and see what they might need in terms of expressiveness etc. here.

@snowp (Contributor) commented Dec 7, 2019

There seems to be a fundamental question we should probably figure out: how much of the existing load balancing logic should be core vs extensions? In my mind I was imagining a world where everything was just extensions, which is why I was thinking about having an arbitrary config stanza as part of the CLA proto. Imagine something like this on the CLA proto:

named_endpoints:
  foo: ...
  bar: ...
lb_config:
  name: envoy.load_balancers.priority
  config:
    priorities:
    - endpoints: [foo] # priority 0
      inner_lb:
        name: envoy.load_balancers.random
    - endpoints: [bar] # priority 1
      inner_lb:
        name: envoy.load_balancers.least_request

When I said cross-reference in the previous comment, I was referring to the fact that we're naming endpoints and then referring to them by name to define the LB structure in the config. Naming them vs. defining them inline means that we can refer to them multiple times:

lb_config:
  name: envoy.load_balancers.priority
  config:
    priorities:
    - endpoints: [foo, baz] # priority 0
      inner_lb:
        name: envoy.load_balancers.locality
        config:
        - weight: 10
          endpoints: [foo]
          inner_lb:
            name: envoy.load_balancers.random
        - weight: 20
          endpoints: [baz]
          inner_lb:
            name: envoy.load_balancers.random

With this, the configuration stored on the CLA is not just the LB policy, but the entire LB hierarchy. This is a pretty substantial change in how load balancing works in Envoy (the LB impls track the LB structure instead of the cluster), but it would make it possible to split up the LB code and make it easier to pick and choose which LB features you want. It also changes how host changes would be propagated in code, due to the current reliance on the PrioritySet, which would no longer be a core part of the Cluster.

The composable API structure would also make it a lot easier for custom LB algorithms to be used for intermediate steps:

lb_config:
  name: envoy.load_balancers.priority
  config:
    priorities:
    - endpoints: [foo] # priority 0
      inner_lb:
        name: envoy.load_balancers.random
    - endpoints: [bar, baz, bax] # priority 1
      inner_lb:
        name: my.custom.lb
        config:
          inner_lb: envoy.load_balancers.least_request

where the high-level idea is that my.custom.lb is used to select which subset of priority 1 should be targeted, and delegates the selection from that subset to the least_request LB.

This is all going down the path of trying to make everything extensible. It might be that we want to make certain things supported directly instead, which would probably warrant a different API.

@markdroth (Contributor)

@snowp In general, I agree that making everything extensible is the right approach. I think that the built-in LB policies can simply be provided as plugins that are shipped with Envoy and available by default. This is what we do in gRPC for the few LB policies that we provide out of the box.

I think there's also another benefit to what you're proposing here, which is that it actually removes the restriction that exists today where endpoints must be grouped into localities with associated priorities. Instead, the hierarchy would be completely customizable by the user: it could be a single, flat list of endpoints, or it could be a multi-level hierarchy of priority, region, locality, and endpoint. In effect, xDS would no longer be enforcing its own notion of the hierarchy in which endpoints are organized. I think this would add a lot of flexibility.

One thing that may be fairly complex here is to figure out how to handle the priority failover logic that is used in choosing priority levels and in locality weighted load balancing. I don't think that's an insurmountable issue; it's just something that we need to carefully consider when we design the LB policy API. When we get further into the details, I'd be happy to show you how our API in gRPC handles this sort of thing.

The one noteworthy difference I see between what you're proposing and what we do in gRPC is that you would be explicitly setting the targets along with the config for each level of LB policy, whereas in gRPC, we usually wind up specifying those two things separately. However, the design you propose does actually seem a bit more flexible, since it allows more easily overriding the behavior for each individual child of a given node in the LB policy tree. And gRPC can certainly adapt to this approach when we start using this part of UDPA.

@htuch (Member, Author) commented Dec 10, 2019

I think the endpoint cross-refs are basically what we are thinking about in EGDS (see mentions of this in #8400); there is a lot of alignment around this idea.

@snowp (Contributor) commented Dec 10, 2019

> I think there's also another benefit to what you're proposing here, which is that it actually removes the restriction that exists today where endpoints must be grouped into localities with associated priorities. Instead, the hierarchy would be completely customizable by the user: it could be a single, flat list of endpoints, or it could be a multi-level hierarchy of priority, region, locality, and endpoint. In effect, xDS would no longer be enforcing its own notion of the hierarchy in which endpoints are organized. I think this would add a lot of flexibility.

I think this is pretty key to providing a truly extensible LB experience: having the API (and by extension the implementations) be opinionated about the hierarchy makes it hard for custom LBs to efficiently store endpoints in a different format. In Envoy, this would most likely result in these LB extensions managing their own LB state in addition to the PrioritySet, resulting in a lot of wasted space and time spent processing host changes.

Another approach I had in mind for splitting the LB config and endpoints was to use endpoint metadata to provide per-LB-extension information, something like:

named_endpoints:
  foo:
    address: ...
    metadata:
      envoy.lb.priorities:
        priority: 1
      envoy.lb.endpoint_weighting:
        weight: 2
      envoy.lb.locality:
        locality:
          zone: ...

This moves the endpoints out of the dynamic LB config, which would make it very easy to provide partial EDS updates as talked about in #8400: the LB config remains relatively fixed, and endpoints can be added/removed from the endpoint map without having to worry about modifying the arbitrary tree formed by the LB config.

This trades some wire overhead (i.e. potentially a lot of config per endpoint) and possibly harder-to-optimize update code (you have to keep scanning the endpoints for which ones match a priority, locality, etc., vs. knowing the names from the LB config) for reduced coupling between the endpoints and the LB config.

@markdroth (Contributor)

I think the most flexible way of approaching this might be to simply have each LB policy control how it identifies its children. There are some cases where it will be useful to directly configure a policy's children in its LB config, but there are other cases where the policy will need to dynamically determine its children at run-time. That might be done based on a request header, or it might be controlled by some external control plane.

One very flexible way of doing this would be to allow each policy to create its own xDS resource type, which it could query via ADS. For example, let's say that we want a simple hierarchy of the following form:

Cluster -> Locality -> Endpoint

There are no priorities. We want to use simple weighted round-robin picking of the locality and then use the least_request policy for picking the endpoint within the locality. To do this, we can define a new xDS resource type that defines the set of localities in the cluster, which would look something like this:

message ClusterLocalities {
  message Locality {
    // Name of locality.
    string name = 1;
    // Weight.
    uint64 weight = 2;
    // List of EGDS resources for this locality.
    message Egds {
      ConfigSource source = 1;
      string name = 2;
    }
    repeated Egds egds = 3;
  }
  repeated Locality locality = 1;
}

The config message for the top-level LB policy can look something like this:

message LocalityLbConfig {
  // How to fetch the cluster locality info.
  ConfigSource cluster_locality_source = 1;
  string cluster_locality_name = 2;
  // The child policy to create for each locality.
  envoy.api.v2.LoadBalancingPolicy child_policy = 3;
}

The top-level LB policy will fetch the ClusterLocalities resource specified in the config and create the specified child policy for each locality.

The child policy might have a config message that looks like this:

message EndpointLeastRequest {
  // List of EGDS resources for this locality.
  message Egds {
    ConfigSource source = 1;
    string name = 2;
  }
  repeated Egds egds = 1;
}

The child policy will fetch the EGDS resources and perform the least_request algorithm across the resulting endpoints to pick the endpoint for each request.

An actual configuration might look like this:

lb_config:
  name: envoy.load_balancers.locality_round_robin
  config:
    cluster_locality_source:
      ads: {}
    cluster_locality_name: "my_locality_group"
    child_policy:
      name: envoy.load_balancers.least_request

Note that in this case, the config for the top-level LB policy is fully specified in the LB config; it specifies which locality group to use directly in the config. However, the child policy specifies the name only, not the corresponding config, because the top-level policy will construct the config for the child policy based on the Locality data it obtains from the management server. (There are obviously other ways we could have structured this if we wanted, but this is just an example.)

This approach provides flexible decoupling of endpoint data from the LB config while avoiding the overhead of tagging everything as metadata on the endpoints, which both reduces the wire overhead and provides more flexibility for non-leaf policies that don't know or care about endpoints.

As a side note, this also makes me think that as part of UDPA, we should consider splitting up some parts of what's currently in RDS and moving it to this new LB policy mechanism instead. For example, it would be trivial to express things like cluster_header or weighted_clusters as an LB policy.
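
As a rough illustration (hypothetical policy names, reusing the lb_config/inner_lb shape from the examples above; nothing here exists today), a weighted_clusters-style split could be expressed as just another composable policy:

lb_config:
  name: envoy.load_balancers.weighted   # hypothetical traffic-splitting policy
  config:
    targets:
    - weight: 80
      endpoints: [foo]
      inner_lb:
        name: envoy.load_balancers.round_robin
    - weight: 20
      endpoints: [bar]
      inner_lb:
        name: envoy.load_balancers.round_robin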

@htuch (Member, Author) commented Dec 13, 2019

@markdroth it looks like you have a lot of great ideas for where to take the API. Do you and @snowp want to put together a strawman (similar to what we did for routing) for UDPA-WG? That way we can explore this in a doc rather than GH threads, which are hard to follow, and then feed that next year into some concrete protos.

I would aim for the v3 API (which is where this present issue is likely to intersect) to be a bit more constrained. EGDS makes sense and we need it sooner rather than later, but we should probably descope as much as possible to ensure we can deliver in that time frame.

@yxue (Contributor) commented Dec 13, 2019

/cc @yxue

@jmarantz (Contributor) commented Apr 9, 2020

/cc @jmarantz

@gupta-deeptig (Contributor)

I was looking at how to implement custom load-balancing algorithms in Envoy, and based on this thread, this needs to be a cluster-based extension (similar to what was done for Redis)? The EGDS and other ideas here are not merged, right? Is there any reference on what we can use, and what the restrictions are, for a custom algorithm today?

lizan pushed a commit that referenced this issue Aug 14, 2021
Enables `LOAD_BALANCING_POLICY_CONFIG` enum value in `LbPolicy` and supports typed load balancers specified in `load_balancing_policy`. Continues work done by Charlie Getzen <charliegetzenlc@gmail.com> in #15827.

Custom load balancers specified by `load_balancing_policy` are created as implementations of `ThreadAwareLoadBalancer`. Thread-local load balancers can be implemented as thread-aware load balancers that contain no logic at the thread-aware level, i.e. the purpose of the thread-aware LB is solely to contain the factory used to instantiate the thread-local LBs. (In the future it might be appropriate to provide a construct that abstracts away thread-aware aspects of `ThreadAwareLoadBalancer` for LBs that don't need to be thread-aware.)

A cluster that uses `LOAD_BALANCING_POLICY_CONFIG` may not also set a subset LB configuration. If the load balancer type makes use of subsetting, it should include a subset configuration in its own configuration message. Future work on load balancing extensions should include moving the subset LB to use load balancing extensions.

Similarly, a cluster that uses `LOAD_BALANCING_POLICY_CONFIG` may not set the `CommonLbConfig`, and it is not passed into load balancer creation (mostly owing to its dubious applicability as a top level configuration message to hierarchical load balancing policy). If the load balancer type makes use of the `CommonLbConfig`, it should include a `CommonLbConfig` in the configuration message for the load balancing policy.
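
For illustration, a cluster using this might look roughly like the following sketch (the policy name and config type below are placeholders, and the exact layout of the Policy message differs across API versions):

clusters:
- name: example_cluster
  connect_timeout: 1s
  type: STRICT_DNS
  load_assignment:
    cluster_name: example_cluster
    endpoints:
    - lb_endpoints:
      - endpoint:
          address: { socket_address: { address: backend.example.com, port_value: 8080 } }
  lb_policy: LOAD_BALANCING_POLICY_CONFIG
  load_balancing_policy:
    policies:
    - typed_extension_config:
        name: example.custom_lb                                  # placeholder extension name
        typed_config:
          "@type": type.googleapis.com/example.CustomLbConfig    # placeholder config type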

Considerations for migration of existing load balancers:

- pieces of the `ThreadAwareLoadBalancerBase` implementation are specific to the built-in hashing load balancers and should be moved into a base class specifically for hashing load balancers. As it stands, custom load balancing policies are required to implement a `createLoadBalancer()` method even if the architecture of the LB policy does not require a hashing load balancer. I think we would also benefit from disentangling `ThreadAwareLoadBalancerBase` from `LoadBalancerBase`, as the former never actually does any host picking.

- as we convert existing thread-local load balancers to thread-aware load balancers, new local LBs will be re-created upon membership changes. We should provide a mechanism allowing load balancers to control whether this rebuild should occur, e.g. a callback that calls `create()` for thread-aware LBs by default, which can be overridden to do nothing for thread-local LBs.

Risk Level: low
Testing: brought up a cluster with a custom load balancer specified by `load_balancing_policy`; new unit tests included
Docs Changes: n/a
Release Notes: Enable load balancing policy extensions
Platform Specific Features: n/a
Fixes #5598

Signed-off-by: Eugene Chan <eugenechan@google.com>
@abhiroop93

@markdroth @htuch @snowp
By my understanding, it is possible to use hierarchical load balancing, but I have been trying to set that up and there is no sample documentation for it. How do I go about setting up a hierarchical policy that would, e.g.:

  1. First pick the closest endpoints (by zone), and
  2. Then select which endpoint to route to within that zone?

Can you share a sample doc/code snippet for this?

@markdroth (Contributor)

We don't have any "parent" LB policies today, but we do now have the structure in place such that it should not be too hard for you to write such a policy.

What is the exact behavior you want here for selecting the zone? If you always want the closest zone for any given client, why not just have the control plane send only the endpoints in that zone to the client, so that no matter what it picks, it gets the right thing?
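
That said, if someone did write such a parent policy, its configuration would presumably follow the same nesting shape as the hypothetical closest_locality example earlier in this thread, something like the sketch below (the parent policy and its config type are placeholders and do not exist today; the child policy name follows Envoy's extension naming and may vary by version):

load_balancing_policy:
  policies:
  - typed_extension_config:
      name: example.closest_zone                                   # placeholder parent policy
      typed_config:
        "@type": type.googleapis.com/example.ClosestZoneLbConfig   # placeholder config type
        # child policy used to pick an endpoint within the chosen zone
        child_policy:
          policies:
          - typed_extension_config:
              name: envoy.load_balancing_policies.least_request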
