Title: Proposal: new LB extension envoy.load_balancing_policies.per_worker_subset (per-Envoy-worker subsetting, no upstream/xDS coordination)
Description:
I'd like to propose a new load balancing policy extension, envoy.load_balancing_policies.per_worker_subset, that gives each Envoy worker its own subset of upstream hosts and round-robins (or least-requests, etc.) within only that subset. Filing as a design proposal per EXTENSION_POLICY.md; a complete PR (~1600 LOC including unit tests, proto, build wiring, changelog) is implemented and held back pending design ack.
We run this extension in production and have observed a dramatic reduction in upstream connections (up to 25x). Let me know if you would recommend making this a contrib extension instead of a core extension; I'm happy to be a user sponsor.
Motivation
For Envoys fanning out to large upstream clusters (hundreds to thousands of hosts) where the upstream enforces a short server-side HTTP/1 keepalive timeout, stock per-cluster LBs (RoundRobin, LeastRequest) eventually open one connection from every worker to every upstream host. The result on real workloads is a per-Envoy upstream conn count in the hundreds of thousands (on a machine with 128 vcores), dominated by upstream-initiated closes (the backend reaping idle conns inside its keepalive window) and a high lifetime conn-create rate. The conn pool is mostly idle — neither the per-conn QPS nor the request/keepalive ratio supports keeping that many sockets warm. Operational graphs will be attached as a follow-up comment.
The root cause is structural: Envoy's per-worker connection-pool model means every worker independently maintains pools to every host, even when the per-worker QPS makes that pool size unjustifiable. Capping per-worker fanout to a subset of K hosts collapses the conn count to W × K and pushes each conn into the active-use band of the upstream's keepalive window.
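As a rough illustration of the arithmetic above, the sketch below compares the worst-case full-mesh connection count (W × N, one connection from every worker to every host) with the capped count (W × K). The inputs W = 128 and N = 2000 are made-up values, not measurements from our fleet. The sketch assumes one connection per (worker, host) pair and takes K = ceil(N/W) from the EQUAL_PARTITIONS definition in this proposal.

```python
import math

def conn_counts(num_workers: int, num_hosts: int) -> tuple[int, int]:
    """Return (full_mesh, subsetted) steady-state connection counts.

    Assumes one connection per (worker, host) pair and K = ceil(N / W).
    """
    k = math.ceil(num_hosts / num_workers)
    full_mesh = num_workers * num_hosts   # every worker connects to every host
    subsetted = num_workers * k           # every worker connects to K hosts only
    return full_mesh, subsetted

full, capped = conn_counts(num_workers=128, num_hosts=2000)
print(full, capped)  # 256000 2048
```

With RANDOM_PARTITIONS, K would be the configured subset_size rather than ceil(N/W), so the capped count would be W × subset_size.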
Design overview
The extension is two-axis — partitioning strategy × within-subset host-selection strategy:
Partitioning strategies
- EQUAL_PARTITIONS (default): deterministic, non-overlapping slice of K = ceil(N/W) hosts from a stable address-sorted ordering. Partition shape is perfectly even. Starting offset is rotated per-Envoy via a hash of the bootstrap node id so the "which worker IDs are pinned to which hosts" mapping varies across the fleet; within a single Envoy, disjointness is preserved.
- RANDOM_PARTITIONS: each worker independently samples subset_size hosts uniformly at random from healthy hosts. Probabilistic coverage; re-samples on each membership update.
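The two partitioning strategies can be sketched roughly as follows. This is an illustration, not the code in the held-back PR: the function names, the use of SHA-256 for the node-id hash, truncating the last slice when N is not a multiple of W, and clamping an oversized subset_size are all assumptions made for the example.

```python
import hashlib
import math
import random

def equal_partition(hosts, num_workers, worker_id, node_id):
    """Slice for one worker under an EQUAL_PARTITIONS-style scheme (sketch)."""
    ordered = sorted(hosts)                      # stable address-sorted ordering
    n = len(ordered)
    if n == 0:
        return []
    k = math.ceil(n / num_workers)               # K = ceil(N / W)
    # Per-Envoy rotation derived from the node id (SHA-256 is an assumption).
    digest = hashlib.sha256(node_id.encode()).digest()
    offset = int.from_bytes(digest[:8], "big") % n
    start = worker_id * k
    end = min(start + k, n)                      # assumption: truncate, do not wrap
    # Rotating by a fixed offset is a bijection, so slices stay disjoint.
    return [ordered[(offset + i) % n] for i in range(start, end)]

def random_partition(healthy_hosts, subset_size, rng):
    """Uniform sample for one worker under a RANDOM_PARTITIONS-style scheme."""
    size = min(subset_size, len(healthy_hosts))  # assumption: clamp oversized subset_size
    return rng.sample(sorted(healthy_hosts), size)

hosts = [f"10.0.0.{i}:8080" for i in range(10)]
slices = [equal_partition(hosts, 4, w, "envoy-node-a") for w in range(4)]
print([len(s) for s in slices])  # [3, 3, 3, 1]
```

In this sketch every host lands in exactly one worker's slice within a single Envoy, while a different node id shifts which hosts each worker ID gets.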
Host-selection strategies (orthogonal to partitioning)
- SIMPLE_ROUND_ROBIN: inline next_index_ % K round-robin over the worker's subset. No weight handling, no slow-start.
- ENVOY_ROUND_ROBIN: delegate to the stock RoundRobinLoadBalancer constructed against the worker's subset via a synthetic PrioritySet. Honors per-host weights (EDF), slow_start_config, and locality-weighted LB.
- ENVOY_P2C: delegate to the stock LeastRequestLoadBalancer (P2C on rq_active) over the same synthetic PrioritySet.
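For reference, the SIMPLE_ROUND_ROBIN behaviour described above amounts to something like the following minimal sketch. The class and method names are hypothetical and chosen for the example; returning None for an empty subset is also an assumption.

```python
class SimpleRoundRobin:
    """Unweighted round-robin over one worker's subset (illustrative only)."""

    def __init__(self, subset):
        self._subset = list(subset)
        self._next_index = 0

    def choose_host(self):
        if not self._subset:
            return None  # assumption: an empty subset yields no host
        # next_index % K selects the next host in the worker's subset.
        host = self._subset[self._next_index % len(self._subset)]
        self._next_index += 1
        return host

lb = SimpleRoundRobin(["a:80", "b:80", "c:80"])
print([lb.choose_host() for _ in range(5)])
# ['a:80', 'b:80', 'c:80', 'a:80', 'b:80']
```

Because each worker owns its own index, no cross-worker synchronization is needed; the two ENVOY_* strategies instead hand the same subset to the stock load balancers.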
Per-worker fallback
Unlike stock LBs, which apply the healthy-fraction check cluster-wide, this extension applies the fallback decision per-worker, scoped to each worker's own slice. A brief deploy that flips 50% of hosts unhealthy no longer trips every worker on the same Envoy simultaneously; only the workers whose K-host slice happens to overlap a heavily-unhealthy band fall back. The other workers keep their stable healthy-only subset. This eliminates the synchronized cluster-wide conn-pool churn that a cluster-wide threshold would cause.
The fallback_threshold field is a percentage in [0, 100], defaults to 50, and is independent of common_lb_config.healthy_panic_threshold (which remains scoped to the cluster).
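The per-worker decision can be sketched as below. Only the decision itself is shown. The function name, the strict less-than comparison, and the handling of an empty slice are assumptions made for the example, not statements about the implementation.

```python
def slice_should_fall_back(slice_hosts, healthy_hosts, fallback_threshold=50):
    """Decide fallback for one worker from the health of its own slice only."""
    if not slice_hosts:
        return False  # assumption: an empty slice has nothing to fall back from
    healthy_in_slice = sum(1 for h in slice_hosts if h in healthy_hosts)
    healthy_percent = 100.0 * healthy_in_slice / len(slice_hosts)
    # Assumption: fall back only when strictly below the threshold.
    return healthy_percent < fallback_threshold

# 8 hosts, 4 of them unhealthy (50% healthy cluster-wide).
healthy = {"h3", "h5", "h6", "h7"}
worker0 = ["h0", "h1", "h2", "h3"]   # 25% healthy in this slice
worker1 = ["h4", "h5", "h6", "h7"]   # 75% healthy in this slice
print(slice_should_fall_back(worker0, healthy))  # True
print(slice_should_fall_back(worker1, healthy))  # False
```

Under these assumptions a fallback_threshold of 0 never triggers fallback, since no percentage is below zero.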
Stats
Follows the cluster.<name>.lb_<lbname>_<event> convention (matching lb_subsets_* on the existing subset LB):
lb_per_worker_subset_rebuilds
lb_per_worker_subset_slice_fallback
lb_per_worker_subset_slice_empty_healthy
lb_per_worker_subset_empty_returns
Relationship to existing reserved random_subsetting proto
There's a reserved-but-unimplemented envoy.extensions.load_balancing_policies.random_subsetting.v3 package in the tree ([#not-implemented-hide:]). I want to be explicit about why I'm proposing a separate extension rather than implementing that proto:
random_subsetting is a per-client model. Its semantics require the cluster's lb_subset_config (or equivalent xDS-side configuration of subset hash / metadata keys) to be set so every client deterministically picks the same subset_size endpoints via rendezvous hashing. The decision is shaped by upstream cluster xDS config.
per_worker_subset is, by design, a no-touch Envoy-local decision. Each Envoy worker derives its subset purely from (W, worker_id, envoy_seed, partitioning_strategy) — no metadata keys, no client identity, no upstream cluster configuration changes, no EDS-side coordination. Switching a cluster from RoundRobin to per_worker_subset requires only flipping load_balancing_policy on the cluster; the upstream service owner is not involved.
This makes the extension operationally cheap to roll out on existing clusters where the upstream owner cannot or will not change the EDS payload, and lets it function as a connection-pool-sizing tool independent of any client-identity hashing scheme.
The two are therefore complementary rather than overlapping: random_subsetting solves "every client must converge on the same K endpoints"; per_worker_subset solves "every Envoy worker must cap its outbound conn fanout to K endpoints". Open to discussion if the maintainers see a path to merging them under a shared proto with mode discriminators, but the semantics don't collapse into a single mode.
Alternatives considered
| Alternative | Why insufficient |
| --- | --- |
| Existing subset LB (envoy.load_balancing_policies.subset) | Requires upstream EDS metadata + matching client metadata. Doesn't solve the conn-fanout problem without upstream cooperation, and the unit is the subset definition, not the worker. |
| Tighter idle_timeout on Envoy's HTTP/1 connection pool | Helps reduce idle conns but doesn't address fanout breadth; you still open conn-to-every-host before it idles. |
| random_subsetting (when implemented) | Per-client semantics; needs xDS-side config; doesn't help the conn-pool-sizing case where the same client (Envoy) needs to cap its OWN per-worker fanout. |
| Manually sharding the cluster into N smaller clusters | Operationally expensive; loses per-host-weight semantics; doesn't work when the upstream is owned by another team. |
Risks and trade-offs
- Priority-unaware. Routes only to priority 0. Clusters that need true priority routing (DEFAULT vs HIGH spillover) should keep using a different LB.
- Locality-unaware at the partition step. Doesn't honor common_lb_config.zone_aware_lb_config at the partition level (the inner stock LB still honors its own locality config over the subset for ENVOY_ROUND_ROBIN / ENVOY_P2C).
- EQUAL_PARTITIONS slice membership shifts when N changes (e.g., EDS adds/removes hosts). Health changes alone do not shift slice positions; fallback is handled within each worker's own slice.
- Subset isn't sticky across Envoy restarts. Workers get a fresh partition on each restart.
- K resolution at config-load time. Total worker count W is resolved from bootstrap.concurrency (or hardware_concurrency() fallback). For cgroup-limited environments (Kubernetes), --concurrency must be set explicitly or K will be sized off host CPU rather than cgroup CPU.
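To make the EQUAL_PARTITIONS membership-shift trade-off above concrete, here is a simplified sketch of how a single host addition can move slice boundaries. It ignores the per-Envoy offset rotation and uses made-up host names, so it only illustrates the effect of re-slicing an address-sorted list with K = ceil(N/W), not the actual implementation.

```python
import math

def slices(hosts, num_workers):
    """Contiguous K-sized slices of the sorted host list (no offset rotation)."""
    ordered = sorted(hosts)
    k = math.ceil(len(ordered) / num_workers)
    return [ordered[w * k:(w + 1) * k] for w in range(num_workers)]

before = slices([f"host-{i:02d}" for i in range(1, 8)], 4)   # host-01..host-07
after = slices([f"host-{i:02d}" for i in range(0, 8)], 4)    # host-00 added
changed = sum(1 for b, a in zip(before, after) if b != a)
print(changed)  # 4
```

In this example the new host sorts first, so every worker's slice changes. A host that sorts last would only have touched the final slice here, so the amount of churn depends on where the change lands in the ordering.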
Quality and operational fit
- All four extension files (proto + LB + factory + tests) are written, single commit, signed off. Stats follow upstream convention. CODEOWNERS, extensions_metadata.yaml, extensions_build_config.bzl, api/versioning/BUILD, and changelog are all updated.
- Unit tests cover: both partitioning strategies; all three host-selection strategies; single/multi-worker cases; N < W degenerate case; envoy_seed rotation; per-worker fallback for both strategies; fallback_threshold=0 edge case; oversized subset_size; empty cluster; counter increment for fallback path.
- security_posture: robust_to_untrusted_downstream_and_upstream. status: alpha initially.
Asks for the maintainers
- Sponsor: I'm looking for an LB-area maintainer to sponsor per EXTENSION_POLICY.md. Happy to do a synchronous walk-through if helpful.
- Second CODEOWNERS reviewer: my draft CODEOWNERS line currently has @sudheerv @UNOWNED; need a real second reviewer before merge.
- Naming: the extension is currently envoy.load_balancing_policies.per_worker_subset. Alternatives considered: worker_local_subset, worker_partition. Happy to rename.
- random_subsetting overlap: input on whether the proto packages should be co-located, share helper messages, or stay separate. My current view is "separate" given the semantic difference (Envoy-local vs xDS-coordinated), but I'm open.
- api/STYLE.md adherence: the proto follows the convention I see in random/v3 and least_request/v3. If there are conventions I missed, please flag.
PR status
A complete implementation is ready locally (~1600 LOC including unit tests). I'm holding back the PR until this design is ack'd per EXTENSION_POLICY.md. Once a sponsor is identified and naming/scope is agreed, I will open the PR and reference this issue.
Relevant Links:
- EXTENSION_POLICY.md (this proposal follows the procedure documented there)
- Existing random_subsetting reserved proto for context: api/envoy/extensions/load_balancing_policies/random_subsetting/v3/random_subsetting.proto
- Existing subset LB for the closest existing analog: source/extensions/load_balancing_policies/subset/