
Proposal: new LB extension envoy.load_balancing_policies.per_worker_subset (per-Envoy-worker subsetting, no upstream/xDS coordination) #45682

Description

@sudheerv

I'd like to propose a new load balancing policy extension, envoy.load_balancing_policies.per_worker_subset, that gives each Envoy worker its own subset of upstream hosts and applies round-robin (or least-request, etc.) selection within only that subset. Filing as a design proposal per EXTENSION_POLICY.md; a complete PR (~1600 LOC including unit tests, proto, build wiring, changelog) is implemented and held back pending design ack.

We are running this custom subset extension in production and have observed a dramatic reduction in upstream connections (up to 25x). Let me know if you would recommend making this a contrib extension instead of a core extension; I'm happy to be a user sponsor.

Motivation

For Envoys fanning out to large upstream clusters (hundreds to thousands of hosts) where the upstream enforces a short server-side HTTP/1 keepalive timeout, stock per-cluster LBs (RoundRobin, LeastRequest) eventually open one connection from every worker to every upstream host. The result on real workloads is a per-Envoy upstream conn count in the hundreds of thousands (on a machine with 128 vcores), dominated by upstream-initiated closes (the backend reaping idle conns inside its keepalive window) and a high lifetime conn-create rate. The conn pool is mostly idle — neither the per-conn QPS nor the request/keepalive ratio supports keeping that many sockets warm. Operational graphs will be attached as a follow-up comment.

The root cause is structural: Envoy's per-worker connection-pool model means every worker independently maintains pools to every host, even when the per-worker QPS makes that pool size unjustifiable. Capping per-worker fanout to a subset of K hosts collapses the conn count to W × K and pushes each conn into the active-use band of the upstream's keepalive window.
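
To make the W × K arithmetic concrete, here is a small sketch with purely hypothetical numbers (1000 hosts, 128 workers, one connection per worker/host pair). It assumes K = ceil(N / W); real pools may hold more than one connection per host, so treat the result as an order-of-magnitude estimate rather than a measurement.

```python
import math


def conn_counts(num_hosts: int, num_workers: int, conns_per_pair: int = 1):
    """Return (full_mesh, subsetted, k) upstream connection counts for one Envoy.

    conns_per_pair is an illustrative knob, not an Envoy setting.
    """
    k = math.ceil(num_hosts / num_workers)  # K = ceil(N / W)
    full_mesh = num_workers * num_hosts * conns_per_pair  # every worker to every host
    subsetted = num_workers * k * conns_per_pair          # every worker to K hosts
    return full_mesh, subsetted, k


full, sub, k = conn_counts(num_hosts=1000, num_workers=128)
print(f"K={k}: full mesh {full} conns, per-worker subset {sub} conns")
```

With these inputs the full mesh is 128,000 connections against 1,024 with subsetting. The hypothetical ratio here is not the production figure quoted above.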

Design overview

The extension is two-axis — partitioning strategy × within-subset host-selection strategy:

Partitioning strategies

  1. EQUAL_PARTITIONS (default): deterministic, non-overlapping slice of K = ceil(N/W) hosts from a stable address-sorted ordering. Partition shape is perfectly even. Starting offset is rotated per-Envoy via a hash of the bootstrap node id so the "which worker IDs are pinned to which hosts" mapping varies across the fleet; within a single Envoy, disjointness is preserved.
  2. RANDOM_PARTITIONS: each worker independently samples subset_size hosts uniformly at random from healthy hosts. Probabilistic coverage; re-samples on each membership update.
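
The two partitioning strategies can be sketched roughly as follows. This is an illustration in Python, not the C++ from the pending PR. The function names, the SHA-256 hash of the node id, the lexicographic sort standing in for address ordering, and the choice to clip slices at the end of the list instead of wrapping around are all assumptions of the sketch.

```python
import hashlib
import math
import random


def envoy_offset(node_id: str, num_hosts: int) -> int:
    """Stable per-Envoy rotation derived from the node id (hash choice is an assumption)."""
    if num_hosts == 0:
        return 0
    digest = hashlib.sha256(node_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_hosts


def equal_partition(hosts, num_workers: int, worker_id: int, node_id: str):
    """Return the disjoint slice of at most K = ceil(N / W) hosts for one worker."""
    ordered = sorted(hosts)  # stand-in for the stable address-sorted ordering
    n = len(ordered)
    if n == 0:
        return []
    k = math.ceil(n / num_workers)
    offset = envoy_offset(node_id, n)
    rotated = ordered[offset:] + ordered[:offset]  # per-Envoy rotation
    # Slicing clips at the end of the list (no wrap-around), so slices stay
    # disjoint; trailing workers may get a short or empty slice (assumption).
    return rotated[worker_id * k:(worker_id + 1) * k]


def random_partition(healthy_hosts, subset_size: int, rng: random.Random):
    """Sample up to subset_size healthy hosts uniformly, without replacement."""
    size = min(subset_size, len(healthy_hosts))  # oversized subset_size is clamped
    return rng.sample(list(healthy_hosts), size)


hosts = [f"10.0.0.{i}:8080" for i in range(10)]
slices = [equal_partition(hosts, 4, w, "node-a") for w in range(4)]
print([len(s) for s in slices])
print(random_partition(hosts, 3, random.Random(7)))
```

In this sketch a different node id changes which hosts each worker id receives, while the slices within one Envoy still cover every host exactly once.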

Host-selection strategies (orthogonal to partitioning)

  1. SIMPLE_ROUND_ROBIN: inline next_index_ % K round-robin over the worker's subset. No weight handling, no slow-start.
  2. ENVOY_ROUND_ROBIN: delegate to the stock RoundRobinLoadBalancer constructed against the worker's subset via a synthetic PrioritySet. Honors per-host weights (EDF), slow_start_config, and locality-weighted LB.
  3. ENVOY_P2C: delegate to the stock LeastRequestLoadBalancer (P2C on rq_active) over the same synthetic PrioritySet.
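
For intuition, here is a minimal Python sketch of SIMPLE_ROUND_ROBIN and of the power-of-two-choices idea behind ENVOY_P2C. The class and function names are hypothetical, and the P2C helper is a simplified stand-in (two picks with replacement, no weights), not the stock LeastRequestLoadBalancer that the extension delegates to.

```python
import random


class SimpleRoundRobin:
    """Inline next_index % K rotation over a worker's subset (no weights, no slow-start)."""

    def __init__(self, subset):
        self.subset = list(subset)
        self.next_index = 0

    def choose_host(self):
        if not self.subset:
            return None  # empty subset: nothing to return
        host = self.subset[self.next_index % len(self.subset)]
        self.next_index += 1
        return host


def p2c_choose(subset, active_requests, rng: random.Random):
    """Pick two random hosts and keep the one with fewer active requests."""
    if not subset:
        return None
    a = rng.choice(subset)
    b = rng.choice(subset)  # sampled with replacement (assumption of this sketch)
    return a if active_requests.get(a, 0) <= active_requests.get(b, 0) else b


rr = SimpleRoundRobin(["h1", "h2", "h3"])
print([rr.choose_host() for _ in range(5)])  # h1, h2, h3, h1, h2
print(p2c_choose(["h1", "h2", "h3"], {"h1": 4, "h2": 0, "h3": 2}, random.Random(0)))
```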

Per-worker fallback

Unlike stock LBs, which apply the healthy-fraction check cluster-wide, this extension applies the fallback decision per-worker, scoped to each worker's own slice. A brief deploy that flips 50% of hosts unhealthy no longer trips every worker on the same Envoy simultaneously; only the workers whose K-host slice happens to overlap a heavily-unhealthy band fall back. The other workers keep their stable healthy-only subset. This eliminates the synchronized cluster-wide conn-pool churn that a cluster-wide threshold would cause.

The fallback_threshold field is a percentage in [0, 100], defaults to 50, and is independent of common_lb_config.healthy_panic_threshold (which remains scoped to the cluster).
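
The per-worker decision can be illustrated with a short sketch. The helper names are hypothetical, and the strict less-than comparison against fallback_threshold is an assumption; the proposal text does not pin down the boundary behavior or what a worker routes to once it is in fallback.

```python
def slice_healthy_percent(slice_hosts, healthy) -> float:
    """Percentage of a worker's slice that is currently healthy."""
    if not slice_hosts:
        return 0.0
    healthy_count = sum(1 for h in slice_hosts if h in healthy)
    return 100.0 * healthy_count / len(slice_hosts)


def worker_in_fallback(slice_hosts, healthy, fallback_threshold: float = 50) -> bool:
    """True when this worker's own slice drops below the threshold (strict '<' is assumed)."""
    return slice_healthy_percent(slice_hosts, healthy) < fallback_threshold


# Four workers with K = 2; a deploy marks the first half of the hosts unhealthy.
worker_slices = [["h0", "h1"], ["h2", "h3"], ["h4", "h5"], ["h6", "h7"]]
healthy_now = {"h4", "h5", "h6", "h7"}
print([worker_in_fallback(s, healthy_now) for s in worker_slices])
# [True, True, False, False]
```

Only the two workers whose slices overlap the unhealthy band fall back; the other two keep their healthy-only subsets.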

Stats

Follows the cluster.<name>.lb_<lbname>_<event> convention (matching lb_subsets_* on the existing subset LB):

  • lb_per_worker_subset_rebuilds
  • lb_per_worker_subset_slice_fallback
  • lb_per_worker_subset_slice_empty_healthy
  • lb_per_worker_subset_empty_returns

Relationship to existing reserved random_subsetting proto

There's a reserved-but-unimplemented envoy.extensions.load_balancing_policies.random_subsetting.v3 package in the tree (marked [#not-implemented-hide:]). I want to be explicit about why I'm proposing a separate extension rather than implementing that proto:

random_subsetting is a per-client model. Its semantics require the cluster's lb_subset_config (or equivalent xDS-side configuration of subset hash / metadata keys) to be set so every client deterministically picks the same subset_size endpoints via rendezvous hashing. The decision is shaped by upstream cluster xDS config.

per_worker_subset is, by design, a no-touch Envoy-local decision. Each Envoy worker derives its subset purely from (W, worker_id, envoy_seed, partitioning_strategy) — no metadata keys, no client identity, no upstream cluster configuration changes, no EDS-side coordination. Switching a cluster from RoundRobin to per_worker_subset requires only flipping load_balancing_policy on the cluster; the upstream service owner is not involved.

This makes the extension operationally cheap to roll out on existing clusters where the upstream owner cannot or will not change the EDS payload, and lets it function as a connection-pool-sizing tool independent of any client-identity hashing scheme.

The two are therefore complementary rather than overlapping: random_subsetting solves "every client must converge on the same K endpoints"; per_worker_subset solves "every Envoy worker must cap its outbound conn fanout to K endpoints". Open to discussion if the maintainers see a path to merging them under a shared proto with mode discriminators, but the semantics don't collapse into a single mode.

Alternatives considered

| Alternative | Why insufficient |
| --- | --- |
| Existing subset LB (envoy.load_balancing_policies.subset) | Requires upstream EDS metadata + matching client metadata. Doesn't solve the conn-fanout problem without upstream cooperation, and the unit is the subset definition, not the worker. |
| Tighter idle_timeout on Envoy's HTTP/1 connection pool | Helps reduce idle conns but doesn't address fanout breadth; you still open conn-to-every-host before it idles. |
| random_subsetting (when implemented) | Per-client semantics; needs xDS-side config; doesn't help the conn-pool-sizing case where the same client (Envoy) needs to cap its OWN per-worker fanout. |
| Manually sharding the cluster into N smaller clusters | Operationally expensive; loses per-host-weight semantics; doesn't work when the upstream is owned by another team. |

Risks and trade-offs

  • Priority-unaware. Routes only to priority 0. Clusters that need true priority-level failover (spillover from priority 0 to priority 1 and beyond) should keep using a different LB.
  • Locality-unaware at the partition step. Doesn't honor common_lb_config.zone_aware_lb_config at the partition level (the inner stock LB still honors its own locality config over the subset for ENVOY_ROUND_ROBIN / ENVOY_P2C).
  • EQUAL_PARTITIONS slice membership shifts when N changes (e.g., EDS adds/removes hosts). Health changes alone do not shift slice positions; fallback is handled within each worker's slice.
  • Subset isn't sticky across Envoy restarts. Workers get a fresh partition on each restart.
  • K resolution at config-load time. Total worker count W is resolved from bootstrap.concurrency (or hardware_concurrency() fallback). For cgroup-limited environments (Kubernetes), --concurrency must be set explicitly or K will be sized off host CPU rather than cgroup CPU.

Quality and operational fit

  • All four extension files (proto + LB + factory + tests) are written in a single signed-off commit. Stats follow upstream convention. CODEOWNERS, extensions_metadata.yaml, extensions_build_config.bzl, api/versioning/BUILD, and changelog are all updated.
  • Unit tests cover: both partitioning strategies; all three host-selection strategies; single/multi-worker cases; N < W degenerate case; envoy_seed rotation; per-worker fallback for both strategies; fallback_threshold=0 edge case; oversized subset_size; empty cluster; counter increment for fallback path.
  • security_posture: robust_to_untrusted_downstream_and_upstream. status: alpha initially.

Asks for the maintainers

  1. Sponsor: I'm looking for an LB-area maintainer to sponsor per EXTENSION_POLICY.md. Happy to do a synchronous walk-through if helpful.
  2. Second CODEOWNERS reviewer: my draft CODEOWNERS line currently has @sudheerv @UNOWNED; need a real second reviewer before merge.
  3. Naming: the extension is currently envoy.load_balancing_policies.per_worker_subset. Alternatives considered: worker_local_subset, worker_partition. Happy to rename.
  4. random_subsetting overlap: input on whether the proto packages should be co-located, share helper messages, or stay separate. My current view is "separate" given the semantic difference (Envoy-local vs xDS-coordinated), but I'm open.
  5. api/STYLE.md adherence: the proto follows the convention I see in random/v3 and least_request/v3. If there are conventions I missed, please flag.

PR status

A complete implementation is ready locally (~1600 LOC including unit tests). I'm holding back the PR until this design is ack'd per EXTENSION_POLICY.md. Once a sponsor is identified and naming/scope is agreed, I will open the PR and reference this issue.

Relevant Links:

  • EXTENSION_POLICY.md (this proposal follows the procedure documented there)
  • Existing random_subsetting reserved proto for context: api/envoy/extensions/load_balancing_policies/random_subsetting/v3/random_subsetting.proto
  • Existing subset LB for the closest existing analog: source/extensions/load_balancing_policies/subset/
