upstream: add per-worker subset load balancer by sudheerv · Pull Request #45979 · envoyproxy/envoy

sudheerv · 2026-07-05T01:40:52Z

Design discussed in #45682.

Adds a new load balancing policy (envoy.load_balancing_policies.per_worker_subset) that gives each Envoy worker its own subset of upstream hosts. Motivated by large upstream clusters with short server-side HTTP/1 keepalive timeouts: stock LBs eventually open a connection from every worker to every host, producing tens of thousands of idle keepalive connections that the upstream then reaps. This LB caps per-worker fanout to K hosts so that each connection stays in active use inside the keepalive window. Fleet-wide coverage is preserved across multiple Envoys via a per-Envoy partition rotation seeded from the bootstrap node id.
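
To make the fanout argument concrete, here is a back-of-the-envelope sketch. The fleet numbers are hypothetical, and it assumes exactly one HTTP/1 connection per (worker, host) pair; real connection pools can hold more.

```python
def max_upstream_connections(envoys: int, workers_per_envoy: int, hosts_per_worker: int) -> int:
    """Upper bound on steady-state connections, assuming one connection per (worker, host) pair."""
    return envoys * workers_per_envoy * hosts_per_worker


# Hypothetical fleet: 10 Envoys, 16 workers each, 500 upstream hosts, K = 32.
ENVOYS, WORKERS, HOSTS, K = 10, 16, 500, 32

full_mesh = max_upstream_connections(ENVOYS, WORKERS, HOSTS)  # every worker reaches every host
capped = max_upstream_connections(ENVOYS, WORKERS, K)         # every worker reaches only K hosts

print(f"stock LB upper bound:   {full_mesh}")
print(f"per-worker subset (K):  {capped}")
```

With the same request rate spread over far fewer connections, each connection is more likely to carry traffic before the server-side keepalive timer fires.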

The policy is two-axis:

partitioning_strategy selects how each worker's subset is derived:
- EQUAL_PARTITIONS (default): deterministic non-overlapping slice of ceil(N/W) hosts from address-sorted order; per-Envoy starting offset rotation; subset_size acts as a kill-switch (>= N disables subsetting).
- RANDOM_PARTITIONS: each worker samples subset_size hosts uniformly at random from healthy hosts.
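
A rough Python model of the two partitioning strategies follows. It is a sketch under stated assumptions, not the extension's C++ code: the hash used for `envoy_seed`, the rotation formula (`seed % N`), and lexicographic sorting of address strings are all assumptions made for illustration.

```python
import hashlib
import math
import random


def envoy_seed(node_id: str) -> int:
    # Assumption: any stable hash of the bootstrap node id would do here.
    return int.from_bytes(hashlib.sha256(node_id.encode()).digest()[:8], "big")


def equal_partition(hosts, num_workers, worker_id, seed, subset_size):
    """Non-overlapping slice of ceil(N/W) hosts from sorted order, rotated per Envoy."""
    ordered = sorted(hosts)  # assumption: plain string sort stands in for address order
    n = len(ordered)
    if n == 0 or subset_size >= n:
        return ordered  # kill-switch: subset_size >= N disables subsetting
    offset = seed % n  # assumption: per-Envoy starting offset
    rotated = ordered[offset:] + ordered[:offset]
    size = math.ceil(n / num_workers)
    return rotated[worker_id * size:(worker_id + 1) * size]


def random_partition(healthy_hosts, subset_size, rng):
    """Uniform sample of subset_size hosts from the healthy set."""
    k = min(subset_size, len(healthy_hosts))
    return rng.sample(sorted(healthy_hosts), k)


hosts = [f"10.0.0.{i}" for i in range(10)]
seed = envoy_seed("envoy-a")  # hypothetical node id
slices = [equal_partition(hosts, 4, w, seed, subset_size=4) for w in range(4)]
for w, s in enumerate(slices):
    print(w, s)
```

Note that with ceil(N/W) sizing the last worker can end up with a shorter slice (here 3, 3, 3, 1), and in this model it can even be empty when W is close to N.
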
host_selection_strategy selects how a host is picked within the worker's subset on each chooseHost call:
- SIMPLE_ROUND_ROBIN: inline next_index_ % K round-robin.
- ENVOY_ROUND_ROBIN: stock RoundRobinLoadBalancer over a synthetic PrioritySet; honors per-host weights, slow_start_config, and locality-weighted LB.
- ENVOY_P2C: stock LeastRequestLoadBalancer (P2C on rq_active) over the same synthetic PrioritySet.
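
For intuition, the two simplest selection modes can be modelled in a few lines of Python. The `Host` class and function names are hypothetical; the real ENVOY_* modes delegate to the stock Envoy load balancers, which this sketch does not reproduce (no weights, slow start, or locality handling). The two P2C candidates are drawn independently here, which is an assumption of the sketch.

```python
import random
from dataclasses import dataclass


@dataclass
class Host:
    address: str
    rq_active: int = 0  # in-flight requests, the signal P2C compares


class SimpleRoundRobin:
    """Models the inline next_index_ % K selection over a worker's subset."""

    def __init__(self, subset):
        self.subset = list(subset)
        self.next_index = 0

    def choose_host(self):
        if not self.subset:
            return None
        host = self.subset[self.next_index % len(self.subset)]
        self.next_index += 1
        return host


def p2c_choose(subset, rng):
    """Power of two choices: draw two candidates, keep the one with fewer active requests."""
    if not subset:
        return None
    a, b = rng.choice(subset), rng.choice(subset)
    return a if a.rq_active <= b.rq_active else b


rr = SimpleRoundRobin([Host("a"), Host("b"), Host("c")])
picks = [rr.choose_host().address for _ in range(5)]
print(picks)
```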

Per-worker fallback is computed against each worker's own slice rather than the cluster-wide healthy fraction, eliminating the synchronized "every worker on this Envoy flips at once" connection-pool churn that a cluster-wide threshold would cause. The fallback_threshold field (percent points, [0, 100]) is independent of the cluster's common_lb_config.healthy_panic_threshold.
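
The difference between a per-slice and a cluster-wide check can be illustrated with a small model. This only sketches the trigger; what a worker falls back to is not modelled. The strict `<` comparison (so that a threshold of 0 never triggers fallback) and the helper names are assumptions.

```python
def healthy_percent(slice_hosts, healthy):
    """Percentage of a host list that is currently healthy (0.0 for an empty list)."""
    if not slice_hosts:
        return 0.0
    return 100.0 * sum(1 for h in slice_hosts if h in healthy) / len(slice_hosts)


def in_fallback(slice_hosts, healthy, fallback_threshold):
    # Assumption: fallback triggers when the healthy percentage is strictly below the threshold.
    return healthy_percent(slice_hosts, healthy) < fallback_threshold


# Hypothetical cluster: 8 hosts, 4 workers with 2 hosts each; both of worker 0's hosts are down.
all_hosts = [f"h{i}" for i in range(8)]
worker_slices = [all_hosts[i:i + 2] for i in range(0, 8, 2)]
healthy = set(all_hosts) - {"h0", "h1"}

per_worker = [in_fallback(s, healthy, 50) for s in worker_slices]
cluster_wide = in_fallback(all_hosts, healthy, 50)

print("per-worker decisions:", per_worker)
print("cluster-wide decision:", cluster_wide)
```

In this model only worker 0 changes behaviour, whereas a single cluster-wide decision applies to all four workers at the same moment, which is the synchronized flip described above.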

Extension-owned stats follow the `cluster.<name>.lb_<lbname>_<event>` convention (matching the subset LB's lb_subsets_* prefix): lb_per_worker_subset_rebuilds, lb_per_worker_subset_slice_fallback, lb_per_worker_subset_slice_empty_healthy, lb_per_worker_subset_empty_returns.

Overlap / non-overlap with the reserved
envoy.extensions.load_balancing_policies.random_subsetting.v3 proto:

random_subsetting is a per-client subsetting model. Its semantics
require the cluster's lb_subset_config (or equivalent xDS-side
configuration of the subset hash / metadata keys) to be set so that
every client deterministically picks the same subset_size endpoints
via rendezvous hashing. The decision is shaped by the upstream
cluster's xDS config.
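
For readers unfamiliar with rendezvous (highest-random-weight) hashing, a generic sketch follows. It illustrates the technique only and is not taken from the random_subsetting proto or any implementation of it; the scoring hash and the `client_id` parameter are assumptions.

```python
import hashlib


def hrw_score(client_id: str, host: str) -> int:
    # Assumption: any stable hash of the (client, host) pair can serve as the score.
    digest = hashlib.sha256(f"{client_id}|{host}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def rendezvous_subset(client_id, hosts, subset_size):
    """Keep the subset_size hosts with the highest score for this client identity."""
    ranked = sorted(hosts, key=lambda h: hrw_score(client_id, h), reverse=True)
    return ranked[:subset_size]


hosts = [f"10.0.0.{i}" for i in range(10)]
subset = rendezvous_subset("client-1", hosts, 3)
print(subset)
```

The result depends only on the client identity and the host set, so it is stable across restarts and input ordering, and removing a host outside the chosen subset leaves the subset unchanged.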

per_worker_subset is, by design, a no-touch Envoy-local decision.
Each Envoy worker derives its own subset purely from (W, worker_id,
envoy_seed, partitioning_strategy) -- no metadata keys, no client
identity, no upstream cluster configuration changes, no EDS-side
coordination. Switching a cluster from RoundRobin to
per_worker_subset requires only flipping load_balancing_policy on
the cluster; upstream remains unchanged. This makes it operationally
cheap to roll out on existing clusters where the upstream owner
cannot or will not change the EDS payload, and makes the LB usable
as a connection-pool-sizing tool independent of any client-identity
hashing scheme.

The two extensions are therefore complementary rather than
overlapping: random_subsetting solves "every client must converge on
the same K endpoints"; per_worker_subset solves "every Envoy worker
must cap its outbound conn fanout to K endpoints". Happy to discuss
whether the proto packages should be co-located or share helper
messages, but the semantics do not collapse into a single mode.

Risk Level: Low (new extension only; no changes to existing code paths).
Testing: Unit tests covering both partitioning strategies, all three host-selection strategies, single/multi-worker degenerate cases, envoy_seed rotation, per-worker fallback for both strategies, and the fallback_threshold=0 edge case.
Docs Changes: New API proto includes inline protodoc; changelog entry in changelogs/current/new_features/.
Release Notes: Yes (added under new_features).

[Optional Fixes #45682]

Signed-off-by: sudheerv <sudheerv@apache.org>

repokitteh-read-only · 2026-07-05T01:40:57Z

Hi @sudheerv, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.


sudheerv · 2026-07-05T01:43:12Z

cc @jmarantz @DieBauer — you both commented on the design issue #45682 that this PR implements. Would appreciate your review when you have a chance!

sudheerv requested a deployment to external-contributors July 5, 2026 01:40 — with GitHub Actions Waiting

sudheerv wants to merge 1 commit into envoyproxy:main from sudheerv:per-worker-subset-lb
