upstream: add per-worker subset load balancer #45979

Open
sudheerv wants to merge 1 commit into envoyproxy:main from sudheerv:per-worker-subset-lb
sudheerv commented Jul 5, 2026

Design discussed in #45682.

Adds a new load balancing policy (envoy.load_balancing_policies.per_worker_subset) that gives each Envoy worker its own subset of upstream hosts. The motivation is large upstream clusters with short server-side HTTP/1 keepalive timeouts: stock LBs eventually open a connection from every worker to every host, producing tens of thousands of idle keepalive connections that the upstream then reaps. This LB caps per-worker fanout to K hosts so that each connection stays in active use inside the keepalive window. Fleet-wide coverage is preserved across multiple Envoys via a per-Envoy partition rotation seeded from the bootstrap node id.

The policy is two-axis:

  • partitioning_strategy selects how each worker's subset is derived:
    • EQUAL_PARTITIONS (default): deterministic non-overlapping slice of ceil(N/W) hosts from address-sorted order; per-Envoy starting offset rotation; subset_size acts as a kill-switch (>= N disables subsetting).
    • RANDOM_PARTITIONS: each worker samples subset_size hosts uniformly at random from healthy hosts.
  • host_selection_strategy selects how a host is picked within the worker's subset on each chooseHost call:
    • SIMPLE_ROUND_ROBIN: inline next_index_ % K round-robin.
    • ENVOY_ROUND_ROBIN: stock RoundRobinLoadBalancer over a synthetic PrioritySet; honors per-host weights, slow_start_config, and locality-weighted LB.
    • ENVOY_P2C: stock LeastRequestLoadBalancer (P2C on rq_active) over the same synthetic PrioritySet.
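
To make the EQUAL_PARTITIONS and SIMPLE_ROUND_ROBIN descriptions concrete, here is a minimal standalone sketch. It is not the code in this PR: the function and class names, the use of plain strings for host addresses, and the way envoy_seed rotates the starting offset are all assumptions made for illustration. It also assumes N is the host count and W the worker count, and that when N is not divisible by W the trailing workers simply receive a shorter (possibly empty) slice.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not the PR's code): returns the hosts owned by one
// worker under an EQUAL_PARTITIONS-style scheme.
std::vector<std::string> equalPartitionSlice(std::vector<std::string> hosts,
                                             std::size_t num_workers,
                                             std::size_t worker_id,
                                             std::size_t envoy_seed,
                                             std::size_t subset_size) {
  assert(num_workers > 0 && worker_id < num_workers);
  // Lexicographic string sort stands in for "address-sorted order".
  std::sort(hosts.begin(), hosts.end());
  const std::size_t n = hosts.size();
  if (subset_size >= n) {
    return hosts;  // kill switch: subset_size >= N disables subsetting
  }
  const std::size_t slice = (n + num_workers - 1) / num_workers;  // ceil(N/W)
  // Assumption: the per-Envoy rotation shifts every slice by envoy_seed % N.
  const std::size_t offset = envoy_seed % n;
  const std::size_t begin = std::min(worker_id * slice, n);
  const std::size_t end = std::min(begin + slice, n);
  std::vector<std::string> out;
  for (std::size_t pos = begin; pos < end; ++pos) {
    out.push_back(hosts[(pos + offset) % n]);
  }
  return out;
}

// Hypothetical SIMPLE_ROUND_ROBIN-style picker over one worker's subset.
class SimpleRoundRobin {
public:
  explicit SimpleRoundRobin(std::vector<std::string> subset)
      : subset_(std::move(subset)) {}

  // Returns nullptr when the subset is empty, otherwise the next host in
  // next_index_ % K order.
  const std::string* chooseHost() {
    if (subset_.empty()) {
      return nullptr;
    }
    return &subset_[next_index_++ % subset_.size()];
  }

private:
  std::vector<std::string> subset_;
  std::size_t next_index_ = 0;
};
```

In this sketch, four hosts and two workers give each worker two hosts. Changing envoy_seed from 0 to 1 shifts both slices by one position in the sorted order, which is one way that different Envoys could end up covering different host pairs.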

Per-worker fallback is computed against each worker's own slice rather than the cluster-wide healthy fraction, eliminating the synchronized "every worker on this Envoy flips at once" connection-pool churn that a cluster-wide threshold would cause. The fallback_threshold field (percent points, [0, 100]) is independent of the cluster's common_lb_config.healthy_panic_threshold.
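
A small sketch of the per-slice threshold check described above. The function name, the strict less-than comparison, and the treatment of an empty slice are assumptions for illustration; the PR text does not spell these out, nor what the worker selects once fallback triggers.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical predicate (not the PR's code): true when the healthy share of
// this worker's own slice is below fallback_threshold percent.
bool sliceNeedsFallback(std::size_t healthy_in_slice, std::size_t slice_size,
                        unsigned fallback_threshold) {
  assert(fallback_threshold <= 100);       // percent points in [0, 100]
  assert(healthy_in_slice <= slice_size);
  if (slice_size == 0) {
    return false;  // assumption: an empty slice is handled elsewhere
  }
  // Integer form of healthy / slice < threshold / 100.
  return healthy_in_slice * 100 <
         static_cast<std::size_t>(fallback_threshold) * slice_size;
}
```

Under these assumptions, a threshold of 0 never triggers fallback. Because the check only looks at one worker's slice, a worker whose slice is fully healthy keeps its connections even while another worker on the same Envoy falls back.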

Extension-owned stats follow the cluster.<name>.lb_<lbname>_<event> convention (matching the subset LB's lb_subsets_* prefix): lb_per_worker_subset_rebuilds, lb_per_worker_subset_slice_fallback, lb_per_worker_subset_slice_empty_healthy, lb_per_worker_subset_empty_returns.

Overlap / non-overlap with the reserved envoy.extensions.load_balancing_policies.random_subsetting.v3 proto:

random_subsetting is a per-client subsetting model. Its semantics
require the cluster's lb_subset_config (or equivalent xDS-side
configuration of the subset hash / metadata keys) to be set so that
every client deterministically picks the same subset_size endpoints
via rendezvous hashing. The decision is shaped by the upstream
cluster's xDS config.

per_worker_subset is, by design, a no-touch Envoy-local decision.
Each Envoy worker derives its own subset purely from (W, worker_id,
envoy_seed, partitioning_strategy) -- no metadata keys, no client
identity, no upstream cluster configuration changes, no EDS-side
coordination. Switching a cluster from RoundRobin to
per_worker_subset requires only flipping load_balancing_policy on
the cluster; upstream remains unchanged. This makes it operationally
cheap to roll out on existing clusters where the upstream owner
cannot or will not change the EDS payload, and makes the LB usable
as a connection-pool-sizing tool independent of any client-identity
hashing scheme.
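
As an illustration of a subset derived from Envoy-local inputs alone, here is a rough sketch for the RANDOM_PARTITIONS case. It is not the PR's implementation: the function name, the integer envoy_seed (assumed to be derived from the bootstrap node id), and seeding a std::mt19937 from (envoy_seed, worker_id) are all assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <vector>

// Hypothetical helper (not the PR's code): picks up to subset_size hosts
// uniformly at random from the healthy hosts, using only local inputs.
std::vector<std::string> randomPartition(std::vector<std::string> healthy_hosts,
                                         std::size_t subset_size,
                                         std::uint32_t envoy_seed,
                                         std::uint32_t worker_id) {
  assert(subset_size > 0 && "sketch assumes a positive subset_size");
  // Assumption: each (Envoy, worker) pair gets its own reproducible stream.
  std::seed_seq seq{envoy_seed, worker_id};
  std::mt19937 rng(seq);
  // Shuffling and truncating yields a uniform random subset of size K.
  std::shuffle(healthy_hosts.begin(), healthy_hosts.end(), rng);
  if (subset_size < healthy_hosts.size()) {
    healthy_hosts.resize(subset_size);
  }
  return healthy_hosts;
}
```

Nothing in this sketch reads endpoint metadata or client identity, which matches the no-touch property claimed above. The same inputs give the same subset on a given standard library, since the output of std::shuffle is implementation-defined.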

The two extensions are therefore complementary rather than
overlapping: random_subsetting solves "every client must converge on
the same K endpoints"; per_worker_subset solves "every Envoy worker
must cap its outbound conn fanout to K endpoints". Happy to discuss
whether the proto packages should be co-located or share helper
messages, but the semantics do not collapse into a single mode.

Risk Level: Low (new extension only; no changes to existing code paths).
Testing: Unit tests covering both partitioning strategies, all three host-selection strategies, single/multi-worker degenerate cases, envoy_seed rotation, per-worker fallback for both strategies, and the fallback_threshold=0 edge case.
Docs Changes: New API proto includes inline protodoc; changelog entry in changelogs/current/new_features/.
Release Notes: Yes (added under new_features).

Signed-off-by: sudheerv <sudheerv@apache.org>
@repokitteh-read-only

Hi @sudheerv, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.

🐱

Caused by: #45979 was opened by sudheerv.

see: more, trace.

sudheerv requested a deployment to external-contributors on July 5, 2026 at 01:40 with GitHub Actions (status: Waiting)

sudheerv commented Jul 5, 2026

cc @jmarantz @DieBauer — you both commented on the design issue #45682 that this PR implements. Would appreciate your review when you have a chance!

Successfully merging this pull request may close these issues.

Proposal: new LB extension envoy.load_balancing_policies.per_worker_subset (per-Envoy-worker subsetting, no upstream/xDS coordination)