upstream: add per-worker subset load balancer #45979
Open
sudheerv wants to merge 1 commit into
Adds a new load balancing policy
(envoy.load_balancing_policies.per_worker_subset) that gives each Envoy
worker its own subset of upstream hosts. Motivated by large upstream
clusters with short server-side HTTP/1 keepalive timeouts: stock LBs
eventually open a connection from every worker to every host, producing
tens of thousands of idle keepalive conns the upstream reaps. This LB
caps per-worker fanout to K hosts so each conn stays in active use
inside the keepalive window. Fleet-wide coverage is preserved across
multiple Envoys via a per-Envoy partition rotation seeded from the
bootstrap node id.
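The fanout arithmetic behind this motivation can be sketched as follows. This is an illustration only: the helper names are hypothetical, and it assumes at most one connection per (worker, host) pair, which is a simplification and not a claim about how Envoy's connection pools behave.

```cpp
#include <cassert>
#include <cstdint>

// ceil(a / b) for non-negative a and positive b.
std::uint64_t ceilDiv(std::uint64_t a, std::uint64_t b) {
  assert(b > 0);
  return (a + b - 1) / b;
}

// Upper bound on upstream connections held by a fleet, assuming at most one
// connection per (worker, host) pair. Hypothetical helper, not extension code.
std::uint64_t fleetConnections(std::uint64_t envoys,
                               std::uint64_t workers_per_envoy,
                               std::uint64_t hosts_per_worker) {
  return envoys * workers_per_envoy * hosts_per_worker;
}
```

With made-up numbers (10 Envoys, 16 workers each, 500 upstream hosts), a stock LB can reach fleetConnections(10, 16, 500) = 80000 connections. Capping each worker at K = ceilDiv(500, 16) = 32 hosts bounds the fleet at fleetConnections(10, 16, 32) = 5120.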
The policy is two-axis:
- partitioning_strategy selects how each worker's subset is derived:
  - EQUAL_PARTITIONS (default): deterministic non-overlapping slice
    of ceil(N/W) hosts from address-sorted order; per-Envoy starting
    offset rotation; subset_size acts as a kill-switch (>= N disables
    subsetting).
  - RANDOM_PARTITIONS: each worker samples subset_size hosts uniformly
    at random from healthy hosts.
- host_selection_strategy selects how a host is picked within the
  worker's subset on each chooseHost call:
  - SIMPLE_ROUND_ROBIN: inline next_index_ % K round-robin.
  - ENVOY_ROUND_ROBIN: stock RoundRobinLoadBalancer over a synthetic
    PrioritySet; honors per-host weights, slow_start_config, and
    locality-weighted LB.
  - ENVOY_P2C: stock LeastRequestLoadBalancer (P2C on rq_active) over
    the same synthetic PrioritySet.
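A minimal sketch of the EQUAL_PARTITIONS slice and the SIMPLE_ROUND_ROBIN pick, under stated assumptions. equalPartitionSlice, SimpleRoundRobin, and the envoy_offset parameter are hypothetical names. Hosts are plain strings sorted lexicographically as a stand-in for address order, and envoy_offset stands in for whatever rotation the extension derives from the bootstrap node id. How the real code distributes the remainder when W does not divide N is not stated in this description, so the handling below is a guess.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Returns the slice of ceil(N/W) hosts owned by `worker_id`.
// Slices of different workers never overlap because each worker owns a
// distinct range of positions, and rotating by a fixed offset is a bijection.
std::vector<std::string> equalPartitionSlice(std::vector<std::string> hosts,
                                             std::size_t num_workers,
                                             std::size_t worker_id,
                                             std::size_t envoy_offset) {
  assert(num_workers > 0 && worker_id < num_workers);
  std::sort(hosts.begin(), hosts.end()); // stand-in for address-sorted order
  const std::size_t n = hosts.size();
  std::vector<std::string> slice;
  if (n == 0) {
    return slice;
  }
  const std::size_t k = (n + num_workers - 1) / num_workers; // ceil(N/W)
  const std::size_t begin = worker_id * k;
  const std::size_t end = std::min(begin + k, n);
  // Trailing workers may receive a short or empty slice (assumption).
  for (std::size_t i = begin; i < end; ++i) {
    slice.push_back(hosts[(i + envoy_offset) % n]);
  }
  return slice;
}

// Inline next_index_ % K round-robin over a worker's subset.
class SimpleRoundRobin {
public:
  explicit SimpleRoundRobin(std::vector<std::string> subset)
      : subset_(std::move(subset)) {}

  // Returns nullptr when the subset is empty.
  const std::string* chooseHost() {
    if (subset_.empty()) {
      return nullptr;
    }
    const std::string& host = subset_[next_index_ % subset_.size()];
    ++next_index_;
    return &host;
  }

private:
  std::vector<std::string> subset_;
  std::uint64_t next_index_{0};
};
```

With five hosts and two workers, worker 0 gets three hosts and worker 1 gets the remaining two; a non-zero envoy_offset shifts which hosts each worker sees without introducing overlap.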
Per-worker fallback is computed against each worker's own slice rather
than the cluster-wide healthy fraction, eliminating the synchronized
"every worker on this Envoy flips at once" connection-pool churn that
a cluster-wide threshold would cause. The fallback_threshold field
(a percentage in the range [0, 100]) is independent of the cluster's
common_lb_config.healthy_panic_threshold.
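The per-slice check can be sketched as below. sliceNeedsFallback is a hypothetical name, and the strict less-than comparison is an assumption; the description does not say whether the boundary is inclusive, nor what a worker falls back to.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true when the healthy share of this worker's own slice is below
// fallback_threshold_percent. Uses integer arithmetic to compare
// healthy/slice_size against threshold/100 without floating point.
// An empty slice yields false here; the extension may treat it differently.
bool sliceNeedsFallback(std::size_t healthy_in_slice, std::size_t slice_size,
                        std::uint32_t fallback_threshold_percent) {
  assert(fallback_threshold_percent <= 100);
  assert(healthy_in_slice <= slice_size);
  return healthy_in_slice * 100 <
         static_cast<std::size_t>(fallback_threshold_percent) * slice_size;
}
```

Because each worker evaluates only its own slice, a worker holding 2 healthy hosts out of 10 would fall back at a threshold of 50 while a neighbor holding 9 healthy out of 10 would not. Under this sketch a threshold of 0 never triggers fallback.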
Extension-owned stats follow the cluster.<name>.lb_<lbname>_<event>
convention (matching the subset LB's lb_subsets_* prefix):
lb_per_worker_subset_rebuilds, lb_per_worker_subset_slice_fallback,
lb_per_worker_subset_slice_empty_healthy,
lb_per_worker_subset_empty_returns.
Overlap / non-overlap with the reserved
envoy.extensions.load_balancing_policies.random_subsetting.v3 proto:
random_subsetting is a per-client subsetting model. Its semantics
require the cluster's lb_subset_config (or equivalent xDS-side
configuration of the subset hash / metadata keys) to be set so that
every client deterministically picks the same subset_size endpoints
via rendezvous hashing. The decision is shaped by the upstream
cluster's xDS config.
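For contrast, a generic rendezvous (highest-random-weight) hashing sketch is shown below. It illustrates the general technique only and is not the random_subsetting implementation. rendezvousSubset and client_id are hypothetical names, and std::hash is used purely for illustration; a real deployment would need a hash that is stable across processes and builds.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Scores every host against the client identity and keeps the subset_size
// highest-scoring hosts. The result depends only on (client_id, host set),
// not on the order in which hosts are listed.
std::vector<std::string> rendezvousSubset(const std::string& client_id,
                                          const std::vector<std::string>& hosts,
                                          std::size_t subset_size) {
  using Scored = std::pair<std::size_t, std::string>;
  std::vector<Scored> scored;
  scored.reserve(hosts.size());
  for (const std::string& host : hosts) {
    scored.emplace_back(std::hash<std::string>{}(client_id + "|" + host), host);
  }
  // Highest score first; ties broken by host name so the order is total.
  std::sort(scored.begin(), scored.end(), [](const Scored& a, const Scored& b) {
    if (a.first != b.first) {
      return a.first > b.first;
    }
    return a.second < b.second;
  });
  std::vector<std::string> subset;
  const std::size_t k = std::min(subset_size, scored.size());
  for (std::size_t i = 0; i < k; ++i) {
    subset.push_back(scored[i].second);
  }
  return subset;
}
```

The input here is a client identity, whereas the per_worker_subset derivation described next takes only Envoy-local values.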
per_worker_subset is, by design, a no-touch Envoy-local decision.
Each Envoy worker derives its own subset purely from (W, worker_id,
envoy_seed, partitioning_strategy) -- no metadata keys, no client
identity, no upstream cluster configuration changes, no EDS-side
coordination. Switching a cluster from RoundRobin to
per_worker_subset requires only flipping load_balancing_policy on
the cluster; upstream remains unchanged. This makes it operationally
cheap to roll out on existing clusters where the upstream owner
cannot or will not change the EDS payload, and makes the LB usable
as a connection-pool-sizing tool independent of any client-identity
hashing scheme.
The two extensions are therefore complementary rather than
overlapping: random_subsetting solves "every client must converge on
the same K endpoints"; per_worker_subset solves "every Envoy worker
must cap its outbound conn fanout to K endpoints". Happy to discuss
whether the proto packages should be co-located or share helper
messages, but the semantics do not collapse into a single mode.
Risk Level: Low (new extension only; no changes to existing code paths).
Testing: Unit tests covering both partitioning strategies, all three
host-selection strategies, single/multi-worker degenerate cases,
envoy_seed rotation, per-worker fallback for both strategies, and
the fallback_threshold=0 edge case.
Docs Changes: New API proto includes inline protodoc; changelog entry
in changelogs/current/new_features/.
Release Notes: Yes (added under new_features).
Signed-off-by: sudheerv <sudheerv@apache.org>
Design discussed in #45682.