Proposal: new LB extension envoy.load_balancing_policies.per_worker_subset (per-Envoy-worker subsetting, no upstream/xDS coordination)

*Title*: Proposal: new LB extension `envoy.load_balancing_policies.per_worker_subset` (per-Envoy-worker subsetting, no upstream/xDS coordination)

*Description*:

I'd like to propose a new load balancing policy extension, `envoy.load_balancing_policies.per_worker_subset`, that gives each Envoy worker its own subset of upstream hosts and load-balances (round-robin, least-request, etc.) within only that subset. Filing as a design proposal per [EXTENSION_POLICY.md](https://github.com/envoyproxy/envoy/blob/main/EXTENSION_POLICY.md); a complete PR (~1600 LOC including unit tests, proto, build wiring, changelog) is implemented and held back pending design ack.

We are running this custom subset extension in production and have observed a dramatic reduction in upstream connection count (up to 25x). Let me know if you would recommend making this a contrib extension rather than a core extension; I'm happy to be a user sponsor.

## Motivation

For Envoys fanning out to large upstream clusters (hundreds to thousands of hosts) where the upstream enforces a short server-side HTTP/1 keepalive timeout, stock per-cluster LBs (RoundRobin, LeastRequest) eventually open one connection from every worker to every upstream host. The result on real workloads is a per-Envoy upstream conn count in the hundreds of thousands (on a machine with 128 vcores), dominated by upstream-initiated closes (the backend reaping idle conns inside its keepalive window) and a high lifetime conn-create rate. The conn pool is mostly idle — neither the per-conn QPS nor the request/keepalive ratio supports keeping that many sockets warm. Operational graphs will be attached as a follow-up comment.

The root cause is structural: Envoy's per-worker connection-pool model means every worker independently maintains pools to every host, even when the per-worker QPS makes that pool size unjustifiable. Capping per-worker fanout to a subset of `K` hosts collapses the conn count to `W × K` and pushes each conn into the active-use band of the upstream's keepalive window.
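To make the `W × K` arithmetic concrete, here is a back-of-the-envelope sketch. The host count and subset size are illustrative values, not measurements from our fleet, and the helper name is hypothetical. It assumes one connection per worker/host pair, which is a lower bound for HTTP/1 pools under load.

```python
def upstream_conns(num_workers: int, hosts_per_worker: int, conns_per_pair: int = 1) -> int:
    """Connections held by one Envoy if each worker keeps a pool to each of its hosts."""
    return num_workers * hosts_per_worker * conns_per_pair


W = 128    # workers, matching the 128-vcore machine mentioned above
N = 2000   # upstream hosts (illustrative)
K = 16     # per-worker subset size (illustrative)

full_fanout = upstream_conns(W, N)   # every worker pools to every host
capped = upstream_conns(W, K)        # every worker pools to its K hosts only

print(full_fanout, capped)           # 256000 2048
```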

## Design overview

The extension is **two-axis** — partitioning strategy × within-subset host-selection strategy:

### Partitioning strategies

1. **EQUAL_PARTITIONS** (default): deterministic, non-overlapping slice of `K = ceil(N/W)` hosts from a stable address-sorted ordering. Partition shape is perfectly even. Starting offset is rotated per-Envoy via a hash of the bootstrap node id so the "which worker IDs are pinned to which hosts" mapping varies across the fleet; within a single Envoy, disjointness is preserved.
2. **RANDOM_PARTITIONS**: each worker independently samples `subset_size` hosts uniformly at random from healthy hosts. Probabilistic coverage; re-samples on each membership update.
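The two strategies can be sketched roughly as follows. This is a Python illustration, not the C++ implementation: the function names, the SHA-256 seed derivation, and the "rotate the sorted list, then cut contiguous slices" layout are all assumptions made for the example. In this sketch the last slice is shorter when `W` does not divide `N`; the actual extension may handle the remainder differently.

```python
import hashlib
import math
import random


def envoy_seed(node_id: str) -> int:
    """Stable integer derived from the bootstrap node id (hash choice is an assumption)."""
    return int.from_bytes(hashlib.sha256(node_id.encode()).digest()[:8], "big")


def equal_partition(hosts, num_workers, worker_id, seed):
    """Non-overlapping slice of up to ceil(N/W) hosts for one worker."""
    ordered = sorted(hosts)                  # stable address-sorted ordering
    n = len(ordered)
    if n == 0:
        return []
    k = math.ceil(n / num_workers)
    offset = seed % n                        # per-Envoy rotation of the starting point
    rotated = ordered[offset:] + ordered[:offset]
    return rotated[worker_id * k:(worker_id + 1) * k]


def random_partition(healthy_hosts, subset_size, rng):
    """Uniform sample without replacement; an oversized subset_size is clamped."""
    return rng.sample(healthy_hosts, min(subset_size, len(healthy_hosts)))


hosts = [f"10.0.0.{i}:8080" for i in range(1, 11)]   # N = 10
seed = envoy_seed("envoy-node-a")                     # hypothetical node id
slices = [equal_partition(hosts, 4, w, seed) for w in range(4)]
sample = random_partition(hosts, 4, random.Random(7))
```

Within one Envoy the slices are disjoint and together cover every host; a different node id only changes which hosts land in which worker's slice.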

### Host-selection strategies (orthogonal to partitioning)

1. **SIMPLE_ROUND_ROBIN**: inline `next_index_ % K` round-robin over the worker's subset. No weight handling, no slow-start.
2. **ENVOY_ROUND_ROBIN**: delegate to the stock `RoundRobinLoadBalancer` constructed against the worker's subset via a synthetic `PrioritySet`. Honors per-host weights (EDF), `slow_start_config`, and locality-weighted LB.
3. **ENVOY_P2C**: delegate to the stock `LeastRequestLoadBalancer` (P2C on `rq_active`) over the same synthetic `PrioritySet`.
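For intuition, the inline round-robin and the power-of-two-choices idea look roughly like this in Python. The class and function names are hypothetical, and `p2c_choose` only illustrates the P2C principle over a subset; it is not the stock `LeastRequestLoadBalancer`, which the extension delegates to.

```python
import random


class SimpleRoundRobin:
    """Inline round-robin over one worker's subset: no weights, no slow-start."""

    def __init__(self, subset):
        self.subset = list(subset)
        self.next_index = 0

    def choose_host(self):
        if not self.subset:
            return None
        host = self.subset[self.next_index % len(self.subset)]
        self.next_index += 1
        return host


def p2c_choose(subset, rq_active, rng):
    """Draw two candidates (possibly the same host) and keep the less loaded one."""
    a, b = rng.choice(subset), rng.choice(subset)
    return a if rq_active[a] <= rq_active[b] else b


rr = SimpleRoundRobin(["h1", "h2", "h3"])
picks = [rr.choose_host() for _ in range(7)]
print(picks)   # ['h1', 'h2', 'h3', 'h1', 'h2', 'h3', 'h1']
```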

### Per-worker fallback

Unlike stock LBs, which apply the healthy-fraction check cluster-wide, this extension applies the fallback decision **per-worker, scoped to each worker's own slice**. A brief deploy that flips 50% of hosts unhealthy no longer trips every worker on the same Envoy simultaneously: only the workers whose K-host slice happens to overlap a heavily unhealthy band fall back. The other workers keep their stable healthy-only subset. This eliminates the synchronized conn-pool churn across all workers that a cluster-wide threshold would cause.

The `fallback_threshold` field is expressed in percent, in the range `[0, 100]`, and defaults to 50. It is independent of `common_lb_config.healthy_panic_threshold` (which remains scoped to the cluster).
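A sketch of the per-slice decision, under stated assumptions: the function name is hypothetical, the comparison is taken to be strict (so a threshold of 0 never triggers fallback), and an empty slice is treated as "no fallback". The example splits 8 hosts across 2 workers with half the cluster unhealthy, concentrated in one slice.

```python
def slice_should_fall_back(slice_health, fallback_threshold=50):
    """slice_health holds one bool per host in this worker's slice (True = healthy)."""
    if not slice_health:
        return False                      # assumption: nothing to fall back over
    healthy_pct = 100.0 * sum(slice_health) / len(slice_health)
    return healthy_pct < fallback_threshold


worker0 = [True, False, False, False]     # 25% healthy
worker1 = [True, True, True, False]       # 75% healthy

print(slice_should_fall_back(worker0))    # True
print(slice_should_fall_back(worker1))    # False
```

Cluster-wide, exactly 50% of hosts are healthy, yet only worker 0 changes behavior; worker 1 keeps its healthy-only subset.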

### Stats

Follows the `cluster.<name>.lb_<lbname>_<event>` convention (matching `lb_subsets_*` on the existing subset LB):
- `lb_per_worker_subset_rebuilds`
- `lb_per_worker_subset_slice_fallback`
- `lb_per_worker_subset_slice_empty_healthy`
- `lb_per_worker_subset_empty_returns`

## Relationship to existing reserved `random_subsetting` proto

There's a reserved-but-unimplemented `envoy.extensions.load_balancing_policies.random_subsetting.v3` package in the tree (`[#not-implemented-hide:]`). I want to be explicit about why I'm proposing a separate extension rather than implementing that proto:

**`random_subsetting` is a per-client model.** Its semantics require the cluster's `lb_subset_config` (or equivalent xDS-side configuration of subset hash / metadata keys) to be set so every client deterministically picks the same `subset_size` endpoints via rendezvous hashing. The decision is shaped by upstream cluster xDS config.

**`per_worker_subset` is, by design, a no-touch Envoy-local decision.** Each Envoy worker derives its subset purely from `(W, worker_id, envoy_seed, partitioning_strategy)` — no metadata keys, no client identity, no upstream cluster configuration changes, no EDS-side coordination. Switching a cluster from RoundRobin to `per_worker_subset` requires only flipping `load_balancing_policy` on the cluster; the upstream service owner is not involved.

This makes the extension operationally cheap to roll out on existing clusters where the upstream owner cannot or will not change the EDS payload, and lets it function as a connection-pool-sizing tool independent of any client-identity hashing scheme.

The two are therefore complementary rather than overlapping: `random_subsetting` solves *"every client must converge on the same K endpoints"*; `per_worker_subset` solves *"every Envoy worker must cap its outbound conn fanout to K endpoints"*. Open to discussion if the maintainers see a path to merging them under a shared proto with mode discriminators, but the semantics don't collapse into a single mode.

## Alternatives considered

| Alternative | Why insufficient |
|---|---|
| Existing subset LB (`envoy.load_balancing_policies.subset`) | Requires upstream EDS metadata + matching client metadata. Doesn't solve the conn-fanout problem without upstream cooperation, and the unit is the subset definition, not the worker. |
| Tighter `idle_timeout` on Envoy's HTTP/1 connection pool | Helps reduce idle conns but doesn't address fanout breadth; you still open conn-to-every-host before it idles. |
| `random_subsetting` (when implemented) | Per-client semantics; needs xDS-side config; doesn't help the conn-pool-sizing case where the same client (Envoy) needs to cap its OWN per-worker fanout. |
| Manually sharding the cluster into N smaller clusters | Operationally expensive; loses per-host-weight semantics; doesn't work when the upstream is owned by another team. |

## Risks and trade-offs

- **Priority-unaware.** Routes only to priority 0. Clusters that need true priority routing (DEFAULT vs HIGH spillover) should keep using a different LB.
- **Locality-unaware at the partition step.** Doesn't honor `common_lb_config.zone_aware_lb_config` at the partition level (the inner stock LB still honors its own locality config over the subset for `ENVOY_ROUND_ROBIN` / `ENVOY_P2C`).
- **EQUAL_PARTITIONS slice membership shifts when N changes** (e.g., EDS adds/removes hosts). Health changes alone do not shift slice positions; fallback is handled within each worker's own slice.
- **Subset isn't sticky across Envoy restarts.** Workers get a fresh partition on each restart.
- **K resolution at config-load time.** Total worker count `W` is resolved from `bootstrap.concurrency` (or `hardware_concurrency()` fallback). For cgroup-limited environments (Kubernetes), `--concurrency` must be set explicitly or `K` will be sized off host CPU rather than cgroup CPU.
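The effect of the last point on `K` can be illustrated with a small sketch. The helper names are hypothetical, `os.cpu_count()` stands in for `hardware_concurrency()`, and the host count and CPU figures are made up for the example.

```python
import math
import os


def resolve_worker_count(concurrency=None):
    """An explicit concurrency wins; otherwise fall back to the visible CPU count."""
    if concurrency is not None and concurrency > 0:
        return concurrency
    return os.cpu_count() or 1


def slice_size(num_hosts, num_workers):
    """K = ceil(N / W), as used by EQUAL_PARTITIONS."""
    return math.ceil(num_hosts / num_workers)


N = 1024
k_sized_off_host = slice_size(N, resolve_worker_count(128))   # W taken from a 128-CPU host
k_sized_off_cgroup = slice_size(N, resolve_worker_count(4))   # W set explicitly to a 4-CPU limit

print(k_sized_off_host, k_sized_off_cgroup)                   # 8 256
```

The same cluster yields very different slice sizes depending on which `W` is used, which is why the explicit `--concurrency` setting matters in cgroup-limited pods.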

## Quality and operational fit

- All four extension files (proto + LB + factory + tests) are written, single commit, signed off. Stats follow upstream convention. CODEOWNERS, `extensions_metadata.yaml`, `extensions_build_config.bzl`, `api/versioning/BUILD`, and changelog are all updated.
- Unit tests cover: both partitioning strategies; all three host-selection strategies; single/multi-worker cases; `N < W` degenerate case; envoy_seed rotation; per-worker fallback for both strategies; `fallback_threshold=0` edge case; oversized `subset_size`; empty cluster; counter increment for fallback path.
- `security_posture: robust_to_untrusted_downstream_and_upstream`. `status: alpha` initially.

## Asks for the maintainers

1. **Sponsor**: I'm looking for an LB-area maintainer to sponsor per [EXTENSION_POLICY.md](https://github.com/envoyproxy/envoy/blob/main/EXTENSION_POLICY.md). Happy to do a synchronous walk-through if helpful.
2. **Second CODEOWNERS reviewer**: my draft CODEOWNERS line currently has `@sudheerv @UNOWNED`; need a real second reviewer before merge.
3. **Naming**: the extension is currently `envoy.load_balancing_policies.per_worker_subset`. Alternatives considered: `worker_local_subset`, `worker_partition`. Happy to rename.
4. **`random_subsetting` overlap**: input on whether the proto packages should be co-located, share helper messages, or stay separate. My current view is "separate" given the semantic difference (Envoy-local vs xDS-coordinated), but I'm open.
5. **`api/STYLE.md` adherence**: the proto follows the convention I see in `random/v3` and `least_request/v3`. If there are conventions I missed, please flag.

## PR status

A complete implementation is ready locally (~1600 LOC including unit tests). I'm holding back the PR until this design is ack'd per `EXTENSION_POLICY.md`. Once a sponsor is identified and naming/scope is agreed, I will open the PR and reference this issue.

*Relevant Links*:
- `EXTENSION_POLICY.md` (this proposal follows the procedure documented there)
- Existing `random_subsetting` reserved proto for context: `api/envoy/extensions/load_balancing_policies/random_subsetting/v3/random_subsetting.proto`
- Existing `subset` LB for the closest existing analog: `source/extensions/load_balancing_policies/subset/`


Proposal: new LB extension envoy.load_balancing_policies.per_worker_subset (per-Envoy-worker subsetting, no upstream/xDS coordination) #45682
