feat(tonic-xds): implement gRFC A50 outlier detection (failure-percentage) + LoadBalancer integration#2619
Conversation
Implement the core gRFC A50 outlier-detection algorithm: per-endpoint
success/failure counters, the success-rate and failure-percentage
ejection algorithms, the ejection-multiplier state machine, and a
periodic sweep task that emits ejection/un-ejection decisions on a
channel.
`run_sweep` is pure (returns a Vec<EjectionDecision>); the sweep loop
spawned by `OutlierDetector::spawn` owns the channel sender and
forwards decisions, so dropping the returned `AbortOnDrop` ends the
loop and closes the receiver. Tests drive `run_sweep` directly without
the channel or tokio time mechanics.
Algorithm coverage matches the gRFC:
- Success-rate ejection with configurable `stdev_factor`,
`enforcing_success_rate`, `minimum_hosts`, `request_volume`.
- Failure-percentage ejection with `threshold`, `enforcing_failure_
percentage`, `minimum_hosts`, `request_volume`.
- Ejection multiplier increments on each ejection, decays on healthy
intervals; ejection duration is `base * multiplier` capped at
`max(base, max_ejection_time)`.
- `max_ejection_percent` caps total concurrent ejections.
Probability rolls go through an injectable `Rng` trait (defaulting to
`fastrand`) so tests can pin enforcement decisions.
Standalone in this PR — no integration with the load balancer yet.
That lands in a follow-up alongside the per-endpoint outcome
interception layer.
Refs: https://github.com/grpc/proposal/blob/master/A50-xds-outlier-detection.md
Address two follow-up review comments from grpc#2604 (the merged config PR) by folding the doc updates into this PR: - Module docstring: describe the actual integration plan (an mpsc channel of EjectionDecisions polled by LoadBalancer, leveraging EjectedChannel) instead of the original "filter on the Discover stream" wording. Add intra-doc links to the relevant types. - enforcing_success_rate / enforcing_failure_percentage: clarify that each is the *enforcement probability* — distinct from the per-algorithm threshold (stdev_factor for success-rate, threshold for failure-percentage). Note that 0 disables enforcement while still computing statistics. Also fix an unresolved intra-doc link in the algorithm module.
Three spec-compliance fixes to `run_sweep` and the failure-percentage
algorithm:
1. Reorder the sweep to match A50 step order: snapshot counters → run
success-rate algorithm → run failure-percentage algorithm → step-5
housekeeping (decrement non-ejected multipliers, un-eject elapsed
ejections). The previous order (un-eject before algorithms) caused
spurious `Uneject` decisions whenever the same sweep also re-ejected
the address. Per spec, re-ejection refreshes `ejected_at` to `now`
before the un-eject check runs, so no transient un-eject is emitted.
2. Drop the `total > 0` traffic gate from the multiplier-decrement
step. A50 says a non-ejected address with multiplier > 0 has its
multiplier decremented every sweep, regardless of whether it
received traffic that interval.
3. Failure-percentage now uses strict `>` against the threshold (was
`>=`). Per A50: "If the address's failure percentage is greater
than `failure_percentage_ejection.threshold`..." — an address
sitting exactly at the threshold is not ejected.
Also: drop the explicit "skip ejected hosts from candidate list" pre-
filter. Per spec the algorithms iterate every address; ejected hosts
naturally fail the `request_volume` gate since they receive no traffic
in production. Behavior on real workloads is unchanged.
Test changes:
- `re_ejection_doubles_duration` now asserts a single `Eject`
decision (no transient `Uneject`) under the corrected sweep order.
- New `failure_percentage_at_threshold_does_not_eject` covers the
strict-`>` boundary.
- New `multiplier_decrements_even_without_traffic` covers the
no-traffic-gate fix.
Drop the success-rate algorithm and its tests from this PR so the
outlier-detection PR is minimal and stand-alone. The scaffolding
(sweep loop, multiplier state, counters, max-ejection-percent budget)
is unchanged and still exercised by the failure-percentage algorithm
plus the multiplier / un-eject / cap tests.
If `OutlierDetectionConfig.success_rate` is set on the cluster, it is
currently ignored. Documented in the module docstring with a pointer
to the follow-up PR.
Removes:
- `OutlierDetector::run_success_rate` (mean / variance / sqrt math).
- `success_rate` dispatch in `run_sweep`.
- `run_failure_percentage`'s `!out.contains` filter — dead now that
only one algorithm runs per sweep.
- `success_rate_ejects_outlier_below_threshold` test.
- `success_rate_no_ejection_when_all_uniform` test.
- The `sr_config` test helper.
- Unused `SuccessRateConfig` import.
Switch from `mpsc::unbounded_channel` to `mpsc::channel(256)` for the
ejection-decision stream that the sweep loop emits.
The decisions are edge-triggered (`Eject`/`Uneject` transitions, not
state snapshots), so the consumer must process every event in order;
we can't drop or coalesce. But we don't want unbounded memory growth
either if the consumer stalls. A bounded channel gives us:
- Same correctness as unbounded — no events dropped, ordered delivery.
- Bounded memory.
- Natural backpressure: when the buffer fills, `tx.send().await`
parks the sweep task, which (combined with `MissedTickBehavior::
Skip`) throttles sweep cadence to whatever rate the consumer can
drain. Computing more decisions than the consumer can apply just
widens the desync.
Capacity is 256 — at most `2 * num_endpoints` decisions per sweep, so
this buffers several sweeps' worth of decisions for clusters of typical
size. A docstring on `DECISIONS_CHANNEL_CAPACITY` captures the
rationale for future readers.
Replace `spawn_with_rng` with `spawn_with`, taking an `OutlierDetectorOptions` struct that bundles the RNG and the new configurable `decisions_channel_capacity`. Defaults are unchanged (`fastrand` RNG, capacity 256). The hard-coded constant becomes `DEFAULT_DECISIONS_CHANNEL_CAPACITY` and is no longer the only knob — production callers may want to bump the bound for clusters with very large endpoint sets (worst case `2 * num_endpoints` decisions per sweep) or unusually slow consumers. Using a struct instead of a long argument list means future runtime knobs (custom Tokio runtime, alternate backoff policies, observability hooks, …) can be added without breaking call sites — callers typically construct via `..Default::default()`. The xDS-derived `OutlierDetectionConfig` stays separate from these host-side runtime knobs, keeping a clean line between "what the xDS proto specifies" and "how this binary chooses to host it."
…tests
Both `sweep_loop_emits_decisions_on_tick` and
`dropping_abort_stops_sweep_loop` previously used `tokio::time::sleep`
in `start_paused = true` mode. That works through the runtime's
auto-advance heuristic for parked tasks, but the heuristic is sensitive
to the order of pending wake-ups across multiple tasks and can be
flaky in practice.
- `sweep_loop_emits_decisions_on_tick`: switch to
`tokio::time::advance(150ms)` which explicitly moves the clock and
yields until pending wake-ups have been polled — deterministic.
- `dropping_abort_stops_sweep_loop`: drop the artificial sleep
altogether. Aborting the JoinHandle wakes the spawned task
synchronously; the runtime polls it, the harness observes the
abort, and the task ends — dropping its sender. `rx.recv().await`
parks briefly while that happens and then returns `None`. No time
advancement needed.
Stress-tested both tests 50× back-to-back: all pass.
Rewrite the doc comment to be reference documentation rather than a
design narrative. Drops the editorializing ("the right behavior") and
the first-person reasoning, keeps the three things a developer needs:
what the constant controls, why this size, what happens at capacity
(and why decisions can't be dropped or coalesced), and how to override.
The previous design used two separate `AtomicU64`s and snapshotted via two independent `swap` calls — the doc comment claimed this was atomic across the pair, but it isn't: an RPC completing between the two swaps inflates the next snapshot by one event, biasing the failure-percentage computation slightly under contention. Pack both counters into one `AtomicU64` (high 32 bits: successes, low 32 bits: failures). `record_*` becomes a single `fetch_add` (same hot- path cost as before), `snapshot_and_reset` becomes a single `swap(0)`, and the snapshot is now genuinely atomic across the pair — matching the bucket-swap semantics the gRFC describes. Each counter is capped at `u32::MAX` per sweep interval. Exceeding it would carry into the other counter's bits, but the cap is unreachable for realistic workloads (> 4 × 10⁹ RPCs to one endpoint within one interval). Documented on the struct.
Guard the `100 * failure / total` division against `total == 0`. gRFC A50 doesn't forbid `request_volume == 0`, in which case the qualifying filter `c.total >= request_volume` admits candidates with zero traffic; the spec is silent on `0/0`, so skip those endpoints rather than panic.
…tests Drop the test-only `sort` helper that compared `EjectionDecision`s by their `Debug` string representation, which was fragile (any change to the `Debug` impl would silently change ordering). Derive `PartialOrd` and `Ord` on `EjectionDecision` (and on `EndpointAddress` / `EndpointHost`, since the address is the inner field) and call `Vec::sort` directly at the one test site.
When an already-ejected endpoint has in-flight RPCs that complete during its ejection backoff, those completions accumulate on its counter. At the next sweep the algorithm may "re-eject" the host (refreshing its `ejected_at` timestamp and bumping the multiplier). That action does not change the count of currently-ejected addresses, so per A50's `max_ejection_percent` check it must not consume a slot in the cap — but the previous code decremented the budget for it, under-counting how many *new* ejections the cap allows. Track the pre-sweep ejection state on each `Candidate` and only decrement the budget for new ejections in the failure-percentage algorithm. Add a regression test covering the specific scenario.
Replace the spawned sweep loop + mpsc channel with an on-demand model:
the detector exposes `maybe_run_sweep(&mut self, now: Instant) -> Vec
<EjectionDecision>` and the consumer (the load balancer in a follow-up
PR) calls it from its own event loop — typically `poll_ready` —
gated by wallclock time.
This eliminates a significant amount of machinery:
- `tokio::spawn`, `sweep_loop`, `AbortOnDrop`, the mpsc channel.
- The bounded-channel capacity option, its constant, and its docs
(`OutlierDetectorOptions::decisions_channel_capacity`,
`DEFAULT_DECISIONS_CHANNEL_CAPACITY`).
- `OutlierDetectorOptions` itself — collapses to two constructors
`new(config)` and `with_rng(config, rng)`.
- The `Mutex` on `state` — the consumer's `&mut self` already
serializes access.
- Two `#[tokio::test(start_paused = true)]` tests that exercised the
spawned task and its abort handle.
Sweep timing now depends on RPC traffic: when no RPCs flow, no sweeps
run. This matches A50's intent (sweeps happen approximately every
`interval` while traffic is flowing) and is observably equivalent
because ejection only matters during endpoint picking, which only
happens during RPCs. Suggested by the PR review.
Tests:
- All algorithm-level tests rewritten to use owned `OutlierDetector`
+ `&mut self` calls, no `Mutex::lock()`, no Arc.
- Three new `maybe_run_sweep_*` tests cover the interval gate:
runs on first call, skips before interval elapsed, runs after.
- Existing failure-percentage and multiplier/un-ejection tests
unchanged in spirit; just adjusted to the new ownership model.
Pass through every doc comment and inline comment, removing rationale,
timeline language, and explanations that don't help a future reader.
Notable trims:
- Module docstring drops "Knows nothing about the data path:" framing,
the "lands in a follow-up PR" timeline (regression — flagged and
removed earlier on a different doc), and the "(mean and standard
deviation across the qualifying hosts)" parenthetical.
- `Rng` trait drops the "Abstracted so tests can inject" rationale.
- `OutlierDetector` struct drops "State is owned (no `Mutex`, no
`Arc`):" framing.
- `add_endpoint` / `remove_endpoint` / `with_rng` lose the trailing
usage hints / explanatory parentheticals.
- `maybe_run_sweep` / `run_sweep` tightened to facts-only.
- Inline comments inside `run_sweep` drop "we model that" and
"intentionally not yet dispatched in this PR" timeline.
- Inline comment for the budget-decrement guard now points at
`Candidate::already_ejected` instead of duplicating its doc.
- Test `already_ejected_re_ejection_does_not_consume_budget` drops
the "this would fail before the fix" git-history paragraph.
Eject signals previously rode an unbounded mpsc from the sweep to the LB — one event per ejection. Switch to a `watch` channel carrying the full snapshot of ejected addresses (`Arc<HashSet<EndpointAddress>>`). The sweep rebuilds and broadcasts the snapshot at end of tick, but only when the ejected-set membership actually changed, gated by an O(1) compare against an `AtomicU64` version counter incremented in `try_eject`/`note_uneject`/`remove_channel` of an ejected entry. Quiet ticks (the vast majority) skip the rebuild and broadcast entirely. On a tick that did change the set, the sweep iterates the DashMap once to collect ejected addresses and sends an `Arc` so each receiver clone is cheap. The LB reads the latest snapshot via `tokio_stream::wrappers::WatchStream::poll_next`, iterates it, and for any address still in `ready` moves it into `ejected` (with the existing remaining-ejection / past-deadline handling). Addresses that left the snapshot are echoes of the LB's own uneject path and require no action. Replaces "unbounded queue" with "snapshot-of-current-state," matching A50's mental model of "the ejected set is what matters." Quiet-tick work drops to one atomic load + compare.
OutlierStatsRegistry::record_outcome was pure delegation to the channel state — no registry-level mutation, no reason for the intermediate hop. Move the entry point to OutlierChannelState and expose ReadyChannel::record_outcome(success) that derefs the held Arc internally. LoadBalancer::call's data path no longer extracts the Arc<OutlierChannelState> or holds an Arc<OutlierStatsRegistry>; it just clones the ReadyChannel and calls record_outcome on it. Closes the last hot-path site where the LB peeked at the registry/state for RPC bookkeeping.
|
Hmm... Seems this might have added a flaky test. I got a test failure on #2685 . Re-running passed. |
I'll take a look! |
…#2689) ## Summary A flaky test was reported ([link](#2619 (comment))) after #2619 was merged. The root cause is: `tokio::time::interval(period)` fires immediately on its first poll, so the outlier-detection housekeeping actor was running `run_housekeeping` at t≈0 — before the data path had recorded anything. That early sweep calls `snapshot_and_reset` and zeros whatever counters have landed. Therefore, `test_outlier_detection_reinsert_preserves_state` flakes with `(0, 0) != (3, 0)` when the actor's first tick races the assertion read. Fix: consume the immediate tick before entering the loop so the first real sweep fires one full `interval` after spawn, matching the gRFC A50 semantics. ## Testing \`cargo test -p tonic-xds --lib test_outlier_detection_reinsert_preserves_state\` — 10/10 pass under stress.
|
I found a flake on master where the logs are still intact. (But since you've fixed the flake, this doesn't really matter now.) |
Yeah, I'm looking at it now, and looks like there is another flaky test "test_web::grpc smoke_server_stream" https://github.com/grpc/grpc-rust/actions/runs/27624202442/job/81681297096 . Will keep monitoring. |
|
Yeah, I noticed the |
Summary
Implements gRFC A50: xDS Outlier Detection (failure-percentage algorithm) in
tonic-xdsand integrates it intoLoadBalancer. Config types landed in #2604.Architecture
OutlierStatsRegistry::record_outcome, called fromLoadBalancer::call) — increments the picked channel's success/failure counter. Nothing else.OutlierStatsRegistry::run_housekeeping, called by the housekeeping actor on eachconfig.intervaltick) — snapshots all channel counters, runs the failure-percentage algorithm against the snapshot population (applyingminimum_hosts,max_ejection_percent, threshold, enforcement roll), dispatches eject addresses on an mpsc, then resets counters and decrements multipliers for non-ejected channels.poll_ready, ejects viaReadyChannel::eject, tracks the resultingEjectedChannelinKeyedFutures<_, UnejectedChannel<_>>. The picker only seesready, so ejected channels are unpickable by construction.Un-ejection is timer-driven per channel: each
EjectedChannel'sSleepfires atmin(base × multiplier, max(base, max_ejection_time))and yields anUnejectedChannel; the LB routes the resolved channel back toready.Constructor interface
LoadBalancer::newtakesArc<ArcSwap<OutlierDetectionConfig>>.OutlierDetectionConfig::default()is the disabled config (both algorithmsNone) — no actor spawned,record_outcomeshort-circuits at the per-state counter increment. TheArcSwapshape reserves the slot for the future xDS-driven config-update path.A50 compliance
>against the threshold.max_ejection_percentfloors at 1 for non-empty pools (spec: "will eject at least one address regardless of the value").Change::Insertfor an already-tracked address.Deferred
ArcSwap::storeis supported but the actor doesn't observe swaps yet).ClusterResourceinto LB construction.Test plan
cargo test -p tonic-xds --lib --all-features— 324 lib tests passcargo fmt -p tonic-xdscargo clippyclean on changed files