Skip to content

feat(tonic-xds): implement gRFC A50 outlier detection (failure-percentage) + LoadBalancer integration#2619

Merged
gu0keno0 merged 56 commits into
grpc:masterfrom
LYZJU2019:lyzju2019/a50-outlier-detector
Jun 3, 2026
Merged

feat(tonic-xds): implement gRFC A50 outlier detection (failure-percentage) + LoadBalancer integration#2619
gu0keno0 merged 56 commits into
grpc:masterfrom
LYZJU2019:lyzju2019/a50-outlier-detector

Conversation

@LYZJU2019

@LYZJU2019 LYZJU2019 commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

Summary

Implements gRFC A50: xDS Outlier Detection (failure-percentage algorithm) in tonic-xds and integrates it into LoadBalancer. Config types landed in #2604.

Architecture

  • Data path (OutlierStatsRegistry::record_outcome, called from LoadBalancer::call) — increments the picked channel's success/failure counter. Nothing else.
  • Sweep (OutlierStatsRegistry::run_housekeeping, called by the housekeeping actor on each config.interval tick) — snapshots all channel counters, runs the failure-percentage algorithm against the snapshot population (applying minimum_hosts, max_ejection_percent, threshold, enforcement roll), dispatches eject addresses on an mpsc, then resets counters and decrements multipliers for non-ejected channels.
  • Load balancer — drains the eject mpsc in poll_ready, ejects via ReadyChannel::eject, tracks the resulting EjectedChannel in KeyedFutures<_, UnejectedChannel<_>>. The picker only sees ready, so ejected channels are unpickable by construction.

Un-ejection is timer-driven per channel: each EjectedChannel's Sleep fires at min(base × multiplier, max(base, max_ejection_time)) and yields an UnejectedChannel; the LB routes the resolved channel back to ready.

Constructor interface

LoadBalancer::new takes Arc<ArcSwap<OutlierDetectionConfig>>. OutlierDetectionConfig::default() is the disabled config (both algorithms None) — no actor spawned, record_outcome short-circuits at the per-state counter increment. The ArcSwap shape reserves the slot for the future xDS-driven config-update path.

A50 compliance

  • Algorithm runs at the interval sweep, not per RPC (§6).
  • Failure-percentage uses strict > against the threshold.
  • Multiplier decrements at the same transition that un-ejects (§6.b).
  • max_ejection_percent floors at 1 for non-empty pools (spec: "will eject at least one address regardless of the value").
  • Outlier state survives Change::Insert for an already-tracked address.

Deferred

  • Success-rate algorithm.
  • Live config-update plumbing (ArcSwap::store is supported but the actor doesn't observe swaps yet).
  • Wiring from ClusterResource into LB construction.

Test plan

  • cargo test -p tonic-xds --lib --all-features — 324 lib tests pass
  • cargo fmt -p tonic-xds
  • cargo clippy clean on changed files

Implement the core gRFC A50 outlier-detection algorithm: per-endpoint
success/failure counters, the success-rate and failure-percentage
ejection algorithms, the ejection-multiplier state machine, and a
periodic sweep task that emits ejection/un-ejection decisions on a
channel.

`run_sweep` is pure (returns a Vec<EjectionDecision>); the sweep loop
spawned by `OutlierDetector::spawn` owns the channel sender and
forwards decisions, so dropping the returned `AbortOnDrop` ends the
loop and closes the receiver. Tests drive `run_sweep` directly without
the channel or tokio time mechanics.

Algorithm coverage matches the gRFC:
  - Success-rate ejection with configurable `stdev_factor`,
    `enforcing_success_rate`, `minimum_hosts`, `request_volume`.
  - Failure-percentage ejection with `threshold`, `enforcing_failure_
    percentage`, `minimum_hosts`, `request_volume`.
  - Ejection multiplier increments on each ejection, decays on healthy
    intervals; ejection duration is `base * multiplier` capped at
    `max(base, max_ejection_time)`.
  - `max_ejection_percent` caps total concurrent ejections.

Probability rolls go through an injectable `Rng` trait (defaulting to
`fastrand`) so tests can pin enforcement decisions.

Standalone in this PR — no integration with the load balancer yet.
That lands in a follow-up alongside the per-endpoint outcome
interception layer.

Refs: https://github.com/grpc/proposal/blob/master/A50-xds-outlier-detection.md
Address two follow-up review comments from grpc#2604 (the merged config
PR) by folding the doc updates into this PR:

- Module docstring: describe the actual integration plan (an mpsc
  channel of EjectionDecisions polled by LoadBalancer, leveraging
  EjectedChannel) instead of the original "filter on the Discover
  stream" wording. Add intra-doc links to the relevant types.

- enforcing_success_rate / enforcing_failure_percentage: clarify
  that each is the *enforcement probability* — distinct from the
  per-algorithm threshold (stdev_factor for success-rate, threshold
  for failure-percentage). Note that 0 disables enforcement while
  still computing statistics.

Also fix an unresolved intra-doc link in the algorithm module.
Three spec-compliance fixes to `run_sweep` and the failure-percentage
algorithm:

1. Reorder the sweep to match A50 step order: snapshot counters → run
   success-rate algorithm → run failure-percentage algorithm → step-5
   housekeeping (decrement non-ejected multipliers, un-eject elapsed
   ejections). The previous order (un-eject before algorithms) caused
   spurious `Uneject` decisions whenever the same sweep also re-ejected
   the address. Per spec, re-ejection refreshes `ejected_at` to `now`
   before the un-eject check runs, so no transient un-eject is emitted.

2. Drop the `total > 0` traffic gate from the multiplier-decrement
   step. A50 says a non-ejected address with multiplier > 0 has its
   multiplier decremented every sweep, regardless of whether it
   received traffic that interval.

3. Failure-percentage now uses strict `>` against the threshold (was
   `>=`). Per A50: "If the address's failure percentage is greater
   than `failure_percentage_ejection.threshold`..." — an address
   sitting exactly at the threshold is not ejected.

Also: drop the explicit "skip ejected hosts from candidate list" pre-
filter. Per spec the algorithms iterate every address; ejected hosts
naturally fail the `request_volume` gate since they receive no traffic
in production. Behavior on real workloads is unchanged.

Test changes:
  - `re_ejection_doubles_duration` now asserts a single `Eject`
    decision (no transient `Uneject`) under the corrected sweep order.
  - New `failure_percentage_at_threshold_does_not_eject` covers the
    strict-`>` boundary.
  - New `multiplier_decrements_even_without_traffic` covers the
    no-traffic-gate fix.
Drop the success-rate algorithm and its tests from this PR so the
outlier-detection PR is minimal and stand-alone. The scaffolding
(sweep loop, multiplier state, counters, max-ejection-percent budget)
is unchanged and still exercised by the failure-percentage algorithm
plus the multiplier / un-eject / cap tests.

If `OutlierDetectionConfig.success_rate` is set on the cluster, it is
currently ignored. Documented in the module docstring with a pointer
to the follow-up PR.

Removes:
  - `OutlierDetector::run_success_rate` (mean / variance / sqrt math).
  - `success_rate` dispatch in `run_sweep`.
  - `run_failure_percentage`'s `!out.contains` filter — dead now that
    only one algorithm runs per sweep.
  - `success_rate_ejects_outlier_below_threshold` test.
  - `success_rate_no_ejection_when_all_uniform` test.
  - The `sr_config` test helper.
  - Unused `SuccessRateConfig` import.
@LYZJU2019 LYZJU2019 changed the title feat(tonic-xds): add OutlierDetector sweep engine (gRFC A50) feat(tonic-xds): add OutlierDetector sweep engine + failure-percentage algorithm (gRFC A50) Apr 30, 2026
Switch from `mpsc::unbounded_channel` to `mpsc::channel(256)` for the
ejection-decision stream that the sweep loop emits.

The decisions are edge-triggered (`Eject`/`Uneject` transitions, not
state snapshots), so the consumer must process every event in order;
we can't drop or coalesce. But we don't want unbounded memory growth
either if the consumer stalls. A bounded channel gives us:

  - Same correctness as unbounded — no events dropped, ordered delivery.
  - Bounded memory.
  - Natural backpressure: when the buffer fills, `tx.send().await`
    parks the sweep task, which (combined with `MissedTickBehavior::
    Skip`) throttles sweep cadence to whatever rate the consumer can
    drain. Computing more decisions than the consumer can apply just
    widens the desync.

Capacity is 256 — at most `2 * num_endpoints` decisions per sweep, so
this buffers several sweeps' worth of decisions for clusters of typical
size. A docstring on `DECISIONS_CHANNEL_CAPACITY` captures the
rationale for future readers.
Replace `spawn_with_rng` with `spawn_with`, taking an
`OutlierDetectorOptions` struct that bundles the RNG and the new
configurable `decisions_channel_capacity`. Defaults are unchanged
(`fastrand` RNG, capacity 256).

The hard-coded constant becomes `DEFAULT_DECISIONS_CHANNEL_CAPACITY`
and is no longer the only knob — production callers may want to bump
the bound for clusters with very large endpoint sets (worst case
`2 * num_endpoints` decisions per sweep) or unusually slow consumers.

Using a struct instead of a long argument list means future runtime
knobs (custom Tokio runtime, alternate backoff policies, observability
hooks, …) can be added without breaking call sites — callers typically
construct via `..Default::default()`.

The xDS-derived `OutlierDetectionConfig` stays separate from these
host-side runtime knobs, keeping a clean line between "what the xDS
proto specifies" and "how this binary chooses to host it."
…tests

Both `sweep_loop_emits_decisions_on_tick` and
`dropping_abort_stops_sweep_loop` previously used `tokio::time::sleep`
in `start_paused = true` mode. That works through the runtime's
auto-advance heuristic for parked tasks, but the heuristic is sensitive
to the order of pending wake-ups across multiple tasks and can be
flaky in practice.

  - `sweep_loop_emits_decisions_on_tick`: switch to
    `tokio::time::advance(150ms)` which explicitly moves the clock and
    yields until pending wake-ups have been polled — deterministic.
  - `dropping_abort_stops_sweep_loop`: drop the artificial sleep
    altogether. Aborting the JoinHandle wakes the spawned task
    synchronously; the runtime polls it, the harness observes the
    abort, and the task ends — dropping its sender. `rx.recv().await`
    parks briefly while that happens and then returns `None`. No time
    advancement needed.

Stress-tested both tests 50× back-to-back: all pass.
Comment thread tonic-xds/src/client/loadbalance/outlier_detection.rs Outdated
Comment thread tonic-xds/src/client/loadbalance/outlier_detection.rs Outdated
Comment thread tonic-xds/src/client/loadbalance/outlier_detection.rs Outdated
LYZJU2019 added 4 commits May 4, 2026 10:29
Rewrite the doc comment to be reference documentation rather than a
design narrative. Drops the editorializing ("the right behavior") and
the first-person reasoning, keeps the three things a developer needs:
what the constant controls, why this size, what happens at capacity
(and why decisions can't be dropped or coalesced), and how to override.
The previous design used two separate `AtomicU64`s and snapshotted via
two independent `swap` calls — the doc comment claimed this was atomic
across the pair, but it isn't: an RPC completing between the two swaps
inflates the next snapshot by one event, biasing the failure-percentage
computation slightly under contention.

Pack both counters into one `AtomicU64` (high 32 bits: successes, low
32 bits: failures). `record_*` becomes a single `fetch_add` (same hot-
path cost as before), `snapshot_and_reset` becomes a single `swap(0)`,
and the snapshot is now genuinely atomic across the pair — matching
the bucket-swap semantics the gRFC describes.

Each counter is capped at `u32::MAX` per sweep interval. Exceeding it
would carry into the other counter's bits, but the cap is unreachable
for realistic workloads (> 4 × 10⁹ RPCs to one endpoint within one
interval). Documented on the struct.
Comment thread tonic-xds/src/client/loadbalance/outlier_detection.rs Outdated
Comment thread tonic-xds/src/client/loadbalance/outlier_detection.rs Outdated
Comment thread tonic-xds/src/client/loadbalance/outlier_detection.rs Outdated
Comment thread tonic-xds/src/client/loadbalance/outlier_detection.rs Outdated
Guard the `100 * failure / total` division against `total == 0`.
gRFC A50 doesn't forbid `request_volume == 0`, in which case the
qualifying filter `c.total >= request_volume` admits candidates with
zero traffic; the spec is silent on `0/0`, so skip those endpoints
rather than panic.
Comment thread tonic-xds/src/client/loadbalance/outlier_detection.rs Outdated
LYZJU2019 and others added 6 commits May 4, 2026 15:27
…tests

Drop the test-only `sort` helper that compared `EjectionDecision`s by
their `Debug` string representation, which was fragile (any change to
the `Debug` impl would silently change ordering). Derive `PartialOrd`
and `Ord` on `EjectionDecision` (and on `EndpointAddress` /
`EndpointHost`, since the address is the inner field) and call
`Vec::sort` directly at the one test site.
When an already-ejected endpoint has in-flight RPCs that complete
during its ejection backoff, those completions accumulate on its
counter. At the next sweep the algorithm may "re-eject" the host
(refreshing its `ejected_at` timestamp and bumping the multiplier).
That action does not change the count of currently-ejected addresses,
so per A50's `max_ejection_percent` check it must not consume a slot
in the cap — but the previous code decremented the budget for it,
under-counting how many *new* ejections the cap allows.

Track the pre-sweep ejection state on each `Candidate` and only
decrement the budget for new ejections in the failure-percentage
algorithm. Add a regression test covering the specific scenario.
Replace the spawned sweep loop + mpsc channel with an on-demand model:
the detector exposes `maybe_run_sweep(&mut self, now: Instant) -> Vec
<EjectionDecision>` and the consumer (the load balancer in a follow-up
PR) calls it from its own event loop — typically `poll_ready` —
gated by wallclock time.

This eliminates a significant amount of machinery:
  - `tokio::spawn`, `sweep_loop`, `AbortOnDrop`, the mpsc channel.
  - The bounded-channel capacity option, its constant, and its docs
    (`OutlierDetectorOptions::decisions_channel_capacity`,
    `DEFAULT_DECISIONS_CHANNEL_CAPACITY`).
  - `OutlierDetectorOptions` itself — collapses to two constructors
    `new(config)` and `with_rng(config, rng)`.
  - The `Mutex` on `state` — the consumer's `&mut self` already
    serializes access.
  - Two `#[tokio::test(start_paused = true)]` tests that exercised the
    spawned task and its abort handle.

Sweep timing now depends on RPC traffic: when no RPCs flow, no sweeps
run. This matches A50's intent (sweeps happen approximately every
`interval` while traffic is flowing) and is observably equivalent
because ejection only matters during endpoint picking, which only
happens during RPCs. Suggested by the PR review.

Tests:
  - All algorithm-level tests rewritten to use owned `OutlierDetector`
    + `&mut self` calls, no `Mutex::lock()`, no Arc.
  - Three new `maybe_run_sweep_*` tests cover the interval gate:
    runs on first call, skips before interval elapsed, runs after.
  - Existing failure-percentage and multiplier/un-ejection tests
    unchanged in spirit; just adjusted to the new ownership model.
Pass through every doc comment and inline comment, removing rationale,
timeline language, and explanations that don't help a future reader.
Notable trims:

  - Module docstring drops "Knows nothing about the data path:" framing,
    the "lands in a follow-up PR" timeline (regression — flagged and
    removed earlier on a different doc), and the "(mean and standard
    deviation across the qualifying hosts)" parenthetical.
  - `Rng` trait drops the "Abstracted so tests can inject" rationale.
  - `OutlierDetector` struct drops "State is owned (no `Mutex`, no
    `Arc`):" framing.
  - `add_endpoint` / `remove_endpoint` / `with_rng` lose the trailing
    usage hints / explanatory parentheticals.
  - `maybe_run_sweep` / `run_sweep` tightened to facts-only.
  - Inline comments inside `run_sweep` drop "we model that" and
    "intentionally not yet dispatched in this PR" timeline.
  - Inline comment for the budget-decrement guard now points at
    `Candidate::already_ejected` instead of duplicating its doc.
  - Test `already_ejected_re_ejection_does_not_consume_budget` drops
    the "this would fail before the fix" git-history paragraph.
@LYZJU2019 LYZJU2019 marked this pull request as ready for review May 6, 2026 20:50
LYZJU2019 added 6 commits May 29, 2026 14:00
Eject signals previously rode an unbounded mpsc from the sweep to the
LB — one event per ejection. Switch to a `watch` channel carrying the
full snapshot of ejected addresses (`Arc<HashSet<EndpointAddress>>`).
The sweep rebuilds and broadcasts the snapshot at end of tick, but
only when the ejected-set membership actually changed, gated by an
O(1) compare against an `AtomicU64` version counter incremented in
`try_eject`/`note_uneject`/`remove_channel` of an ejected entry.

Quiet ticks (the vast majority) skip the rebuild and broadcast
entirely. On a tick that did change the set, the sweep iterates the
DashMap once to collect ejected addresses and sends an `Arc` so each
receiver clone is cheap. The LB reads the latest snapshot via
`tokio_stream::wrappers::WatchStream::poll_next`, iterates it, and
for any address still in `ready` moves it into `ejected` (with the
existing remaining-ejection / past-deadline handling). Addresses that
left the snapshot are echoes of the LB's own uneject path and require
no action.

Replaces "unbounded queue" with "snapshot-of-current-state," matching
A50's mental model of "the ejected set is what matters." Quiet-tick
work drops to one atomic load + compare.
Comment thread tonic-xds/src/client/loadbalance/loadbalancer.rs Outdated
Comment thread tonic-xds/src/client/loadbalance/loadbalancer.rs Outdated
Comment thread tonic-xds/src/client/loadbalance/outlier_detection.rs Outdated
Comment thread tonic-xds/src/client/loadbalance/outlier_detection.rs
Comment thread tonic-xds/src/client/loadbalance/outlier_detection.rs
Comment thread tonic-xds/src/client/loadbalance/outlier_detection.rs
LYZJU2019 added 5 commits May 29, 2026 19:20
OutlierStatsRegistry::record_outcome was pure delegation to the
channel state — no registry-level mutation, no reason for the
intermediate hop. Move the entry point to OutlierChannelState and
expose ReadyChannel::record_outcome(success) that derefs the held
Arc internally.

LoadBalancer::call's data path no longer extracts the
Arc<OutlierChannelState> or holds an Arc<OutlierStatsRegistry>; it
just clones the ReadyChannel and calls record_outcome on it. Closes
the last hot-path site where the LB peeked at the registry/state for
RPC bookkeeping.
Comment thread tonic-xds/src/client/loadbalance/loadbalancer.rs Outdated
@gu0keno0 gu0keno0 merged commit c528938 into grpc:master Jun 3, 2026
24 checks passed
@ejona86

ejona86 commented Jun 15, 2026

Copy link
Copy Markdown
Member

Hmm... Seems this might have added a flaky test. I got a test failure on #2685 . Re-running passed.

    running 1 test
    test client::loadbalance::loadbalancer::tests::test_outlier_detection_reinsert_preserves_state ... FAILED

    failures:

    failures:
        client::loadbalance::loadbalancer::tests::test_outlier_detection_reinsert_preserves_state

    test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 335 filtered out; finished in 0.00s

  stderr ───

    thread 'client::loadbalance::loadbalancer::tests::test_outlier_detection_reinsert_preserves_state' (7160) panicked at tonic-xds/src/client/loadbalance/loadbalancer.rs:1017:9:
    assertion `left == right` failed: counters must survive re-insert
      left: (0, 0)
     right: (3, 0)
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@gu0keno0

Copy link
Copy Markdown
Contributor

Hmm... Seems this might have added a flaky test. I got a test failure on #2685 . Re-running passed.

    running 1 test
    test client::loadbalance::loadbalancer::tests::test_outlier_detection_reinsert_preserves_state ... FAILED

    failures:

    failures:
        client::loadbalance::loadbalancer::tests::test_outlier_detection_reinsert_preserves_state

    test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 335 filtered out; finished in 0.00s

  stderr ───

    thread 'client::loadbalance::loadbalancer::tests::test_outlier_detection_reinsert_preserves_state' (7160) panicked at tonic-xds/src/client/loadbalance/loadbalancer.rs:1017:9:
    assertion `left == right` failed: counters must survive re-insert
      left: (0, 0)
     right: (3, 0)
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I'll take a look!

gu0keno0 pushed a commit that referenced this pull request Jun 16, 2026
…#2689)

## Summary

A flaky test was reported
([link](#2619 (comment)))
after #2619 was merged.

The root cause is: `tokio::time::interval(period)` fires immediately on
its first poll, so the outlier-detection housekeeping actor was running
`run_housekeeping` at t≈0 — before the data path had recorded anything.
That early sweep calls `snapshot_and_reset` and zeros whatever counters
have landed. Therefore,
`test_outlier_detection_reinsert_preserves_state` flakes with `(0, 0) !=
(3, 0)` when the actor's first tick races the assertion read.

Fix: consume the immediate tick before entering the loop so the first
real sweep fires one full `interval` after spawn, matching the gRFC A50
semantics.

## Testing

\`cargo test -p tonic-xds --lib
test_outlier_detection_reinsert_preserves_state\` — 10/10 pass under
stress.
@ejona86

ejona86 commented Jun 16, 2026

Copy link
Copy Markdown
Member

I found a flake on master where the logs are still intact.
https://github.com/grpc/grpc-rust/actions/runs/26962845254/job/79557584053

(But since you've fixed the flake, this doesn't really matter now.)

@gu0keno0

Copy link
Copy Markdown
Contributor

I found a flake on master where the logs are still intact. https://github.com/grpc/grpc-rust/actions/runs/26962845254/job/79557584053

(But since you've fixed the flake, this doesn't really matter now.)

Yeah, I'm looking at it now, and looks like there is another flaky test "test_web::grpc smoke_server_stream" https://github.com/grpc/grpc-rust/actions/runs/27624202442/job/81681297096 . Will keep monitoring.

@ejona86

ejona86 commented Jun 16, 2026

Copy link
Copy Markdown
Member

Yeah, I noticed the smoke_server_stream test failure, but it seemed it might be an older flake with a low flake rate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants