feat(tonic-xds): add OutlierDetectionConfig types (gRFC A50) #2604
Merged
gu0keno0 merged 3 commits on Apr 30, 2026
Define the validated config types consumed by the outlier-detection algorithm: OutlierDetectionConfig with the global timing/percentage parameters, plus SuccessRateConfig and FailurePercentageConfig for the two ejection algorithms. This PR contains only the type definitions. Proto parsing from envoy.config.cluster.v3.OutlierDetection and the ClusterResource field land in a follow-up PR alongside the load-balancing-pipeline wiring, keeping the algorithm PR self-contained and easy to review. Refs: https://github.com/grpc/proposal/blob/master/A50-xds-outlier-detection.md
e58a837 to
9229eaa
Compare
ankurmittal
reviewed
Apr 27, 2026
ankurmittal
reviewed
Apr 27, 2026
ankurmittal
reviewed
Apr 27, 2026
YutaoMa
reviewed
Apr 27, 2026
- Trim module docstring; drop "lands in a follow-up PR" framing.
- Note in the docstring why there is no `child_policy` yet (tonic-xds has only one balancer; the field will land alongside more balancers).
- Rename `enforcement_percentage` → `enforcing_success_rate` / `enforcing_failure_percentage` to match the Envoy proto field names.
- Introduce a local `Percentage(u8)` newtype with a fallible constructor and use it for `max_ejection_percent`, `enforcing_success_rate`, `threshold`, and `enforcing_failure_percentage` so the 0..=100 invariant is enforced through the type system. Add tests covering the constructor's range checks.
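To illustrate the last bullet, here is a minimal sketch of what a `Percentage(u8)` newtype with a fallible constructor could look like. Only the name `Percentage(u8)` and the 0..=100 invariant come from the commit message; the `u32` input type, the `PercentageOutOfRange` error, and the `get` accessor are assumptions made for illustration and may not match the merged code.

```rust
/// Hypothetical sketch of a percentage newtype; the merged definition may differ.
/// The inner field is private, so a value can only be built through `new`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Percentage(u8);

/// Hypothetical error type carrying the rejected value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PercentageOutOfRange(pub u32);

impl Percentage {
    /// Accepts a `u32` (assumed here because proto percentages are
    /// typically 32-bit) and rejects anything above 100.
    pub fn new(value: u32) -> Result<Self, PercentageOutOfRange> {
        if value <= 100 {
            // The cast cannot truncate: value is at most 100.
            Ok(Percentage(value as u8))
        } else {
            Err(PercentageOutOfRange(value))
        }
    }

    /// Returns the validated value, always in 0..=100.
    pub fn get(self) -> u8 {
        self.0
    }
}
```

With this shape, `Percentage::new(100)` succeeds and `Percentage::new(101)` returns an error, so code that receives a `Percentage` never needs to re-check the range.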
gu0keno0
approved these changes
Apr 29, 2026
Comment on lines +8 to +9
//! load balancer and integrates outlier detection as a filter on the
//! `Discover` stream feeding it, so there is no `child_policy` field
Contributor
Contributor
Author
/// An endpoint is a candidate for ejection when its success rate falls
/// below `mean - stdev * (stdev_factor / 1000.0)`.
pub stdev_factor: u32,
/// Probability that a candidate is actually ejected.
Contributor
I think this is a threshold for ejecting a host if RPC success rate drops below it?
gu0keno0
reviewed
Apr 29, 2026
gu0keno0
left a comment
Contributor
We should support consecutive 5xx as well. It's not in the original gRPC spec, but it is useful.
gRPC statuses are not HTTP error codes; however, Envoy xDS has an error-matcher config (https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/upstreams/http/v3/http_protocol_options.proto#extensions-upstreams-http-v3-httpprotocoloptions-outlierdetection) that lets us specify how to map headers to 5xx.
I think you can proceed with basic A50 and then add support for consecutive 5xx.
LYZJU2019
added a commit
to LYZJU2019/tonic
that referenced
this pull request
Apr 30, 2026
Address two follow-up review comments from grpc#2604 (the merged config PR) by folding the doc updates into this PR:
- Module docstring: describe the actual integration plan (an mpsc channel of EjectionDecisions polled by LoadBalancer, leveraging EjectedChannel) instead of the original "filter on the Discover stream" wording. Add intra-doc links to the relevant types.
- enforcing_success_rate / enforcing_failure_percentage: clarify that each is the *enforcement probability*, distinct from the per-algorithm threshold (stdev_factor for success-rate, threshold for failure-percentage). Note that 0 disables enforcement while still computing statistics.

Also fix an unresolved intra-doc link in the algorithm module.
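The distinction between the per-algorithm threshold and the enforcement probability can be made concrete with a small sketch. The formula `mean - stdev * (stdev_factor / 1000.0)` is taken from the doc comment under review. The function names, the use of a population standard deviation, and the caller-supplied `roll` in `0..100` are assumptions made for illustration, not the crate's actual API.

```rust
/// Success-rate ejection threshold: mean - stdev * (stdev_factor / 1000.0).
/// Assumption: `success_rates` holds one value in 0.0..=1.0 per eligible
/// endpoint, and the standard deviation is the population form.
pub fn success_rate_threshold(success_rates: &[f64], stdev_factor: u32) -> Option<f64> {
    if success_rates.is_empty() {
        return None;
    }
    let n = success_rates.len() as f64;
    let mean = success_rates.iter().sum::<f64>() / n;
    let variance = success_rates.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / n;
    Some(mean - variance.sqrt() * (f64::from(stdev_factor) / 1000.0))
}

/// Two separate checks: the threshold decides whether an endpoint is a
/// *candidate*; the enforcement percentage decides whether a candidate is
/// *actually ejected*. `roll` is assumed to be uniform in 0..100 and is
/// passed in so the function stays deterministic.
pub fn should_eject(success_rate: f64, threshold: f64, enforcing_success_rate: u8, roll: u8) -> bool {
    let is_candidate = success_rate < threshold;
    // With enforcing_success_rate == 0 this is never true, so statistics
    // can still be computed while nothing is ejected.
    is_candidate && roll < enforcing_success_rate
}
```

Under these assumptions, an enforcement value of 100 ejects every candidate, while 0 leaves the candidate computation intact and ejects none.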
gu0keno0
pushed a commit
that referenced
this pull request
Jun 3, 2026
…tage) + LoadBalancer integration (#2619)

## Summary

Implements [gRFC A50: xDS Outlier Detection](https://github.com/grpc/proposal/blob/master/A50-xds-outlier-detection.md) (failure-percentage algorithm) in `tonic-xds` and integrates it into `LoadBalancer`. Config types landed in #2604.

## Architecture

- **Data path** (`OutlierStatsRegistry::record_outcome`, called from `LoadBalancer::call`): increments the picked channel's success/failure counter. Nothing else.
- **Sweep** (`OutlierStatsRegistry::run_housekeeping`, called by the housekeeping actor on each `config.interval` tick): snapshots all channel counters, runs the failure-percentage algorithm against the snapshot population (applying `minimum_hosts`, `max_ejection_percent`, threshold, enforcement roll), dispatches eject addresses on an mpsc, then resets counters and decrements multipliers for non-ejected channels.
- **Load balancer**: drains the eject mpsc in `poll_ready`, ejects via `ReadyChannel::eject`, tracks the resulting `EjectedChannel` in `KeyedFutures<_, UnejectedChannel<_>>`. The picker only sees `ready`, so ejected channels are unpickable by construction.

Un-ejection is timer-driven per channel: each `EjectedChannel`'s `Sleep` fires at `min(base × multiplier, max(base, max_ejection_time))` and yields an `UnejectedChannel`; the LB routes the resolved channel back to `ready`.

## Constructor interface

`LoadBalancer::new` takes `Arc<ArcSwap<OutlierDetectionConfig>>`. `OutlierDetectionConfig::default()` is the disabled config (both algorithms `None`): no actor spawned, `record_outcome` short-circuits at the per-state counter increment. The `ArcSwap` shape reserves the slot for the future xDS-driven config-update path.

## A50 compliance

- Algorithm runs at the interval sweep, not per RPC (§6).
- Failure-percentage uses strict `>` against the threshold.
- Multiplier decrements at the same transition that un-ejects (§6.b).
- `max_ejection_percent` floors at 1 for non-empty pools (spec: "will eject at least one address regardless of the value").
- Outlier state survives `Change::Insert` for an already-tracked address.

## Deferred

- Success-rate algorithm.
- Live config-update plumbing (`ArcSwap::store` is supported but the actor doesn't observe swaps yet).
- Wiring from `ClusterResource` into LB construction.

## Test plan

- [x] `cargo test -p tonic-xds --lib --all-features`: 324 lib tests pass
- [x] `cargo fmt -p tonic-xds`
- [x] `cargo clippy` clean on changed files

---------

Co-authored-by: YtMa <ytma98@gmail.com>
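The sweep rules in the commit message above (request-volume and `minimum_hosts` gating, the strict `>` comparison, the floor of one ejection, and the un-ejection timer) can be sketched as three small pure functions. The formula `min(base × multiplier, max(base, max_ejection_time))` is quoted from the message; the `Snapshot` struct, the function names, and the reading of `max_ejection_percent` as a cap on the ejected count are assumptions for illustration and are not `OutlierStatsRegistry`'s real interface.

```rust
use std::time::Duration;

/// Hypothetical per-address counters captured at a sweep.
pub struct Snapshot {
    pub successes: u64,
    pub failures: u64,
}

/// Cap on simultaneously ejected addresses; floors at 1 for non-empty pools.
pub fn max_ejections(pool_size: usize, max_ejection_percent: u8) -> usize {
    if pool_size == 0 {
        return 0;
    }
    (pool_size * usize::from(max_ejection_percent) / 100).max(1)
}

/// Indices of addresses whose failure percentage strictly exceeds `threshold`.
/// Only addresses with at least `request_volume` requests take part, and the
/// algorithm is skipped if fewer than `minimum_hosts` of them qualify.
pub fn failure_percentage_candidates(
    snapshots: &[Snapshot],
    threshold: u8,
    minimum_hosts: usize,
    request_volume: u64,
) -> Vec<usize> {
    let eligible: Vec<usize> = snapshots
        .iter()
        .enumerate()
        .filter(|(_, s)| s.successes + s.failures >= request_volume)
        .map(|(i, _)| i)
        .collect();
    if eligible.len() < minimum_hosts {
        return Vec::new();
    }
    eligible
        .into_iter()
        .filter(|&i| {
            let s = &snapshots[i];
            let total = s.successes + s.failures;
            // failures / total * 100 > threshold, in exact integer arithmetic.
            total > 0 && s.failures * 100 > u64::from(threshold) * total
        })
        .collect()
}

/// Ejection duration: min(base × multiplier, max(base, max_ejection_time)).
pub fn ejection_duration(base: Duration, multiplier: u32, max_ejection_time: Duration) -> Duration {
    base.saturating_mul(multiplier).min(base.max(max_ejection_time))
}
```

The enforcement roll would then be applied to each candidate, up to the cap, before an address is sent on the eject channel; it is left out here to keep the sketch deterministic.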
Summary
First of a 4-PR series implementing gRFC A50: xDS Outlier Detection in tonic-xds. This PR is type definitions only, with no behavior change.

Defines three validated config types that the algorithm will consume:
- `OutlierDetectionConfig`: sweep interval, base/max ejection time, max ejection percent, plus the two optional sub-configs.
- `SuccessRateConfig`: `stdev_factor`, `enforcement_percentage`, `minimum_hosts`, `request_volume`.
- `FailurePercentageConfig`: `threshold`, `enforcement_percentage`, `minimum_hosts`, `request_volume`.
- `OutlierDetectionConfig::is_enabled()`: true iff at least one ejection algorithm is configured.

Sub-configs are `Option<_>` so the absence of either algorithm is part of the type, not a sentinel value.

Why split this out
The full A50 implementation pulls in (a) proto parsing, (b) the statistical algorithms + sweep engine, (c) per-RPC outcome interception, and (d) wiring into the load-balancing pipeline. Reviewing them in one PR is hard. This series:
- … `Service` wrapper).
- … `ClusterResource`, wire-up in `XdsClusterDiscovery` (preserving connections on ejection), end-to-end tests, and the `lib.rs` feature-table update.

Test plan
- `cargo test -p tonic-xds --lib outlier`: 3 tests for `is_enabled()`.
- `cargo fmt -p tonic-xds --check` clean.
- `cargo clippy -p tonic-xds --lib --all-features -- -D warnings` clean.
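As a rough picture of the types listed in the Summary, the following sketch shows how `Option<_>` sub-configs make "algorithm not configured" part of the type and how `is_enabled()` falls out of that. The struct and method names come from the description, and the sub-config field names follow it as well (the review later renamed `enforcement_percentage`). The top-level field names, the integer widths, and the use of `std::time::Duration` are assumptions for illustration only.

```rust
use std::time::Duration;

/// Sketch of the success-rate sub-config; field widths are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct SuccessRateConfig {
    pub stdev_factor: u32,
    pub enforcement_percentage: u8,
    pub minimum_hosts: u32,
    pub request_volume: u32,
}

/// Sketch of the failure-percentage sub-config; field widths are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct FailurePercentageConfig {
    pub threshold: u8,
    pub enforcement_percentage: u8,
    pub minimum_hosts: u32,
    pub request_volume: u32,
}

/// Sketch of the top-level config; these field names are hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub struct OutlierDetectionConfig {
    pub interval: Duration,
    pub base_ejection_time: Duration,
    pub max_ejection_time: Duration,
    pub max_ejection_percent: u8,
    /// `None` means the success-rate algorithm is not configured.
    pub success_rate: Option<SuccessRateConfig>,
    /// `None` means the failure-percentage algorithm is not configured.
    pub failure_percentage: Option<FailurePercentageConfig>,
}

impl OutlierDetectionConfig {
    /// True iff at least one ejection algorithm is configured.
    pub fn is_enabled(&self) -> bool {
        self.success_rate.is_some() || self.failure_percentage.is_some()
    }
}
```

Because absence is encoded as `None`, callers cannot mistake a zeroed sub-config for a disabled one, which is the point the description makes about avoiding sentinel values.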