feat(tonic-xds): add OutlierDetectionConfig types (gRFC A50) #2604

Merged
gu0keno0 merged 3 commits into grpc:master from LYZJU2019:lyzju2019/a50-outlier-detection-config on Apr 30, 2026

@LYZJU2019 LYZJU2019 commented Apr 24, 2026

Summary

First of a 4-PR series implementing gRFC A50: xDS Outlier Detection in tonic-xds. This PR is type definitions only — no behavior change.

Defines three validated config types that the algorithm will consume:

  • OutlierDetectionConfig: sweep interval, base/max ejection time, max ejection percent, plus the two optional sub-configs.
  • SuccessRateConfig: stdev_factor, enforcement_percentage, minimum_hosts, request_volume.
  • FailurePercentageConfig: threshold, enforcement_percentage, minimum_hosts, request_volume.
  • OutlierDetectionConfig::is_enabled(): true iff at least one ejection algorithm is configured.

Sub-configs are Option<_> so the absence of either algorithm is part of the type, not a sentinel value.
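
As a rough illustration of the shape described above, the three types might look like the sketch below. The type names, the sub-config field names, and `is_enabled()` come from this description; the top-level field names (`interval`, `base_ejection_time`, `success_rate`, and so on), the integer widths, and the derives are assumptions made for the example, not the code in this PR.

```rust
use std::time::Duration;

/// Success-rate algorithm parameters (field types are assumed).
#[derive(Debug, Clone, PartialEq)]
pub struct SuccessRateConfig {
    pub stdev_factor: u32,
    pub enforcement_percentage: u8,
    pub minimum_hosts: u32,
    pub request_volume: u32,
}

/// Failure-percentage algorithm parameters (field types are assumed).
#[derive(Debug, Clone, PartialEq)]
pub struct FailurePercentageConfig {
    pub threshold: u8,
    pub enforcement_percentage: u8,
    pub minimum_hosts: u32,
    pub request_volume: u32,
}

/// Global parameters plus the two optional algorithms.
/// Top-level field names here are hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub struct OutlierDetectionConfig {
    pub interval: Duration,
    pub base_ejection_time: Duration,
    pub max_ejection_time: Duration,
    pub max_ejection_percent: u8,
    // `None` means the algorithm is not configured; no sentinel value needed.
    pub success_rate: Option<SuccessRateConfig>,
    pub failure_percentage: Option<FailurePercentageConfig>,
}

impl OutlierDetectionConfig {
    /// True iff at least one ejection algorithm is configured.
    pub fn is_enabled(&self) -> bool {
        self.success_rate.is_some() || self.failure_percentage.is_some()
    }
}
```

With this layout, a config whose two sub-configs are both `None` reports `is_enabled() == false` without any field needing a magic "off" value.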

Why split this out

The full A50 implementation pulls in (a) proto parsing, (b) the statistical algorithms + sweep engine, (c) per-RPC outcome interception, and (d) wiring into the load-balancing pipeline. Reviewing them in one PR is hard. This series:

  1. This PR — config types only.
  2. Outlier-detection algorithm + sweep engine (pure, unit-tested in isolation).
  3. Per-endpoint RPC outcome interception (a Service wrapper).
  4. Proto parsing into ClusterResource, wire-up in XdsClusterDiscovery (preserving connections on ejection), end-to-end tests, and the lib.rs feature-table update.

Test plan

  • cargo test -p tonic-xds --lib outlier — 3 tests for is_enabled().
  • cargo fmt -p tonic-xds --check clean.
  • cargo clippy -p tonic-xds --lib --all-features -- -D warnings clean.

Define the validated config types consumed by the outlier-detection
algorithm: OutlierDetectionConfig with the global timing/percentage
parameters, plus SuccessRateConfig and FailurePercentageConfig for the
two ejection algorithms.

This PR contains only the type definitions. Proto parsing from
envoy.config.cluster.v3.OutlierDetection and the ClusterResource field
land in a follow-up PR alongside the load-balancing-pipeline wiring,
keeping the algorithm PR self-contained and easy to review.

Refs: https://github.com/grpc/proposal/blob/master/A50-xds-outlier-detection.md
@LYZJU2019 LYZJU2019 force-pushed the lyzju2019/a50-outlier-detection-config branch from e58a837 to 9229eaa April 24, 2026 21:43
@LYZJU2019 LYZJU2019 changed the title from "feat(tonic-xds): parse Cluster.outlier_detection config (gRFC A50)" to "feat(tonic-xds): add OutlierDetectionConfig types (gRFC A50)" Apr 24, 2026
@LYZJU2019 LYZJU2019 marked this pull request as ready for review April 26, 2026 04:42
Comment thread tonic-xds/src/xds/resource/outlier_detection.rs Outdated
Comment thread tonic-xds/src/xds/resource/outlier_detection.rs
Comment thread tonic-xds/src/xds/resource/outlier_detection.rs
Comment thread tonic-xds/src/xds/resource/outlier_detection.rs Outdated
LYZJU2019 and others added 2 commits April 27, 2026 13:18
- Trim module docstring; drop "lands in a follow-up PR" framing.
- Note in the docstring why there is no `child_policy` yet (tonic-xds
  has only one balancer; the field will land alongside more balancers).
- Rename `enforcement_percentage` → `enforcing_success_rate` /
  `enforcing_failure_percentage` to match the Envoy proto field names.
- Introduce a local `Percentage(u8)` newtype with a fallible constructor
  and use it for `max_ejection_percent`, `enforcing_success_rate`,
  `threshold`, and `enforcing_failure_percentage` so the 0..=100
  invariant is enforced through the type system. Add tests covering
  the constructor's range checks.
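
A minimal sketch of what such a newtype could look like is shown here. Only the name `Percentage(u8)` and the existence of a fallible constructor are taken from the commit message; the `u32` input type, the `PercentageOutOfRange` error type, and the `get` accessor are assumptions for illustration.

```rust
/// A percentage guaranteed to be within 0..=100.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Percentage(u8);

/// Hypothetical error type carrying the rejected value.
#[derive(Debug, PartialEq, Eq)]
pub struct PercentageOutOfRange(pub u32);

impl Percentage {
    /// Fallible constructor: rejects anything above 100.
    /// Taking `u32` is an assumption (it mirrors a proto uint32 field).
    pub fn new(value: u32) -> Result<Self, PercentageOutOfRange> {
        if value <= 100 {
            // The range check above makes this cast lossless.
            Ok(Percentage(value as u8))
        } else {
            Err(PercentageOutOfRange(value))
        }
    }

    /// Returns the wrapped value, always in 0..=100.
    pub fn get(self) -> u8 {
        self.0
    }
}
```

Because the inner field is private, code outside the module can only obtain a `Percentage` through `new`, which is what lets the 0..=100 invariant hold everywhere the type is used.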
Comment on lines +8 to +9
//! load balancer and integrates outlier detection as a filter on the
//! `Discover` stream feeding it, so there is no `child_policy` field

This is not entirely accurate. tonic-xds outlier detection will likely be implemented as a stream / mpsc channel that the LoadBalancer layer in #2607 can poll. The implementation will need to leverage the EjectedChannel type added in #2587.

Thanks for this comment @gu0keno0! Will update the module description in subsequent PRs when #2607 is merged.

/// An endpoint is a candidate for ejection when its success rate falls
/// below `mean - stdev * (stdev_factor / 1000.0)`.
pub stdev_factor: u32,
/// Probability that a candidate is actually ejected.

I think this is a threshold for ejecting a host if RPC success rate drops below it?

@gu0keno0 gu0keno0 left a comment

We should also support consecutive 5xx. It's not in the original gRPC spec but is useful.

gRPC statuses are not HTTP error codes; however, Envoy xDS has an error-matcher config (https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/upstreams/http/v3/http_protocol_options.proto#extensions-upstreams-http-v3-httpprotocoloptions-outlierdetection) that lets us specify how to map headers to 5xx.

I think you can proceed with basic A50, and then add support for consecutive 5xx.

@gu0keno0 gu0keno0 merged commit b77a506 into grpc:master Apr 30, 2026
21 checks passed
LYZJU2019 added a commit to LYZJU2019/tonic that referenced this pull request Apr 30, 2026
Address two follow-up review comments from grpc#2604 (the merged config
PR) by folding the doc updates into this PR:

- Module docstring: describe the actual integration plan (an mpsc
  channel of EjectionDecisions polled by LoadBalancer, leveraging
  EjectedChannel) instead of the original "filter on the Discover
  stream" wording. Add intra-doc links to the relevant types.

- enforcing_success_rate / enforcing_failure_percentage: clarify
  that each is the *enforcement probability* — distinct from the
  per-algorithm threshold (stdev_factor for success-rate, threshold
  for failure-percentage). Note that 0 disables enforcement while
  still computing statistics.

Also fix an unresolved intra-doc link in the algorithm module.
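
To make the threshold-versus-enforcement distinction concrete, here is a small sketch. The formula `mean - stdev * (stdev_factor / 1000.0)` is the one quoted in the doc comment under review; the function names, the use of a population standard deviation, and the caller-supplied `roll` in `0..100` are assumptions for the example rather than the tonic-xds implementation.

```rust
/// Success-rate ejection threshold: mean - stdev * (stdev_factor / 1000.0).
/// Assumes a population standard deviation; returns None for an empty pool.
pub fn success_rate_ejection_threshold(success_rates: &[f64], stdev_factor: u32) -> Option<f64> {
    if success_rates.is_empty() {
        return None;
    }
    let n = success_rates.len() as f64;
    let mean = success_rates.iter().sum::<f64>() / n;
    let variance = success_rates
        .iter()
        .map(|rate| (rate - mean).powi(2))
        .sum::<f64>()
        / n;
    Some(mean - variance.sqrt() * (f64::from(stdev_factor) / 1000.0))
}

/// Enforcement roll: a candidate is ejected only if the roll lands under the
/// enforcement percentage. `roll` is a hypothetical caller-supplied random
/// value in 0..100, so 0 never enforces and 100 always does.
pub fn should_enforce(enforcing_percent: u8, roll: u8) -> bool {
    roll < enforcing_percent
}
```

In this sketch an endpoint first has to fall below the threshold to become a candidate; only then does `should_enforce` decide whether the ejection actually happens. With an enforcement percentage of 0 the statistics are still computed, but no candidate is ever ejected.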
gu0keno0 pushed a commit that referenced this pull request Jun 3, 2026
…tage) + LoadBalancer integration (#2619)

## Summary

Implements [gRFC A50: xDS Outlier
Detection](https://github.com/grpc/proposal/blob/master/A50-xds-outlier-detection.md)
(failure-percentage algorithm) in `tonic-xds` and integrates it into
`LoadBalancer`. Config types landed in #2604.

## Architecture

- **Data path** (`OutlierStatsRegistry::record_outcome`, called from
`LoadBalancer::call`) — increments the picked channel's success/failure
counter. Nothing else.
- **Sweep** (`OutlierStatsRegistry::run_housekeeping`, called by the
housekeeping actor on each `config.interval` tick) — snapshots all
channel counters, runs the failure-percentage algorithm against the
snapshot population (applying `minimum_hosts`, `max_ejection_percent`,
threshold, enforcement roll), dispatches eject addresses on an mpsc,
then resets counters and decrements multipliers for non-ejected
channels.
- **Load balancer** — drains the eject mpsc in `poll_ready`, ejects via
`ReadyChannel::eject`, tracks the resulting `EjectedChannel` in
`KeyedFutures<_, UnejectedChannel<_>>`. The picker only sees `ready`, so
ejected channels are unpickable by construction.
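
The split between a cheap data path and a periodic sweep can be sketched with per-channel atomic counters. `ChannelCounters` and `snapshot_and_reset` are hypothetical names, and merging the snapshot and reset into one step is a simplification; the commit describes them as separate phases of `run_housekeeping`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical per-channel counters touched on the data path.
#[derive(Default)]
pub struct ChannelCounters {
    successes: AtomicU64,
    failures: AtomicU64,
}

impl ChannelCounters {
    /// Data path: bump exactly one counter and do nothing else.
    pub fn record_outcome(&self, success: bool) {
        let counter = if success { &self.successes } else { &self.failures };
        counter.fetch_add(1, Ordering::Relaxed);
    }

    /// Sweep: read both counters and zero them for the next interval.
    /// Returns (successes, failures). The two swaps are not atomic as a
    /// pair, which this sketch accepts for simplicity.
    pub fn snapshot_and_reset(&self) -> (u64, u64) {
        (
            self.successes.swap(0, Ordering::Relaxed),
            self.failures.swap(0, Ordering::Relaxed),
        )
    }
}
```

Keeping the per-RPC work to a single atomic increment means the ejection algorithm's cost is paid once per interval on the housekeeping task, not on every call.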

Un-ejection is timer-driven per channel: each `EjectedChannel`'s `Sleep`
fires at `min(base × multiplier, max(base, max_ejection_time))` and
yields an `UnejectedChannel`; the LB routes the resolved channel back to
`ready`.
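
The un-ejection delay formula above can be written as a small helper. The function name and signature are hypothetical; only the `min(base × multiplier, max(base, max_ejection_time))` expression comes from the commit message.

```rust
use std::time::Duration;

/// Ejection duration: min(base * multiplier, max(base, max_ejection_time)).
pub fn ejection_duration(
    base: Duration,
    multiplier: u32,
    max_ejection_time: Duration,
) -> Duration {
    // The cap never drops below `base`, even if max_ejection_time is smaller.
    let cap = base.max(max_ejection_time);
    // saturating_mul avoids a panic on overflow for very large multipliers.
    base.saturating_mul(multiplier).min(cap)
}
```

For example, with a 30 s base and a 300 s maximum, a multiplier of 3 gives 90 s, while a multiplier of 20 is capped at 300 s.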

## Constructor interface

`LoadBalancer::new` takes `Arc<ArcSwap<OutlierDetectionConfig>>`.
`OutlierDetectionConfig::default()` is the disabled config (both
algorithms `None`) — no actor spawned, `record_outcome` short-circuits
at the per-state counter increment. The `ArcSwap` shape reserves the
slot for the future xDS-driven config-update path.

## A50 compliance

- Algorithm runs at the interval sweep, not per RPC (§6).
- Failure-percentage uses strict `>` against the threshold.
- Multiplier decrements at the same transition that un-ejects (§6.b).
- `max_ejection_percent` floors at 1 for non-empty pools (spec: "will
eject at least one address regardless of the value").
- Outlier state survives `Change::Insert` for an already-tracked
address.
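
Two of these rules, the strict `>` comparison and the floor of one ejection for non-empty pools, can be illustrated as follows. Both function names are hypothetical, and expressing the cap as a precomputed count is one possible reading of the rule, not necessarily how the sweep in this commit enforces it. The `request_volume` and `minimum_hosts` checks are not modeled.

```rust
/// True when the failure percentage is strictly greater than the threshold.
/// Integer arithmetic avoids float rounding at the boundary.
pub fn exceeds_failure_threshold(successes: u64, failures: u64, threshold_percent: u8) -> bool {
    let total = successes + failures;
    if total == 0 {
        // No traffic in this interval: nothing to judge.
        return false;
    }
    failures * 100 > u64::from(threshold_percent) * total
}

/// Upper bound on ejected endpoints: max_ejection_percent of the pool,
/// but at least one when the pool is non-empty.
pub fn max_ejections(pool_size: usize, max_ejection_percent: u8) -> usize {
    if pool_size == 0 {
        return 0;
    }
    (pool_size * usize::from(max_ejection_percent) / 100).max(1)
}
```

Under these assumptions, an endpoint sitting exactly at the threshold (50 failures out of 100 with a threshold of 50) is not a candidate, and a five-endpoint pool with a 10% cap can still eject one endpoint.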

## Deferred

- Success-rate algorithm.
- Live config-update plumbing (`ArcSwap::store` is supported but the
actor doesn't observe swaps yet).
- Wiring from `ClusterResource` into LB construction.

## Test plan

- [x] `cargo test -p tonic-xds --lib --all-features` — 324 lib tests
pass
- [x] `cargo fmt -p tonic-xds`
- [x] `cargo clippy` clean on changed files

---------

Co-authored-by: YtMa <ytma98@gmail.com>
4 participants