Commit 4f833c5

PardhuKonakanchi, markdroth, and dfawley authored
A91: Outlier Detection Metrics (#478)
* Create A87-grpc-metrics-xds-outlier-detection.md
* Update A87-grpc-metrics-xds-outlier-detection.md
* Update A87-grpc-metrics-xds-outlier-detection.md
* Update A87-grpc-metrics-xds-outlier-detection.md
* Update A87-grpc-metrics-xds-outlier-detection.md
  Some review feedback
* Update A87-grpc-metrics-xds-outlier-detection.md
  added discussion link
* Update and rename A87-grpc-metrics-xds-outlier-detection.md to A91-outlier-detection-metrics.md
  `detected` metrics description update and rename file
* Update A91-outlier-detection-metrics.md
  Added backend_service optional label
* cosmetic changes
* Add missing link for A75
* Update A91-outlier-detection-metrics.md
  Remove gauge metric
* Update A91-outlier-detection-metrics.md
  Clarified how `grpc.lb.backend_service` label will be populated
* Update A91-outlier-detection-metrics.md
* Update A91-outlier-detection-metrics.md
* Update A91-outlier-detection-metrics.md
  spacing
* Update A91-outlier-detection-metrics.md
* Update A91-outlier-detection-metrics.md
* Update A91-outlier-detection-metrics.md
  remove total
* Update A91-outlier-detection-metrics.md
  Changed Detected ejections to Unenforced ejections
* Update A91-outlier-detection-metrics.md
* Update A91-outlier-detection-metrics.md
  Co-authored-by: Doug Fawley <dfawley@google.com>
* Update A91-outlier-detection-metrics.md
  Fixed naming of label
* Update A91-outlier-detection-metrics.md
  typo
* Update A91-outlier-detection-metrics.md
  Another typo

---------

Co-authored-by: Mark D. Roth <roth@google.com>
Co-authored-by: Doug Fawley <dfawley@google.com>
1 parent 5ae275c commit 4f833c5

1 file changed

+77
-0
lines changed

A91-outlier-detection-metrics.md

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
1+
A91: gRPC Metrics for Outlier Detection
2+
---
3+
* Author(s): @pardhukonakanchi, @huntsman90
4+
* Approver: @markdroth
5+
* Status: In Review
6+
* Implemented in:
7+
* Last updated: 2025-03-05
8+
* Discussion at: https://groups.google.com/g/grpc-io/c/iMezDbq5U-g
9+
10+
## Abstract
11+
12+
This document proposes new metrics for xDS Outlier Detection in gRPC clients.
13+
14+
## Background
15+
16+
[A50: gRPC xDS Outlier Detection Support][A50] is a spec for gRPC to support Outlier Detection. The current implementation offers only debug and trace logging for visibility, which can be insufficient for understanding and diagnosing outlier detection decisions in large-scale production systems.
17+
18+
Using [A79: Non-per-call Metrics Architecture][A79], it is possible to add granular metrics that give service owners using gRPC easy visibility into outlier detection.
19+
20+
### Related proposals
21+
* [A50: gRPC xDS Outlier Detection Support][A50]
22+
* [A66: OpenTelemetry Metrics][A66]
23+
* [A79: Non-per-call Metrics Architecture][A79]
24+
* [A75: xDS Aggregate Cluster Behavior Fixes][A75]
25+
* [A89: Backend Service Metric Label][A89]
26+
27+
[A50]: A50-xds-outlier-detection.md
28+
[A66]: A66-otel-stats.md
29+
[A75]: A75-xds-aggregate-cluster-behavior-fixes.md
30+
[A79]: A79-non-per-call-metrics-architecture.md
31+
[A89]: A89-backend-service-metric-label.md
32+
33+
## Proposal
34+
35+
[A79]’s non-per-call metrics architecture is a good fit for reporting these metrics. Where appropriate, we propose to aim for parity with [Envoy’s Outlier Detection metrics](https://www.envoyproxy.io/docs/envoy/latest/configuration/upstream/cluster_manager/cluster_stats#outlier-detection-statistics).
36+
37+
Outlier Detection metrics will have the following labels:
38+
39+
| Name | Disposition | Description |
| ----------- | ----------- | ----------- |
| grpc.target | required | Indicates the target we are running outlier detection on, as described in [A66]. |
| grpc.lb.outlier_detection.detection_method | required | Indicates the method by which the outlier was detected. Currently one of {"success_rate", "failure_percentage"}. |
| grpc.lb.outlier_detection.unenforced_reason | required | Indicates the reason a detected outlier was not ejected. Currently one of {"enforcement_percentage", "max_ejection_overflow"}. |
| grpc.lb.backend_service | optional | The backend service to which the traffic is being sent, as described in [A89]. Note that this label will be supported only if [A75] has already been implemented. |
45+
46+
The `grpc.lb.backend_service` label will be populated based on the resolver attribute passed down from the cds policy, as described in [A89].
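
As an illustration of how the labels above might be assembled for a single recorded event, here is a minimal Python sketch. The helper `build_labels` and all of its parameters are hypothetical and are not part of any gRPC API. It also assumes that the optional `grpc.lb.backend_service` label is attached only when the application has opted in and a value is available.

```python
from typing import Dict, Optional

# Label values taken from the label table in this proposal.
DETECTION_METHODS = {"success_rate", "failure_percentage"}
UNENFORCED_REASONS = {"enforcement_percentage", "max_ejection_overflow"}


def build_labels(
    target: str,
    detection_method: str,
    unenforced_reason: Optional[str] = None,
    backend_service: Optional[str] = None,
    enable_backend_service: bool = False,
) -> Dict[str, str]:
    """Hypothetical helper: assemble the labels for one outlier detection event."""
    if detection_method not in DETECTION_METHODS:
        raise ValueError(f"unknown detection method: {detection_method}")
    labels = {
        "grpc.target": target,
        "grpc.lb.outlier_detection.detection_method": detection_method,
    }
    # Only the unenforced counter carries the unenforced_reason label.
    if unenforced_reason is not None:
        if unenforced_reason not in UNENFORCED_REASONS:
            raise ValueError(f"unknown unenforced reason: {unenforced_reason}")
        labels["grpc.lb.outlier_detection.unenforced_reason"] = unenforced_reason
    # Assumption: the optional label is emitted only when enabled and known.
    if enable_backend_service and backend_service is not None:
        labels["grpc.lb.backend_service"] = backend_service
    return labels
```

With the optional label left disabled, the resulting dictionary contains only the required labels for the metric being recorded.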
47+
48+
The following metrics will be exported:
49+
50+
| Name | Type | Unit | Labels | Description |
| ------------- | ----- | ----- | ------- | ----------- |
| grpc.lb.outlier_detection.ejections_enforced | Counter | {ejection} | grpc.target, grpc.lb.backend_service, grpc.lb.outlier_detection.detection_method | Number of enforced outlier ejections, by detection method. |
| grpc.lb.outlier_detection.ejections_unenforced | Counter | {ejection} | grpc.target, grpc.lb.backend_service, grpc.lb.outlier_detection.detection_method, grpc.lb.outlier_detection.unenforced_reason | Number of unenforced outlier ejections, where enforcement was prevented by either the maximum ejection percentage or `enforcement_percentage`. |
54+
55+
These metrics will be updated accordingly on each outlier detection and the resulting ejection decision.
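
The relationship between an ejection decision and the two counters can be sketched as follows. This is a simplified illustration, not the gRPC implementation: `EjectionMetrics` is a hypothetical in-memory stand-in for a metrics recorder, `on_outlier_detected` and its parameters are invented for this example, and the order and exact comparison used for the two checks are assumptions loosely modeled on the algorithm in [A50].

```python
from collections import Counter
from typing import Optional

ENFORCED = "grpc.lb.outlier_detection.ejections_enforced"
UNENFORCED = "grpc.lb.outlier_detection.ejections_unenforced"


class EjectionMetrics:
    """Hypothetical in-memory stand-in for a metrics recorder."""

    def __init__(self, target: str) -> None:
        self.target = target
        self.counts: Counter = Counter()

    def record(self, detection_method: str,
               unenforced_reason: Optional[str] = None) -> None:
        labels = [
            ("grpc.target", self.target),
            ("grpc.lb.outlier_detection.detection_method", detection_method),
        ]
        name = ENFORCED
        if unenforced_reason is not None:
            name = UNENFORCED
            labels.append(
                ("grpc.lb.outlier_detection.unenforced_reason", unenforced_reason)
            )
        self.counts[(name, tuple(labels))] += 1


def on_outlier_detected(metrics: EjectionMetrics, detection_method: str,
                        ejected: int, total: int, max_ejection_percent: int,
                        enforcement_percentage: int, roll: int) -> bool:
    """Record one detected outlier; return True if the ejection is enforced.

    `roll` is a caller-supplied integer in [0, 100) standing in for the
    random draw, so the sketch stays deterministic.
    """
    # Assumption: the cap is hit once the ejected share reaches the maximum.
    if 100 * ejected >= max_ejection_percent * total:
        metrics.record(detection_method, "max_ejection_overflow")
        return False
    # Assumption: the ejection is enforced only if the draw falls below
    # the configured enforcement percentage.
    if roll >= enforcement_percentage:
        metrics.record(detection_method, "enforcement_percentage")
        return False
    metrics.record(detection_method)
    return True
```

Each detected outlier increments exactly one of the two counters, so no event is counted twice.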
56+
57+
### Metric Stability
58+
59+
All metrics added in this proposal will start as experimental and therefore off by default. The long-term goal will be to de-experimentalize them and have them be on by default, but the exact criteria for that change are TBD.
60+
61+
### Temporary environment variable protection
62+
63+
This proposal does not include any features enabled via external I/O, so it does not need environment variable protection.
64+
65+
## Rationale
66+
67+
The metrics defined here are generally a trade-off between the usefulness of the metric and the cost of reporting it. As the design goal is parity with Envoy metrics, we decided to include any metric that was appropriate to gRPC outlier detection.
70+
71+
One change from the Envoy metrics is that, instead of reporting all detected ejections (enforced or unenforced) for each algorithm type as its own metric, we opted to report enforced and unenforced ejections separately. This reduces the cost of reporting each detected outlier by one metric in the enforced case, and the unenforced count is more likely the information that a user of the "detected" metric in Envoy is actually seeking.
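
A consumer who still wants an Envoy-style "detected" count can derive it by summing the two counters per detection method. The Python sketch below illustrates that arithmetic only. The function `detected_by_method`, the dictionary shapes it accepts, and the sample numbers are all invented for this example.

```python
from collections import defaultdict
from typing import Dict, Tuple


def detected_by_method(
    enforced: Dict[str, int],
    unenforced: Dict[Tuple[str, str], int],
) -> Dict[str, int]:
    """Derive detected = enforced + unenforced for each detection method.

    `enforced` maps detection_method -> count.
    `unenforced` maps (detection_method, unenforced_reason) -> count.
    """
    detected: Dict[str, int] = defaultdict(int)
    for method, count in enforced.items():
        detected[method] += count
    for (method, _reason), count in unenforced.items():
        detected[method] += count
    return dict(detected)


# Made-up sample counter values, for illustration only.
sample_enforced = {"success_rate": 7, "failure_percentage": 2}
sample_unenforced = {
    ("success_rate", "enforcement_percentage"): 3,
    ("success_rate", "max_ejection_overflow"): 1,
}
sample_detected = detected_by_method(sample_enforced, sample_unenforced)
# sample_detected == {"success_rate": 11, "failure_percentage": 2}
```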
72+
73+
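Additionally, rather than exporting a separate metric per detection algorithm, ejections_enforced and ejections_unenforced are each a single metric, with labels carrying the detection method and, for unenforced ejections, the reason the ejection was not enforced.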
74+
75+
## Implementation
76+
77+
Dropbox is able to contribute Core and Go implementations, in that order. Implementation in the remaining gRPC languages is left to the respective gRPC teams or other contributors.
