Skip to content

Conversation

@PardhuKonakanchi
Copy link
Contributor

gRFC for gRPC metrics for xDS Outlier Detection

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Mar 4, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for contributing this gRFC!

Overall, this looks reasonable. I've provided some initial comments here, but I'd also like input from @ejona86, @dfawley, and @yashykt.

Please let me know if you have any questions.

@markdroth markdroth changed the title A87-grpc-metrics-xds-outlier-detection.md A91: Outlier Detection Metrics Mar 5, 2025
PardhuKonakanchi and others added 4 commits March 19, 2025 12:54
…tlier-detection-metrics.md

`detected` metrics description update and rename file
Added backend_service optional label
@markdroth
Copy link
Member

This looks great to me!

Waiting on reviews from @ejona86, @dfawley, and @yashykt.

Clarified how `grpc.lb.backend_service` label will be populated
Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good!

@dfawley, waiting for your review.


| Name | Type | Unit | Labels | Description |
| ------------- | ----- | ----- | ------- | ----------- |
| grpc.lb.outlier_detection.ejections_enforced_total | Counter | {ejection} | grpc.target, grpc.lb.backend_service | Total enforced ejections due to any outlier type |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

| grpc.lb.outlier_detection.ejections_overflow | Counter | {ejection} | grpc.target, grpc.lb.backend_service | Number of ejections aborted due to max ejection percentage |
| grpc.lb.outlier_detection.ejections_enforced_success_rate | Counter | {ejection} | grpc.target, grpc.lb.backend_service | Enforced success rate outlier ejections |
| grpc.lb.outlier_detection.ejections_detected_success_rate | Counter | {ejection} | grpc.target, grpc.lb.backend_service | Detected (even if unenforced) success rate outlier ejections |
| grpc.lb.outlier_detection.ejections_enforced_failure_percentage | Counter | {ejection} | grpc.target, grpc.lb.backend_service | Enforced failure percentage outlier ejections |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll just mention that the alternative way to state this would be to have an attribute/label for whether an ejection is enforced or not. With that, we'll only be incrementing one counter in the case where we have an ejection that is enforced, as opposed to two with the current proposal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yashykt This is a good point. I followed the approach Envoy took, but I'm open to updating this to ejections_enforced and ejections_unenforced. This may also be easier for users who are searching for one or the other, rather than total detected.

One thing is we should probably specify reasons an outlier is unenforced. The two I can think of are doesn't pass enforcement percentage or we are already above max_ejection_percent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yashykt @markdroth I went ahead and updated the metric to be unenforced instead of detected and included the Rationale. If we want to keep as-is, can revert, but I do think it makes sense this way as a user

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing is we should probably specify reasons an outlier is unenforced. The two I can think of are doesn't pass enforcement percentage or we are already above max_ejection_percent

This could be a label if you still want to do this.

It also seems like we could use labels and only have two metrics:

  1. ejections_enforced w/label for ejection reason (overflow, success_rate, failure_percentage, other)
  2. ejections_unenforced w/ejection reason label and a label for the unenforcement reason (ejection percentage / max ejections)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dfawley Great idea! Just made an update accordingly. Let me know what you think

Changed Detected ejections to Unenforced ejections
Co-authored-by: Doug Fawley <dfawley@google.com>
Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay here -- I've been super busy lately.

Just one minor comment remaining from me, otherwise this looks great!

Copy link
Member

@dfawley dfawley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM module Mark's comment.

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@PardhuKonakanchi
Copy link
Contributor Author

@markdroth Thanks for all the help reviewing this! Anything blocking us from merging the gRFC now?

@markdroth
Copy link
Member

@markdroth Thanks for all the help reviewing this! Anything blocking us from merging the gRFC now?

Nothing other than the fact that I forgot to hit the button. :)

Thanks for the contribution!

@markdroth markdroth merged commit 4f833c5 into grpc:master Jul 8, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants