-
Notifications
You must be signed in to change notification settings - Fork 249
A91: Outlier Detection Metrics #478
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
markdroth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some review feedback
added discussion link
…tlier-detection-metrics.md `detected` metrics description update and rename file
Added backend_service optional label
Remove gauge metric
Clarified how `grpc.lb.backend_service` label will be populated
markdroth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really good!
@dfawley, waiting for your review.
A91-outlier-detection-metrics.md
Outdated
|
|
||
| | Name | Type | Unit | Labels | Description | | ||
| | ------------- | ----- | ----- | ------- | ----------- | | ||
| | grpc.lb.outlier_detection.ejections_enforced_total | Counter | {ejection} | grpc.target, grpc.lb.backend_service | Total enforced ejections due to any outlier type | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per https://opentelemetry.io/docs/specs/semconv/general/naming/#do-not-use-total, we should not append _total
A91-outlier-detection-metrics.md
Outdated
| | grpc.lb.outlier_detection.ejections_overflow | Counter | {ejection} | grpc.target, grpc.lb.backend_service | Number of ejections aborted due to max ejection percentage | | ||
| | grpc.lb.outlier_detection.ejections_enforced_success_rate | Counter | {ejection} | grpc.target, grpc.lb.backend_service | Enforced success rate outlier ejections | | ||
| | grpc.lb.outlier_detection.ejections_detected_success_rate | Counter | {ejection} | grpc.target, grpc.lb.backend_service | Detected (even if unenforced) success rate outlier ejections | | ||
| | grpc.lb.outlier_detection.ejections_enforced_failure_percentage | Counter | {ejection} | grpc.target, grpc.lb.backend_service | Enforced failure percentage outlier ejections | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll just mention that the alternative way to state this would be to have an attribute/label for whether an ejection is enforced or not. With that, we'll only be incrementing one counter in the case where we have an ejection that is enforced, as opposed to two with the current proposal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yashykt This is a good point. I followed the approach Envoy took, but I'm open to updating this to ejections_enforced and ejections_unenforced. This may also be easier for users who are searching for one or the other, rather than total detected.
One thing is we should probably specify reasons an outlier is unenforced. The two I can think of are doesn't pass enforcement percentage or we are already above max_ejection_percent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yashykt @markdroth I went ahead and updated the metric to be unenforced instead of detected and included the Rationale. If we want to keep as-is, can revert, but I do think it makes sense this way as a user
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing is we should probably specify reasons an outlier is unenforced. The two I can think of are doesn't pass enforcement percentage or we are already above max_ejection_percent
This could be a label if you still want to do this.
It also seems like we could use labels and only have two metrics:
- ejections_enforced w/label for ejection reason (overflow, success_rate, failure_percentage, other)
- ejections_unenforced w/ejection reason label and a label for the unenforcement reason (ejection percentage / max ejections)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dfawley Great idea! Just made an update accordingly. Let me know what you think
remove total
Changed Detected ejections to Unenforced ejections
Co-authored-by: Doug Fawley <dfawley@google.com>
markdroth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay here -- I've been super busy lately.
Just one minor comment remaining from me, otherwise this looks great!
dfawley
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM module Mark's comment.
Fixed naming of label
Another typo
markdroth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
|
@markdroth Thanks for all the help reviewing this! Anything blocking us from merging the gRFC now? |
Nothing other than the fact that I forgot to hit the button. :) Thanks for the contribution! |
gRFC for gRPC metrics for xDS Outlier Detection