Observability metrics for the gRPC (ext-authz) and HTTP (wristband) servers #225
Conversation
| Metric name | Description | Labels | Type |
| --- | --- | --- | --- |
| auth_server_evaluator_total | Total number of evaluations of individual AuthConfig rules performed by the auth server. | `namespace`, `authconfig`, `evaluator_type`, `evaluator_name` | counter |
| auth_server_evaluator_cancelled | Number of evaluations of individual AuthConfig rules cancelled by the auth server. | `namespace`, `authconfig`, `evaluator_type`, `evaluator_name` | counter |
| auth_server_evaluator_ignored | Number of evaluations of individual AuthConfig rules ignored by the auth server. | `namespace`, `authconfig`, `evaluator_type`, `evaluator_name` | counter |
| auth_server_evaluator_denied | Number of denials from individual AuthConfig rules evaluated by the auth server. | `namespace`, `authconfig`, `evaluator_type`, `evaluator_name` | counter |
| auth_server_evaluator_duration_seconds | Response latency of individual AuthConfig rules evaluated by the auth server (in seconds). | `namespace`, `authconfig`, `evaluator_type`, `evaluator_name` | histogram |
| auth_server_authconfig_total | Total number of AuthConfigs enforced by the auth server, partitioned by AuthConfig. | `namespace`, `authconfig` | counter |
| auth_server_authconfig_response_status | Response status of AuthConfigs sent by the auth server, partitioned by AuthConfig. | `namespace`, `authconfig`, `status` | counter |
| auth_server_authconfig_duration_seconds | Response latency of AuthConfigs enforced by the auth server (in seconds). | `namespace`, `authconfig` | histogram |
| auth_server_response_status | Response status of AuthConfigs sent by the auth server. | `status` | counter |

Added field option `spec.(identity|metadata|authorization|response).metrics: bool` to enable/disable metrics at the evaluator granularity level (default: `false`).
Great work @guicassolato, left two suggestions there for us to think about.
```diff
@@ -140,6 +140,10 @@ type Identity struct {
 	// +kubebuilder:default:=0
 	Priority int `json:"priority,omitempty"`
 
+	// Whether this identity config should generate individual observability metrics
+	// +kubebuilder:default:=false
+	Metrics bool `json:"metrics,omitempty"`
```
I see the advantages of enabling metrics at the auth config level. The main point is that auth configs that don't need to be monitored can be discarded, which reduces the cost of generating and processing extra data. I was just wondering, though, how confusing it could be to monitor these auth configs at a service level, since only the auth configs with metrics enabled will be exposed.
> will reduce the cost of generating and processing
I'd also include memory consumption.
I guess I'd rather not have metrics at the AuthConfig level at all than enable them for all by default. We're talking tons of useless stats.
At the same time, not having those stats can make users' lives very difficult. There are usually simple tweaks that one can do (especially after we improve caching), to improve performance of an AuthConfig, to workaround critical paths momentarily, etc... but only if people are able to know where to act.
I really think we need these metrics and that they have to be opt-in.
> how confusing it can be to monitor these auth configs at a service level
I really don't have a good answer to this, I'm afraid. Maybe if it was simpler to set a separate metrics endpoint per AuthConfig (e.g. /server-metrics/my-authconfig-ns/my-authconfig-name); even so, it wouldn't solve the issue of enabled/disabled.
Do you have any suggestion?
There is something similar to this in SSO, where each realm has a metrics endpoint. A follow-on issue here might be to create a Grafana dashboard using these metrics. It would be a practical exercise in trying to use these metrics to visualise useful data.
> how confusing it can be to monitor these auth configs at a service level

Introduced a new env var `DEEP_METRICS_ENABLED`: it enforces deep (evaluator-level) metrics exported for all AuthConfigs, so there is an easier way to enable these metrics other than going AuthConfig by AuthConfig and adding `metrics: true` for each evaluator.

This will be useful for debugging and for enabling deep monitoring at the level of the service.
pkg/config/authorization.go (outdated)

```go
// impl:metrics.Object

func (config *AuthorizationConfig) Measured() bool {
```
Just an idea: "measured" to me sounds more like something related to size or distance; maybe `exportMetrics` would be easier to read?
Changed to `MetricsEnabled()`.
This may be something that has already been thought about, but do any of these metrics use labels that can have lots of unique values, like something generated? These can create quite an overhead on Prometheus, as each unique value is stored as an individual time series. I mention it because we hit this issue with SSO, where it was recording the URL and there were potentially millions of unique values for a particular endpoint.
We made sure no user input is used in metric labels. (By "no user input" here, I mean nothing from the Envoy payload to the /Check ext-authz operation; therefore, "user" as in one who sends a request to an API protected with Authorino.) URL paths are not used in the labels either. This should be enough to protect against DoS attacks targeting the metrics endpoints. We do use AuthConfig names to partition some metrics: the name of the AuthConfig is used for individual measurements of duration and counters (number of hits, response codes) of an AuthConfig (always enabled). Then, we have metrics at the granularity of each evaluator in an AuthConfig, i.e. the name of the evaluator is used in the label. These ones are opt-in metrics.
Closes #162

Exports observability metrics for the gRPC (ext-authz) and HTTP (wristband) servers. Evaluator-level metrics require `metrics: true` set at the level of the evaluator in the AuthConfig (default: `false`).

Main metrics exported:
- Controller metrics, labeled `controller=authconfig|secret`, `result=success|error|requeue`
- Workqueue metrics, labeled `name=authconfig|secret`
- HTTP client metrics, labeled `code=200|404`, `method=GET|PUT|POST`

Endpoint: /server-metrics
- `auth_server_evaluator_total` (*), labeled `namespace`, `authconfig`, `evaluator_type`, `evaluator_name`
- `auth_server_evaluator_cancelled` (*), labeled `namespace`, `authconfig`, `evaluator_type`, `evaluator_name`
- `auth_server_evaluator_ignored` (*), labeled `namespace`, `authconfig`, `evaluator_type`, `evaluator_name`
- `auth_server_evaluator_denied` (*), labeled `namespace`, `authconfig`, `evaluator_type`, `evaluator_name`
- `auth_server_evaluator_duration_seconds` (*), labeled `namespace`, `authconfig`, `evaluator_type`, `evaluator_name`
- `auth_server_authconfig_total`, labeled `namespace`, `authconfig`
- `auth_server_authconfig_response_status`, labeled `namespace`, `authconfig`, `status=OK|UNAUTHENTICATED|PERMISSION_DENIED`
- `auth_server_authconfig_duration_seconds`, labeled `namespace`, `authconfig`
- `auth_server_response_status`, labeled `status=OK|UNAUTHENTICATED|PERMISSION_DENIED|NOT_FOUND`
- gRPC server metrics, labeled `grpc_method=Check`, `grpc_service=envoy.service.auth.v3.Authorization` (plus `grpc_code=OK|Aborted|Canceled|DeadlineExceeded|Internal|ResourceExhausted|Unknown` where applicable)
- Wristband endpoint metrics, labeled `namespace`, `authconfig`, `wristband`, `path=oidc-config|jwks`, `status=200|404`

Plus multiple other Golang runtime metrics, such as number of goroutines (go_goroutines) and threads (go_threads), CPU, memory and GC stats.

(*) Opt-in metrics: `auth_server_evaluator_*` metrics require `authconfig.spec.(identity|metadata|authorization|response).metrics: true` (default: `false`). This can be enforced for the entire instance (all AuthConfigs and evaluators) by setting the `DEEP_METRICS_ENABLED=true` environment variable in the Authorino deployment.

Verification steps
1. `make local-setup`
2. Set `metrics: true` for at least one evaluator of the AuthConfig, or enforce deep metrics for all: `kubectl -n authorino set env deployment/authorino DEEP_METRICS_ENABLED=true`
3. `kubectl -n authorino port-forward service/authorino-controller-metrics 8080:8080`