
Observability metrics for the gRPC (ext-authz) and HTTP (wristband) servers #225

Merged · 9 commits into main · Mar 7, 2022

Conversation

@guicassolato (Collaborator) commented Feb 25, 2022

Closes #162

  • controller-runtime metrics
  • Ext-authz gRPC server metrics
    • AuthConfig metrics (response status, latency)
    • Evaluator metrics (counters, latency) – requires `metrics: true` set at the level of the evaluator in the AuthConfig (default: `false`)
  • Built-in OIDC/Festival Wristband validation HTTP server metrics
  • Docs

Main metrics exported

Endpoint: /metrics
| Metric name                                  | Description                                                                                          | Labels                                                          | Type      |
| -------------------------------------------- | ---------------------------------------------------------------------------------------------------- | --------------------------------------------------------------- | --------- |
| controller_runtime_reconcile_total           | Total number of reconciliations per controller                                                       | `controller=authconfig\|secret`, `result=success\|error\|requeue` | counter   |
| controller_runtime_reconcile_errors_total    | Total number of reconciliation errors per controller                                                 | `controller=authconfig\|secret`                                  | counter   |
| controller_runtime_reconcile_time_seconds    | Length of time per reconciliation per controller                                                     | `controller=authconfig\|secret`                                  | histogram |
| controller_runtime_max_concurrent_reconciles | Maximum number of concurrent reconciles per controller                                               | `controller=authconfig\|secret`                                  | gauge     |
| workqueue_adds_total                         | Total number of adds handled by workqueue                                                            | `name=authconfig\|secret`                                        | counter   |
| workqueue_depth                              | Current depth of workqueue                                                                           | `name=authconfig\|secret`                                        | gauge     |
| workqueue_queue_duration_seconds             | How long in seconds an item stays in workqueue before being requested                                | `name=authconfig\|secret`                                        | histogram |
| workqueue_longest_running_processor_seconds  | How many seconds has the longest running processor for workqueue been running                        | `name=authconfig\|secret`                                        | gauge     |
| workqueue_retries_total                      | Total number of retries handled by workqueue                                                         | `name=authconfig\|secret`                                        | counter   |
| workqueue_unfinished_work_seconds            | How many seconds of work has been done that is in progress and hasn't been observed by work_duration | `name=authconfig\|secret`                                        | gauge     |
| workqueue_work_duration_seconds              | How long in seconds processing an item from workqueue takes                                          | `name=authconfig\|secret`                                        | histogram |
| rest_client_requests_total                   | Number of HTTP requests, partitioned by status code, method, and host                                | `code=200\|404`, `method=GET\|PUT\|POST`                          | counter   |


Endpoint: /server-metrics
| Metric name                                 | Description                                                                               | Labels                                                        | Type      |
| ------------------------------------------- | ----------------------------------------------------------------------------------------- | ------------------------------------------------------------- | --------- |
| auth_server_evaluator_total (\*)            | Total number of evaluations of individual authconfig rule performed by the auth server    | `namespace`, `authconfig`, `evaluator_type`, `evaluator_name` | counter   |
| auth_server_evaluator_cancelled (\*)        | Number of evaluations of individual authconfig rule cancelled by the auth server          | `namespace`, `authconfig`, `evaluator_type`, `evaluator_name` | counter   |
| auth_server_evaluator_ignored (\*)          | Number of evaluations of individual authconfig rule ignored by the auth server            | `namespace`, `authconfig`, `evaluator_type`, `evaluator_name` | counter   |
| auth_server_evaluator_denied (\*)           | Number of denials from individual authconfig rule evaluated by the auth server            | `namespace`, `authconfig`, `evaluator_type`, `evaluator_name` | counter   |
| auth_server_evaluator_duration_seconds (\*) | Response latency of individual authconfig rule evaluated by the auth server (in seconds)  | `namespace`, `authconfig`, `evaluator_type`, `evaluator_name` | histogram |
| auth_server_authconfig_total                | Total number of authconfigs enforced by the auth server, partitioned by authconfig        | `namespace`, `authconfig`                                     | counter   |
| auth_server_authconfig_response_status      | Response status of authconfigs sent by the auth server, partitioned by authconfig         | `namespace`, `authconfig`, `status=OK\|UNAUTHENTICATED\|PERMISSION_DENIED` | counter |
| auth_server_authconfig_duration_seconds     | Response latency of authconfig enforced by the auth server (in seconds)                   | `namespace`, `authconfig`                                     | histogram |
| auth_server_response_status                 | Response status of authconfigs sent by the auth server                                    | `status=OK\|UNAUTHENTICATED\|PERMISSION_DENIED\|NOT_FOUND`      | counter   |
| grpc_server_handled_total                   | Total number of RPCs completed on the server, regardless of success or failure            | `grpc_code=OK\|Aborted\|Canceled\|DeadlineExceeded\|Internal\|ResourceExhausted\|Unknown`, `grpc_method=Check`, `grpc_service=envoy.service.auth.v3.Authorization` | counter |
| grpc_server_handling_seconds                | Response latency (seconds) of gRPC that had been application-level handled by the server  | `grpc_method=Check`, `grpc_service=envoy.service.auth.v3.Authorization` | histogram |
| grpc_server_msg_received_total              | Total number of RPC stream messages received on the server                                | `grpc_method=Check`, `grpc_service=envoy.service.auth.v3.Authorization` | counter |
| grpc_server_msg_sent_total                  | Total number of gRPC stream messages sent by the server                                   | `grpc_method=Check`, `grpc_service=envoy.service.auth.v3.Authorization` | counter |
| grpc_server_started_total                   | Total number of RPCs started on the server                                                | `grpc_method=Check`, `grpc_service=envoy.service.auth.v3.Authorization` | counter |
| oidc_server_requests_total                  | Number of GET requests received on the OIDC (Festival Wristband) server                   | `namespace`, `authconfig`, `wristband`, `path=oidc-config\|jwks` | counter |
| oidc_server_response_status                 | Status of HTTP response sent by the OIDC (Festival Wristband) server                      | `status=200\|404`                                              | counter   |

Plus multiple other Go runtime metrics, such as the number of goroutines (go_goroutines) and threads (go_threads), and CPU, memory and GC stats.

(*) Opt-in metrics: auth_server_evaluator_* metrics require authconfig.spec.(identity|metadata|authorization|response).metrics: true (default: false). This can be enforced for the entire instance (all AuthConfigs and evaluators), by setting the DEEP_METRICS_ENABLED=true environment variable in the Authorino deployment.
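For illustration, an AuthConfig fragment opting individual evaluators into deep metrics might look like the sketch below. Only the `metrics` field comes from this PR; the evaluator names and types are hypothetical.

```yaml
# Sketch only – evaluator names are made up; `metrics` is the opt-in flag added by this PR
spec:
  identity:
    - name: api-key-users   # hypothetical evaluator name
      metrics: true         # exports auth_server_evaluator_* metrics for this evaluator
  authorization:
    - name: admins-only     # hypothetical evaluator name
      metrics: false        # default: no evaluator-level metrics
```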

Verification steps

  • `make local-setup`
  • (Optional) Create an AuthConfig (e.g. API key authn and wristband token)
  • (Optional) Set `metrics: true` for at least one evaluator of the AuthConfig
  • (Optional) `kubectl -n authorino set env deployment/authorino DEEP_METRICS_ENABLED=true`
  • (Optional) Send requests to the protected API and to the well-known OIDC endpoints for the Festival Wristband config and JWKS
  • `kubectl -n authorino port-forward service/authorino-controller-metrics 8080:8080`
  • `curl http://localhost:8080/metrics`
  • `curl http://localhost:8080/server-metrics`

@guicassolato self-assigned this Feb 25, 2022
@guicassolato force-pushed the observability branch 2 times, most recently from 9ddf1eb to ae0b37d on March 1, 2022 17:52
| Metric name                             | Description                                                                               | Labels                                                        | Type      |
| --------------------------------------- | ----------------------------------------------------------------------------------------- | ------------------------------------------------------------- | --------- |
| auth_server_evaluator_total             | Total number of evaluations of individual authconfig rule performed by the auth server.   | `namespace`, `authconfig`, `evaluator_type`, `evaluator_name` | counter   |
| auth_server_evaluator_cancelled         | Number of evaluations of individual authconfig rule cancelled by the auth server.         | `namespace`, `authconfig`, `evaluator_type`, `evaluator_name` | counter   |
| auth_server_evaluator_ignored           | Number of evaluations of individual authconfig rule ignored by the auth server.           | `namespace`, `authconfig`, `evaluator_type`, `evaluator_name` | counter   |
| auth_server_evaluator_denied            | Number of denials from individual authconfig rule evaluated by the auth server.           | `namespace`, `authconfig`, `evaluator_type`, `evaluator_name` | counter   |
| auth_server_evaluator_duration_seconds  | Response latency of individual authconfig rule evaluated by the auth server (in seconds). | `namespace`, `authconfig`, `evaluator_type`, `evaluator_name` | histogram |
| auth_server_authconfig_total            | Total number of authconfigs enforced by the auth server, partitioned by authconfig.       | `namespace`, `authconfig`                                     | counter   |
| auth_server_authconfig_response_status  | Response status of authconfigs sent by the auth server, partitioned by authconfig.        | `namespace`, `authconfig`, `status`                           | counter   |
| auth_server_authconfig_duration_seconds | Response latency of authconfig enforced by the auth server (in seconds).                  | `namespace`, `authconfig`                                     | histogram |
| auth_server_response_status             | Response status of authconfigs sent by the auth server.                                   | `status`                                                      | counter   |

Added field option `spec.(identity|metadata|authorization|response).monit: bool` to enable/disable metrics at evaluator granularity level (default: false).
@guicassolato marked this pull request as ready for review March 2, 2022 09:25
@guicassolato requested review from jjaferson and a team March 2, 2022 09:26
jjaferson previously approved these changes Mar 3, 2022

@jjaferson (Contributor) left a comment:

Great work @guicassolato, left two suggestions there for us to think about.

```diff
@@ -140,6 +140,10 @@ type Identity struct {
 	// +kubebuilder:default:=0
 	Priority int `json:"priority,omitempty"`
+
+	// Whether this identity config should generate individual observability metrics
+	// +kubebuilder:default:=false
+	Metrics bool `json:"metrics,omitempty"`
```
@jjaferson (Contributor) commented:

I see the advantages of enabling metrics at the auth config level, the main point being that auth configs that don't need to be monitored can be discarded which will reduce the cost of generating and processing more data but I was just wondering how confusing it can be to monitor these auth configs at a service level as only the auth configs with the metrics enabled will be exposed.

@guicassolato (Collaborator, Author) replied:

> will reduce the cost of generating and processing

I'd also include memory consumption.

I guess I'd rather not have metrics at the AuthConfig level at all than enable them for all by default. We're talking tons of useless stats.

At the same time, not having those stats can make users' lives very difficult. There are usually simple tweaks one can do (especially after we improve caching) to improve the performance of an AuthConfig, to work around critical paths momentarily, etc., but only if people are able to know where to act.

I really think we need these metrics and that they have to be opt-in.

> how confusing it can be to monitor these auth configs at a service level

I really don't have a good answer to this, I'm afraid. Maybe if it was simpler to set a separate metrics endpoint per AuthConfig (e.g. /server-metrics/my-authconfig-ns/my-authconfig-name); even so, it wouldn't solve the issue of enabled/disabled.

Do you have any suggestion?

Collaborator commented:

There is something similar to this in SSO, where each realm has a metrics endpoint. A follow-on issue here might be to create a Grafana dashboard using these metrics. It would be a practical exercise in trying to use these metrics to visualise useful data.

@guicassolato (Collaborator, Author) replied:

> how confusing it can be to monitor these auth configs at a service level

Introduced a new env var, `DEEP_METRICS_ENABLED`: it enforces deep (evaluator-level) metrics exported for all AuthConfigs, so there is an easier way to enable these metrics than going AuthConfig by AuthConfig and adding `metrics: true` for each evaluator.

This will be useful for debugging and for enabling deep monitoring at the service level.


```go
// impl:metrics.Object

func (config *AuthorizationConfig) Measured() bool {
```
@jjaferson (Contributor) commented:

Just an idea: `measured`, to me, sounds more like something related to size or distance; maybe `exportMetrics` would be easier to read?

@guicassolato (Collaborator, Author) replied:

Changed to `MetricsEnabled()`.
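The pattern discussed in this thread, where each evaluator config type reports its own opt-in flag through a small interface, might look like the sketch below. Names other than `MetricsEnabled` (including the interface name and `reportDuration`) are hypothetical, not Authorino's actual source.

```go
package main

import "fmt"

// MetricsReporter is a hypothetical interface name; each evaluator config
// type reports whether it opted in to individual observability metrics.
type MetricsReporter interface {
	MetricsEnabled() bool
}

type AuthorizationConfig struct {
	Metrics bool // mirrors the `metrics: true` field in the AuthConfig spec
}

// impl:MetricsReporter
func (config *AuthorizationConfig) MetricsEnabled() bool {
	return config.Metrics
}

// reportDuration records an observation only for opted-in evaluators; it
// returns whether a metric would have been emitted. In the real server this
// is where auth_server_evaluator_duration_seconds would be observed.
func reportDuration(obj MetricsReporter, seconds float64) bool {
	if !obj.MetricsEnabled() {
		return false // deep metrics disabled for this evaluator
	}
	return true
}

func main() {
	fmt.Println(reportDuration(&AuthorizationConfig{Metrics: true}, 0.03))  // true
	fmt.Println(reportDuration(&AuthorizationConfig{Metrics: false}, 0.03)) // false
}
```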

maleck13 commented Mar 4, 2022

This may be something that has already been thought about, but do any of these metrics use labels that can have lots of unique values, like something generated? These can create quite an overhead on Prometheus, as each unique value is stored as an individual time series. I mention it because we hit this issue with SSO, where it was recording the URL but there were potentially millions of unique values for a particular endpoint.

@guicassolato (Collaborator, Author) replied:

> This may be something that has already been thought about, but do any of these metrics use labels that can have lots of unique values, like something generated?

We made sure no user input is used in metric labels. (By "no user input" here, I mean nothing from the Envoy payload to the /Check ext-authz operation; therefore, "user" as in one who sends a request to an API protected with Authorino.)

URL paths are not used in the labels either.

This should be enough to protect the metrics endpoints against DoS via label-cardinality explosion.

We do use AuthConfig names to partition some metrics. The name of the AuthConfig is used for individual measurements of duration and counters (number of hits, response codes) of an AuthConfig (always enabled). Then, we have metrics at the granularity of each evaluator in an AuthConfig – i.e. the name of the evaluator is used in the label. These are opt-in metrics.
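The cardinality concern can be illustrated with a tiny sketch (illustrative only; the label values below are made up): every unique label value becomes its own time series in Prometheus, so bounded, operator-controlled values like AuthConfig names stay cheap, while request-derived values such as URL paths would grow without bound.

```go
package main

import "fmt"

// seriesCount returns how many distinct time series a single metric would
// produce for the given stream of label values: one per unique value.
func seriesCount(labelValues []string) int {
	unique := map[string]struct{}{}
	for _, v := range labelValues {
		unique[v] = struct{}{}
	}
	return len(unique)
}

func main() {
	// Bounded: the operator controls how many AuthConfigs exist.
	byAuthConfig := []string{"talker-api", "talker-api", "echo-api"}
	// Unbounded: a client could mint a fresh label value on every request.
	byPath := []string{"/item/1", "/item/2", "/item/3", "/item/4"}
	fmt.Println(seriesCount(byAuthConfig)) // 2
	fmt.Println(seriesCount(byPath))       // 4
}
```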

@guicassolato merged commit fa9361d into main Mar 7, 2022
@guicassolato deleted the observability branch March 7, 2022 12:12