
Observability metrics for the gRPC (ext-authz) and HTTP (wristband) servers #225

Merged · 9 commits into main · Mar 7, 2022

Conversation

@guicassolato (Collaborator) commented Feb 25, 2022

Closes #162

  • controller-runtime metrics
  • Ext-authz gRPC server metrics
    • AuthConfig metrics (response status, latency)
    • Evaluator metrics (counters, latency) – requires `metrics: true` set at the level of the evaluator in the AuthConfig (default: `false`)
  • Built-in OIDC/Festival Wristband validation HTTP server metrics
  • Docs

Main metrics exported

Endpoint: /metrics
| Metric name                                  | Description                                                                                          | Labels                                                          | Type      |
| -------------------------------------------- | ---------------------------------------------------------------------------------------------------- | --------------------------------------------------------------- | --------- |
| controller_runtime_reconcile_total           | Total number of reconciliations per controller                                                       | `controller=authconfig\|secret`, `result=success\|error\|requeue` | counter   |
| controller_runtime_reconcile_errors_total    | Total number of reconciliation errors per controller                                                 | `controller=authconfig\|secret`                                  | counter   |
| controller_runtime_reconcile_time_seconds    | Length of time per reconciliation per controller                                                     | `controller=authconfig\|secret`                                  | histogram |
| controller_runtime_max_concurrent_reconciles | Maximum number of concurrent reconciles per controller                                               | `controller=authconfig\|secret`                                  | gauge     |
| workqueue_adds_total                         | Total number of adds handled by workqueue                                                            | `name=authconfig\|secret`                                        | counter   |
| workqueue_depth                              | Current depth of workqueue                                                                           | `name=authconfig\|secret`                                        | gauge     |
| workqueue_queue_duration_seconds             | How long in seconds an item stays in workqueue before being requested                                | `name=authconfig\|secret`                                        | histogram |
| workqueue_longest_running_processor_seconds  | How many seconds has the longest running processor for workqueue been running                        | `name=authconfig\|secret`                                        | gauge     |
| workqueue_retries_total                      | Total number of retries handled by workqueue                                                         | `name=authconfig\|secret`                                        | counter   |
| workqueue_unfinished_work_seconds            | How many seconds of work has been done that is in progress and hasn't been observed by work_duration | `name=authconfig\|secret`                                        | gauge     |
| workqueue_work_duration_seconds              | How long in seconds processing an item from workqueue takes                                          | `name=authconfig\|secret`                                        | histogram |
| rest_client_requests_total                   | Number of HTTP requests, partitioned by status code, method, and host                                | `code=200\|404`, `method=GET\|PUT\|POST`                          | counter   |


Endpoint: /server-metrics
| Metric name                                 | Description                                                                               | Labels                                                        | Type      |
| ------------------------------------------- | ----------------------------------------------------------------------------------------- | ------------------------------------------------------------- | --------- |
| auth_server_evaluator_total (\*)            | Total number of evaluations of individual authconfig rule performed by the auth server    | `namespace`, `authconfig`, `evaluator_type`, `evaluator_name` | counter   |
| auth_server_evaluator_cancelled (\*)        | Number of evaluations of individual authconfig rule cancelled by the auth server          | `namespace`, `authconfig`, `evaluator_type`, `evaluator_name` | counter   |
| auth_server_evaluator_ignored (\*)          | Number of evaluations of individual authconfig rule ignored by the auth server            | `namespace`, `authconfig`, `evaluator_type`, `evaluator_name` | counter   |
| auth_server_evaluator_denied (\*)           | Number of denials from individual authconfig rule evaluated by the auth server            | `namespace`, `authconfig`, `evaluator_type`, `evaluator_name` | counter   |
| auth_server_evaluator_duration_seconds (\*) | Response latency of individual authconfig rule evaluated by the auth server (in seconds)  | `namespace`, `authconfig`, `evaluator_type`, `evaluator_name` | histogram |
| auth_server_authconfig_total                | Total number of authconfigs enforced by the auth server, partitioned by authconfig        | `namespace`, `authconfig`                                     | counter   |
| auth_server_authconfig_response_status      | Response status of authconfigs sent by the auth server, partitioned by authconfig         | `namespace`, `authconfig`, `status=OK\|UNAUTHENTICATED\|PERMISSION_DENIED` | counter |
| auth_server_authconfig_duration_seconds     | Response latency of authconfig enforced by the auth server (in seconds)                   | `namespace`, `authconfig`                                     | histogram |
| auth_server_response_status                 | Response status of authconfigs sent by the auth server                                    | `status=OK\|UNAUTHENTICATED\|PERMISSION_DENIED\|NOT_FOUND`      | counter   |
| grpc_server_handled_total                   | Total number of RPCs completed on the server, regardless of success or failure            | `grpc_code=OK\|Aborted\|Canceled\|DeadlineExceeded\|Internal\|ResourceExhausted\|Unknown`, `grpc_method=Check`, `grpc_service=envoy.service.auth.v3.Authorization` | counter |
| grpc_server_handling_seconds                | Response latency (seconds) of gRPC that had been application-level handled by the server  | `grpc_method=Check`, `grpc_service=envoy.service.auth.v3.Authorization` | histogram |
| grpc_server_msg_received_total              | Total number of RPC stream messages received on the server                                | `grpc_method=Check`, `grpc_service=envoy.service.auth.v3.Authorization` | counter |
| grpc_server_msg_sent_total                  | Total number of gRPC stream messages sent by the server                                   | `grpc_method=Check`, `grpc_service=envoy.service.auth.v3.Authorization` | counter |
| grpc_server_started_total                   | Total number of RPCs started on the server                                                | `grpc_method=Check`, `grpc_service=envoy.service.auth.v3.Authorization` | counter |
| oidc_server_requests_total                  | Number of GET requests received on the OIDC (Festival Wristband) server                   | `namespace`, `authconfig`, `wristband`, `path=oidc-config\|jwks` | counter |
| oidc_server_response_status                 | Status of HTTP response sent by the OIDC (Festival Wristband) server                      | `status=200\|404`                                              | counter   |

Plus multiple other Go runtime metrics, such as the number of goroutines (go_goroutines) and threads (go_threads), and CPU, memory and GC stats.

(*) Opt-in metrics: auth_server_evaluator_* metrics require authconfig.spec.(identity|metadata|authorization|response).metrics: true (default: false). This can be enforced for the entire instance (all AuthConfigs and evaluators), by setting the DEEP_METRICS_ENABLED=true environment variable in the Authorino deployment.
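For illustration, an AuthConfig fragment opting individual evaluators into deep metrics might look like the sketch below. Only the `metrics` field comes from this PR; the evaluator names and types are hypothetical.

```yaml
# Sketch only – evaluator names are made up; `metrics` is the opt-in flag added by this PR
spec:
  identity:
    - name: api-key-users   # hypothetical evaluator name
      metrics: true         # exports auth_server_evaluator_* metrics for this evaluator
  authorization:
    - name: admins-only     # hypothetical evaluator name
      metrics: false        # default: no evaluator-level metrics
```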

Verification steps

  • `make local-setup`
  • (Optional) Create an AuthConfig (e.g. API key authn and wristband token)
  • (Optional) Set `metrics: true` for at least one evaluator of the AuthConfig
  • (Optional) `kubectl -n authorino set env deployment/authorino DEEP_METRICS_ENABLED=true`
  • (Optional) Send requests to the protected API and to the well-known OIDC endpoints for the Festival Wristband config and JWKS
  • `kubectl -n authorino port-forward service/authorino-controller-metrics 8080:8080`
  • `curl http://localhost:8080/metrics`
  • `curl http://localhost:8080/server-metrics`

@guicassolato self-assigned this Feb 25, 2022
@guicassolato force-pushed the observability branch 2 times, most recently from 9ddf1eb to ae0b37d on March 1, 2022 17:52
| Metric name                             | Description                                                                               | Labels                                                        | Type      |
| --------------------------------------- | ----------------------------------------------------------------------------------------- | ------------------------------------------------------------- | --------- |
| auth_server_evaluator_total             | Total number of evaluations of individual authconfig rule performed by the auth server.   | `namespace`, `authconfig`, `evaluator_type`, `evaluator_name` | counter   |
| auth_server_evaluator_cancelled         | Number of evaluations of individual authconfig rule cancelled by the auth server.         | `namespace`, `authconfig`, `evaluator_type`, `evaluator_name` | counter   |
| auth_server_evaluator_ignored           | Number of evaluations of individual authconfig rule ignored by the auth server.           | `namespace`, `authconfig`, `evaluator_type`, `evaluator_name` | counter   |
| auth_server_evaluator_denied            | Number of denials from individual authconfig rule evaluated by the auth server.           | `namespace`, `authconfig`, `evaluator_type`, `evaluator_name` | counter   |
| auth_server_evaluator_duration_seconds  | Response latency of individual authconfig rule evaluated by the auth server (in seconds). | `namespace`, `authconfig`, `evaluator_type`, `evaluator_name` | histogram |
| auth_server_authconfig_total            | Total number of authconfigs enforced by the auth server, partitioned by authconfig.       | `namespace`, `authconfig`                                     | counter   |
| auth_server_authconfig_response_status  | Response status of authconfigs sent by the auth server, partitioned by authconfig.        | `namespace`, `authconfig`, `status`                           | counter   |
| auth_server_authconfig_duration_seconds | Response latency of authconfig enforced by the auth server (in seconds).                  | `namespace`, `authconfig`                                     | histogram |
| auth_server_response_status             | Response status of authconfigs sent by the auth server.                                   | `status`                                                      | counter   |

Added field option `spec.(identity|metadata|authorization|response).monit: bool` to enable/disable metrics at evaluator granularity level (default: false).
@guicassolato marked this pull request as ready for review March 2, 2022 09:25
@guicassolato requested review from jjaferson and a team March 2, 2022 09:26
jjaferson previously approved these changes Mar 3, 2022

@jjaferson (Contributor) left a comment:

Great work @guicassolato, left two suggestions there for us to think about.

```diff
@@ -140,6 +140,10 @@ type Identity struct {
 	// +kubebuilder:default:=0
 	Priority int `json:"priority,omitempty"`
+
+	// Whether this identity config should generate individual observability metrics
+	// +kubebuilder:default:=false
+	Metrics bool `json:"metrics,omitempty"`
```
@jjaferson (Contributor) commented:

I see the advantages of enabling metrics at the auth config level, the main point being that auth configs that don't need to be monitored can be discarded which will reduce the cost of generating and processing more data but I was just wondering how confusing it can be to monitor these auth configs at a service level as only the auth configs with the metrics enabled will be exposed.

@guicassolato (Collaborator, Author) replied:

> will reduce the cost of generating and processing

I'd also include memory consumption.

I guess I'd rather not have metrics at the AuthConfig level at all than enable them for all by default. We're talking tons of useless stats.

At the same time, not having those stats can make users' lives very difficult. There are usually simple tweaks one can do (especially after we improve caching) to improve the performance of an AuthConfig, to work around critical paths momentarily, etc., but only if people are able to know where to act.

I really think we need these metrics and that they have to be opt-in.

> how confusing it can be to monitor these auth configs at a service level

I really don't have a good answer to this, I'm afraid. Maybe if it was simpler to set a separate metrics endpoint per AuthConfig (e.g. /server-metrics/my-authconfig-ns/my-authconfig-name); even so, it wouldn't solve the issue of enabled/disabled.

Do you have any suggestion?

Collaborator commented:

There is something similar to this in SSO, where each realm has a metrics endpoint. A follow-on issue here might be to create a Grafana dashboard using these metrics. It would be a practical exercise in trying to use these metrics to visualise useful data.

@guicassolato (Collaborator, Author) replied:

> how confusing it can be to monitor these auth configs at a service level

Introduced a new env var, `DEEP_METRICS_ENABLED`: it enforces deep (evaluator-level) metrics exported for all AuthConfigs, so there is an easier way to enable these metrics than going AuthConfig by AuthConfig and adding `metrics: true` for each evaluator.

This will be useful for debugging and for enabling deep monitoring at the service level.


```go
// impl:metrics.Object

func (config *AuthorizationConfig) Measured() bool {
```
@jjaferson (Contributor) commented:

Just an idea: `measured`, to me, sounds more like something related to size or distance; maybe `exportMetrics` would be easier to read?

@guicassolato (Collaborator, Author) replied:

Changed to `MetricsEnabled()`.
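The pattern discussed in this thread, where each evaluator config type reports its own opt-in flag through a small interface, might look like the sketch below. Names other than `MetricsEnabled` (including the interface name and `reportDuration`) are hypothetical, not Authorino's actual source.

```go
package main

import "fmt"

// MetricsReporter is a hypothetical interface name; each evaluator config
// type reports whether it opted in to individual observability metrics.
type MetricsReporter interface {
	MetricsEnabled() bool
}

type AuthorizationConfig struct {
	Metrics bool // mirrors the `metrics: true` field in the AuthConfig spec
}

// impl:MetricsReporter
func (config *AuthorizationConfig) MetricsEnabled() bool {
	return config.Metrics
}

// reportDuration records an observation only for opted-in evaluators; it
// returns whether a metric would have been emitted. In the real server this
// is where auth_server_evaluator_duration_seconds would be observed.
func reportDuration(obj MetricsReporter, seconds float64) bool {
	if !obj.MetricsEnabled() {
		return false // deep metrics disabled for this evaluator
	}
	return true
}

func main() {
	fmt.Println(reportDuration(&AuthorizationConfig{Metrics: true}, 0.03))  // true
	fmt.Println(reportDuration(&AuthorizationConfig{Metrics: false}, 0.03)) // false
}
```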

maleck13 commented Mar 4, 2022

This may be something that has already been thought about, but do any of these metrics use labels that can have lots of unique values, like something generated? These can create quite an overhead on Prometheus, as each unique value is stored as an individual time series. I mention it because we hit this issue with SSO, where it was recording the URL but there were potentially millions of unique values for a particular endpoint.

@guicassolato (Collaborator, Author) replied:

> This may be something that has already been thought about, but do any of these metrics use labels that can have lots of unique values, like something generated?

We made sure no user input is used in metric labels. (By "no user input" here, I mean nothing from the Envoy payload to the /Check ext-authz operation; therefore, "user" as in one who sends a request to an API protected with Authorino.)

URL paths are not used in the labels either.

This should be enough to protect the metrics endpoints against DoS via label-cardinality explosion.

We do use AuthConfig names to partition some metrics. The name of the AuthConfig is used for individual measurements of duration and counters (number of hits, response codes) of an AuthConfig (always enabled). Then, we have metrics at the granularity of each evaluator in an AuthConfig – i.e. the name of the evaluator is used in the label. These are opt-in metrics.
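The cardinality concern can be illustrated with a tiny sketch (illustrative only; the label values below are made up): every unique label value becomes its own time series in Prometheus, so bounded, operator-controlled values like AuthConfig names stay cheap, while request-derived values such as URL paths would grow without bound.

```go
package main

import "fmt"

// seriesCount returns how many distinct time series a single metric would
// produce for the given stream of label values: one per unique value.
func seriesCount(labelValues []string) int {
	unique := map[string]struct{}{}
	for _, v := range labelValues {
		unique[v] = struct{}{}
	}
	return len(unique)
}

func main() {
	// Bounded: the operator controls how many AuthConfigs exist.
	byAuthConfig := []string{"talker-api", "talker-api", "echo-api"}
	// Unbounded: a client could mint a fresh label value on every request.
	byPath := []string{"/item/1", "/item/2", "/item/3", "/item/4"}
	fmt.Println(seriesCount(byAuthConfig)) // 2
	fmt.Println(seriesCount(byPath))       // 4
}
```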

@guicassolato merged commit fa9361d into main Mar 7, 2022
@guicassolato deleted the observability branch March 7, 2022 12:12