prometheus rules for client-go #1272

Closed
wants to merge 2 commits into from
74 changes: 74 additions & 0 deletions bindata/assets/alerts/client-go-requests.yaml
@@ -0,0 +1,74 @@
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: client-go
namespace: openshift-kube-apiserver
spec:
groups:
- name: control-plane-client-go
rules:
# record request duration rate by:
# - job (kubelet, apiserver, controller-manager, ...)
# - destination host: internal-lb, services, localhost
- record: internal_load_balancer:rest_client_request_duration_seconds:rate5m
expr: |
sum(rate(
label_replace(rest_client_request_duration_seconds_bucket{verb="GET",url=~"https?://api-int.*"},
"host","$1","url","https?://([^/\\s]+).*")[5m:30s]
)) by (le,host,service,namespace,node)
- record: service:rest_client_request_duration_seconds:rate5m
expr: |
sum(rate(
label_replace(rest_client_request_duration_seconds_bucket{verb="GET",url!~"https?://(api-int|\\[::1\\]|127\\.0\\.0\\.1|localhost).*"},
"host","$1","url","https?://([^/\\s]+).*")[5m:30s]
)) by (le,host,service,namespace)
- record: pod:rest_client_request_duration_seconds:rate5m
expr: |
sum(rate(
label_replace(rest_client_request_duration_seconds_bucket{verb="GET",url=~"https?://(\\[::1\\]|127\\.0\\.0\\.1|localhost).*"},
"host","$1","url","https?://([^/\\s]+).*")[5m:30s]
)) by (le,host,service,namespace)
# latency by destination: internal-lb, services, localhost
- record: internal_load_balancer:rest_client_request_duration_seconds:histogram_quantile
labels:
quantile: "0.99"
expr: |
histogram_quantile(0.99, sum by (le) (internal_load_balancer:rest_client_request_duration_seconds:rate5m))
- record: service:rest_client_request_duration_seconds:histogram_quantile
labels:
quantile: "0.99"
expr: |
histogram_quantile(0.99, sum by (le) (service:rest_client_request_duration_seconds:rate5m))
- record: pod:rest_client_request_duration_seconds:histogram_quantile
labels:
quantile: "0.99"
expr: |
histogram_quantile(0.99, sum by (le) (pod:rest_client_request_duration_seconds:rate5m))
# error by host and job, the rest_client metrics use the code "<error>" to aggregate
# any non-http error and avoid cardinality problems.
# xref: https://github.com/kubernetes/kubernetes/blob/66931c9b8f11a3f223ba1890e4f390cff74ce1a6/staging/src/k8s.io/client-go/rest/request.go#L785-L799
- record: internal_load_balancer:rest_client_requests_errors:ratio_rate5m
expr: |
sum(rate(rest_client_requests_total{code="<error>",host=~"api-int.*"}[5m])) by (host,service,namespace)
/ ignoring (host,service,namespace) group_left
sum(rate(rest_client_requests_total{host=~"api-int.*"}[5m]))
Comment on lines +52 to +54
Member

This is the same as:

Suggested change
- sum(rate(rest_client_requests_total{code="<error>",host=~"api-int.*"}[5m])) by (host,service,namespace)
- / ignoring (host,service,namespace) group_left
- sum(rate(rest_client_requests_total{host=~"api-int.*"}[5m]))
+ sum(rate(rest_client_requests_total{code="<error>",host=~"api-int.*"}[5m]))
+ /
+ sum(rate(rest_client_requests_total{host=~"api-int.*"}[5m]))

- record: service:rest_client_requests:errors:ratio_rate5m
expr: |
sum(rate(rest_client_requests_total{code="<error>",host!~"(api-int|\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[5m])) by (host,service,namespace)
/ ignoring (host,service,namespace) group_left
sum(rate(rest_client_requests_total{host!~"(api-int|\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[5m]))
Comment on lines +57 to +59
Member

This is the same as:

Suggested change
- sum(rate(rest_client_requests_total{code="<error>",host!~"(api-int|\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[5m])) by (host,service,namespace)
- / ignoring (host,service,namespace) group_left
- sum(rate(rest_client_requests_total{host!~"(api-int|\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[5m]))
+ sum(rate(rest_client_requests_total{code="<error>",host!~"(api-int|\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[5m]))
+ /
+ sum(rate(rest_client_requests_total{host!~"(api-int|\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[5m]))

Contributor Author

Hmm, that is the case since your change to remove the URL path, right?
Or am I missing something?

Contributor Author

Hmm, I don't think that is the same: it aggregates by those three labels. I can't remember the reason now, but I'm sure I had to do it this way for a reason (duplicates?).
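
For readers following this thread, here is a minimal Python sketch (not part of the PR) of how the two expressions differ. It assumes made-up per-series rates and hypothetical host, service and namespace values; the function names are illustrative only. With `by (host,service,namespace)` plus `ignoring (...) group_left`, each group's error rate is divided by the single overall total, so one series per group comes out. Without the grouping, a single unlabeled ratio comes out, which equals the sum of the per-group series.

```python
from collections import defaultdict

# Hypothetical per-series request rates (req/s), keyed by
# (host, service, namespace, code). All values are made up for illustration.
RATES = {
    ("172.30.0.1:443", "svc-a", "ns-1", "<error>"): 2.0,
    ("172.30.0.1:443", "svc-a", "ns-1", "200"): 18.0,
    ("172.30.0.1:443", "svc-b", "ns-2", "<error>"): 1.0,
    ("172.30.0.1:443", "svc-b", "ns-2", "200"): 19.0,
}


def grouped_error_ratio(rates):
    """Mimic: sum(errors) by (host,service,namespace)
    / ignoring (host,service,namespace) group_left sum(all requests)."""
    total = sum(rates.values())
    errors = defaultdict(float)
    for (host, service, namespace, code), value in rates.items():
        if code == "<error>":
            errors[(host, service, namespace)] += value
    # One output series per group, each divided by the same overall total.
    return {group: value / total for group, value in errors.items()}


def overall_error_ratio(rates):
    """Mimic: sum(errors) / sum(all requests), a single unlabeled series."""
    total = sum(rates.values())
    errors = sum(v for (_, _, _, code), v in rates.items() if code == "<error>")
    return errors / total


per_group = grouped_error_ratio(RATES)
overall = overall_error_ratio(RATES)
print(per_group)  # two series, one per (host, service, namespace)
print(overall)    # one value covering all groups
```

Under these assumptions the two forms are not interchangeable as recorded series, but summing the grouped result (as the later `...:rate5m:sum` rules do) gives the same number as the ungrouped form.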

- record: pod:rest_client_requests:errors:ratio_rate5m
expr: |
sum(rate(rest_client_requests_total{code="<error>",host=~"(\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[5m])) by (host,service,namespace)
/ ignoring (host,service,namespace) group_left
sum(rate(rest_client_requests_total{host=~"(\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[5m]))
Comment on lines +62 to +64
Member

This is the same as:

Suggested change
- sum(rate(rest_client_requests_total{code="<error>",host=~"(\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[5m])) by (host,service,namespace)
- / ignoring (host,service,namespace) group_left
- sum(rate(rest_client_requests_total{host=~"(\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[5m]))
+ sum(rate(rest_client_requests_total{code="<error>",host=~"(\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[5m]))
+ /
+ sum(rate(rest_client_requests_total{host=~"(\\[::1\\]|127\\.0\\.0\\.1|localhost).*"}[5m]))

# errors by destination
- record: internal_load_balancer:rest_client_requests_errors:rate5m:sum
expr: |
sum(internal_load_balancer:rest_client_requests_errors:ratio_rate5m)
- record: service:rest_client_requests:errors:rate5m:sum
expr: |
sum(service:rest_client_requests:errors:ratio_rate5m)
- record: pod:rest_client_requests:errors:rate5m:sum
expr: |
sum(pod:rest_client_requests:errors:ratio_rate5m)
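
The duration rules in this file split traffic by destination using regex matchers on `url` and derive a `host` label with `label_replace`. The following Python sketch (not part of the PR) shows what those patterns select and extract. The sample URLs and the helper names `extract_host` and `classify` are assumptions for illustration. PromQL anchors regexes at both ends, which `re.fullmatch` approximates here.

```python
import re

# Patterns copied from the rules, with the YAML/PromQL double escaping removed.
HOST_RE = re.compile(r"https?://([^/\s]+).*")
API_INT_RE = re.compile(r"https?://api-int.*")
LOCAL_RE = re.compile(r"https?://(\[::1\]|127\.0\.0\.1|localhost).*")


def extract_host(url):
    """Return what label_replace would write into "host", or None when the
    pattern does not match (label_replace then leaves the series unchanged)."""
    match = HOST_RE.fullmatch(url)
    return match.group(1) if match else None


def classify(url):
    """Return the recording-rule prefix whose url matcher selects this URL."""
    if API_INT_RE.fullmatch(url):
        return "internal_load_balancer"
    if LOCAL_RE.fullmatch(url):
        return "pod"
    # The service rules use a negative matcher (!~), so everything else lands here.
    return "service"


# Hypothetical URLs, for illustration only.
for url in (
    "https://api-int.example.com:6443/api/v1/pods",
    "https://[::1]:6443/healthz",
    "https://172.30.0.1:443/apis",
):
    print(classify(url), extract_host(url))
```

Note that the extracted host keeps the port, because the capture group stops only at the first `/` or whitespace character.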
1 change: 1 addition & 0 deletions pkg/operator/starter.go
@@ -149,6 +149,7 @@ func RunOperator(ctx context.Context, controllerContext *controllercmd.Controlle
"assets/kube-apiserver/storage-version-migration-prioritylevelconfiguration.yaml",
"assets/alerts/api-usage.yaml",
"assets/alerts/audit-errors.yaml",
"assets/alerts/client-go-requests.yaml",
"assets/alerts/cpu-utilization.yaml",
"assets/alerts/kube-apiserver-requests.yaml",
"assets/alerts/kube-apiserver-slos-basic.yaml",