Fix KubeClientCertificateExpiration alerts (#941)
1) Change the aggregation from `by (le)` to `without(service,endpoint...)`, which drops only the useless labels and keeps external labels (such as environment) intact. With `by`, those external labels get dropped.
2) Change the order of metrics in the expression: `apiserver_client_certificate_expiration_seconds_bucket` now comes first, so Grafana->Explore queries show the certificate's actual remaining lifetime (in seconds) as the result, not the `apiserver_client_certificate_expiration_seconds_count` value (which is quite useless). This makes troubleshooting easier.
3) Finally, fix the `on(job)` matching to become `on(job, cluster, instance)`. Otherwise, a single instance with a certificate expiration problem would be enough to set all apiservers to 'firing' (a false positive).
7840vz authored Nov 7, 2024
1 parent c70f03d commit 3830dfd
Showing 1 changed file with 6 additions and 2 deletions.
8 changes: 6 additions & 2 deletions alerts/kube_apiserver.libsonnet
@@ -50,7 +50,9 @@ local utils = import '../lib/utils.libsonnet';
{
alert: 'KubeClientCertificateExpiration',
expr: |||
- apiserver_client_certificate_expiration_seconds_count{%(kubeApiserverSelector)s} > 0 and on(%(clusterLabel)s, job) histogram_quantile(0.01, sum by (%(clusterLabel)s, job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{%(kubeApiserverSelector)s}[5m]))) < %(certExpirationWarningSeconds)s
+ histogram_quantile(0.01, sum without (%(namespaceLabel)s, service, endpoint) (rate(apiserver_client_certificate_expiration_seconds_bucket{%(kubeApiserverSelector)s}[5m]))) < %(certExpirationWarningSeconds)s
+ and
+ on(job, %(clusterLabel)s, instance) apiserver_client_certificate_expiration_seconds_count{%(kubeApiserverSelector)s} > 0
||| % $._config,
'for': '5m',
labels: {
@@ -64,7 +66,9 @@ local utils = import '../lib/utils.libsonnet';
{
alert: 'KubeClientCertificateExpiration',
expr: |||
- apiserver_client_certificate_expiration_seconds_count{%(kubeApiserverSelector)s} > 0 and on(%(clusterLabel)s, job) histogram_quantile(0.01, sum by (%(clusterLabel)s, job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{%(kubeApiserverSelector)s}[5m]))) < %(certExpirationCriticalSeconds)s
+ histogram_quantile(0.01, sum without (%(namespaceLabel)s, service, endpoint) (rate(apiserver_client_certificate_expiration_seconds_bucket{%(kubeApiserverSelector)s}[5m]))) < %(certExpirationCriticalSeconds)s
+ and
+ on(job, %(clusterLabel)s, instance) apiserver_client_certificate_expiration_seconds_count{%(kubeApiserverSelector)s} > 0
||| % $._config,
'for': '5m',
labels: {
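
To make the templated expression easier to read, here is a minimal standalone sketch of how the new warning rule might render. The `config` object is a hypothetical stand-in for `$._config`, and every value in it is an illustrative assumption, not taken from this commit.

```jsonnet
// Hypothetical stand-in for $._config; all values are illustrative assumptions.
local config = {
  kubeApiserverSelector: 'job="apiserver"',
  clusterLabel: 'cluster',
  namespaceLabel: 'namespace',
  certExpirationWarningSeconds: 7 * 24 * 3600,
};

{
  alert: 'KubeClientCertificateExpiration',
  // Same template as the new warning rule; `%` substitutes the %(name)s
  // placeholders with the matching fields of `config`.
  expr: |||
    histogram_quantile(0.01, sum without (%(namespaceLabel)s, service, endpoint) (rate(apiserver_client_certificate_expiration_seconds_bucket{%(kubeApiserverSelector)s}[5m]))) < %(certExpirationWarningSeconds)s
    and
    on(job, %(clusterLabel)s, instance) apiserver_client_certificate_expiration_seconds_count{%(kubeApiserverSelector)s} > 0
  ||| % config,
}
```

With these assumed values, the quantile is computed per remaining label set (including `instance`), and the `and on(job, cluster, instance)` clause only keeps a result when the same instance also reports a non-zero `_count` series.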
