fix: consider prometheus wal replay for ThanosSidecarUnhealthy alert
Prior to this fix, ThanosSidecarUnhealthy would fire even when
Prometheus is busy with WAL replay. This would trigger a false positive alert.

This PR takes the `prometheus_tsdb_data_replay_duration_seconds` metric from
Prometheus into account in the ThanosSidecarUnhealthy alert. To correlate
Thanos and Prometheus metrics, we need to specify the common label(s), which
can be configured through the `thanosPrometheusCommonDimensions` jsonnet
variable.
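
To make the new condition concrete, here is a minimal Go sketch of the instantaneous check the updated expression performs, assuming the common label is `pod`. The function name `unhealthyPods` and the map-based inputs are illustrative only and not part of Thanos; the `for: 5m` clause is not modeled. As the expression in this commit implies, a replay duration of 0 (or a missing Prometheus series for the pod) suppresses the alert.

```go
package main

import (
	"fmt"
	"sort"
)

// heartbeatThresholdSeconds mirrors the ">= 240" comparison in the alert expression.
const heartbeatThresholdSeconds = 240

// unhealthyPods (hypothetical helper) returns the pods for which the
// instantaneous alert condition holds: the sidecar heartbeat is stale AND a
// non-zero replay duration exists for the same pod.
// Both maps are keyed by the value of the common label (assumed to be "pod").
func unhealthyPods(now float64, lastHeartbeat, replayDuration map[string]float64) []string {
	var pods []string
	for pod, hb := range lastHeartbeat {
		if now-hb < heartbeatThresholdSeconds {
			continue // heartbeat is recent enough
		}
		replay, ok := replayDuration[pod]
		if !ok || replay == 0 {
			continue // no matching series, or replay duration is 0: suppress
		}
		pods = append(pods, pod)
	}
	sort.Strings(pods)
	return pods
}

func main() {
	now := 600.0 // corresponds to eval_time 10m in the unit tests
	lastHeartbeat := map[string]float64{"prometheus-0": 0, "prometheus-1": 0}
	replayDuration := map[string]float64{"prometheus-0": 0, "prometheus-1": 10}
	fmt.Println(unhealthyPods(now, lastHeartbeat, replayDuration))
}
```

With these inputs only `prometheus-1` is reported, which matches the expectation in the updated unit tests at 10m.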

Fixes #3915.

Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>
arajkumar committed Aug 4, 2021
1 parent aa148f8 commit f2b6546
Showing 8 changed files with 46 additions and 67 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -18,6 +18,7 @@ We use *breaking :warning:* to mark changes that are not backward compatible (re

- [#4468](https://github.com/thanos-io/thanos/pull/4468) Rule: Fix temporary rule filename composition issue.
- [#4476](https://github.com/thanos-io/thanos/pull/4476) UI: fix incorrect html escape sequence used for '>' symbol.
- [#4508](https://github.com/thanos-io/thanos/pull/4508) fix: consider prometheus wal replay for ThanosSidecarUnhealthy alert.

### Changed

10 changes: 0 additions & 10 deletions examples/alerts/alerts.md
@@ -300,16 +300,6 @@ rules:
```yaml
name: thanos-sidecar
rules:
- alert: ThanosSidecarPrometheusDown
annotations:
description: Thanos Sidecar {{$labels.instance}} cannot connect to Prometheus.
runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarprometheusdown
summary: Thanos Sidecar cannot connect to Prometheus
expr: |
thanos_sidecar_prometheus_up{job=~".*thanos-sidecar.*"} == 0
for: 5m
labels:
severity: critical
- alert: ThanosSidecarBucketOperationsFailed
annotations:
description: Thanos Sidecar {{$labels.instance}} bucket operations are failing
15 changes: 4 additions & 11 deletions examples/alerts/alerts.yaml
@@ -301,16 +301,6 @@ groups:
severity: warning
- name: thanos-sidecar
rules:
- alert: ThanosSidecarPrometheusDown
annotations:
description: Thanos Sidecar {{$labels.instance}} cannot connect to Prometheus.
runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarprometheusdown
summary: Thanos Sidecar cannot connect to Prometheus
expr: |
thanos_sidecar_prometheus_up{job=~".*thanos-sidecar.*"} == 0
for: 5m
labels:
severity: critical
- alert: ThanosSidecarBucketOperationsFailed
annotations:
description: Thanos Sidecar {{$labels.instance}} bucket operations are failing
@@ -328,7 +318,10 @@ groups:
runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy
summary: Thanos Sidecar is unhealthy.
expr: |
time() - max by (job, instance) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~".*thanos-sidecar.*"}) >= 240
time() - max by (pod, job, instance) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~".*thanos-sidecar.*"}) >= 240
AND on (pod) (
min by (pod) (prometheus_tsdb_data_replay_duration_seconds) != 0
)
for: 5m
labels:
severity: critical
63 changes: 34 additions & 29 deletions examples/alerts/tests.yaml
@@ -7,10 +7,14 @@ evaluation_interval: 1m
tests:
- interval: 1m
input_series:
- series: 'thanos_sidecar_last_heartbeat_success_time_seconds{namespace="production", job="thanos-sidecar", instance="thanos-sidecar-0"}'
values: '5 10 43 17 11 0 0 0'
- series: 'thanos_sidecar_last_heartbeat_success_time_seconds{namespace="production", job="thanos-sidecar", instance="thanos-sidecar-1"}'
values: '4 9 42 15 10 0 0 0'
- series: 'thanos_sidecar_last_heartbeat_success_time_seconds{namespace="production", job="thanos-sidecar", instance="thanos-sidecar-0", pod="prometheus-0"}'
values: '5 10 43 17 10 0x5 0x10'
- series: 'thanos_sidecar_last_heartbeat_success_time_seconds{namespace="production", job="thanos-sidecar", instance="thanos-sidecar-1", pod="prometheus-1"}'
values: '4 9 42 15 10x5 0x10'
- series: 'prometheus_tsdb_data_replay_duration_seconds{namespace="production", job="prometheus-k8s", instance="prometheus-k8s-0", pod="prometheus-0"}'
values: '5x5 0x5 5x15'
- series: 'prometheus_tsdb_data_replay_duration_seconds{namespace="production", job="prometheus-k8s", instance="prometheus-k8s-1", pod="prometheus-1"}'
values: '10x20'
promql_expr_test:
- expr: time()
eval_time: 1m
@@ -22,6 +26,13 @@ tests:
exp_samples:
- labels: '{}'
value: 120
- expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, instance)
eval_time: 0m
exp_samples:
- labels: '{job="thanos-sidecar", instance="thanos-sidecar-0"}'
value: 5
- labels: '{job="thanos-sidecar", instance="thanos-sidecar-1"}'
value: 4
- expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, instance)
eval_time: 2m
exp_samples:
@@ -37,7 +48,7 @@ tests:
- labels: '{job="thanos-sidecar", instance="thanos-sidecar-1"}'
value: 0
- expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, instance)
eval_time: 11m
eval_time: 10m
exp_samples:
- labels: '{job="thanos-sidecar", instance="thanos-sidecar-0"}'
value: 0
@@ -64,6 +75,21 @@ tests:
value: 720
- labels: '{job="thanos-sidecar", instance="thanos-sidecar-1"}'
value: 720
- expr: min(prometheus_tsdb_data_replay_duration_seconds{job="prometheus-k8s"}) by(job, instance)
eval_time: 0m
exp_samples:
- labels: '{job="prometheus-k8s", instance="prometheus-k8s-0"}'
value: 5
- labels: '{job="prometheus-k8s", instance="prometheus-k8s-1"}'
value: 10
- expr: min(prometheus_tsdb_data_replay_duration_seconds{job="prometheus-k8s"}) by(job, instance)
eval_time: 6m
exp_samples:
- labels: '{job="prometheus-k8s", instance="prometheus-k8s-0"}'
value: 0
- labels: '{job="prometheus-k8s", instance="prometheus-k8s-1"}'
value: 10

alert_rule_test:
- eval_time: 1m
alertname: ThanosSidecarUnhealthy
@@ -74,56 +100,35 @@ tests:
- eval_time: 10m
alertname: ThanosSidecarUnhealthy
exp_alerts:
- exp_labels:
severity: critical
job: thanos-sidecar
instance: thanos-sidecar-0
exp_annotations:
description: 'Thanos Sidecar thanos-sidecar-0 is unhealthy for more than 600 seconds.'
runbook_url: 'https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy'
summary: 'Thanos Sidecar is unhealthy.'
- exp_labels:
severity: critical
job: thanos-sidecar
instance: thanos-sidecar-1
pod: prometheus-1
exp_annotations:
description: 'Thanos Sidecar thanos-sidecar-1 is unhealthy for more than 600 seconds.'
runbook_url: 'https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy'
summary: 'Thanos Sidecar is unhealthy.'
- eval_time: 11m
alertname: ThanosSidecarUnhealthy
exp_alerts:
- exp_labels:
severity: critical
job: thanos-sidecar
instance: thanos-sidecar-0
exp_annotations:
description: 'Thanos Sidecar thanos-sidecar-0 is unhealthy for more than 660 seconds.'
runbook_url: 'https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy'
summary: 'Thanos Sidecar is unhealthy.'
- exp_labels:
severity: critical
job: thanos-sidecar
instance: thanos-sidecar-1
pod: prometheus-1
exp_annotations:
description: 'Thanos Sidecar thanos-sidecar-1 is unhealthy for more than 660 seconds.'
runbook_url: 'https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy'
summary: 'Thanos Sidecar is unhealthy.'
- eval_time: 12m
alertname: ThanosSidecarUnhealthy
exp_alerts:
- exp_labels:
severity: critical
job: thanos-sidecar
instance: thanos-sidecar-0
exp_annotations:
description: 'Thanos Sidecar thanos-sidecar-0 is unhealthy for more than 720 seconds.'
runbook_url: 'https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy'
summary: 'Thanos Sidecar is unhealthy.'
- exp_labels:
severity: critical
job: thanos-sidecar
instance: thanos-sidecar-1
pod: prometheus-1
exp_annotations:
description: 'Thanos Sidecar thanos-sidecar-1 is unhealthy for more than 720 seconds.'
runbook_url: 'https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy'
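
The `input_series` values in the tests above use promtool's expanding notation, where `axn` stands for the initial sample `a` followed by `n` further samples of the same value. The Go sketch below expands only plain numbers and that `axn` shorthand; it is an illustration, not promtool's parser, and the function name `expandValues` is hypothetical. Forms such as `a+bxn`, `_`, and `stale` are deliberately not handled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// expandValues (hypothetical helper) expands a promtool-style values string
// made of plain numbers and "axn" tokens into individual samples.
func expandValues(s string) ([]float64, error) {
	var out []float64
	for _, tok := range strings.Fields(s) {
		base, count, repeated := strings.Cut(tok, "x")
		v, err := strconv.ParseFloat(base, 64)
		if err != nil {
			return nil, fmt.Errorf("bad value %q: %w", tok, err)
		}
		n := 0
		if repeated {
			if n, err = strconv.Atoi(count); err != nil {
				return nil, fmt.Errorf("bad repeat count %q: %w", tok, err)
			}
		}
		// "axn" yields the initial sample plus n further samples.
		for i := 0; i <= n; i++ {
			out = append(out, v)
		}
	}
	return out, nil
}

func main() {
	vals, err := expandValues("5x5 0x5 5x15")
	if err != nil {
		panic(err)
	}
	// With a 1m interval, index i is the sample at eval_time i minutes.
	fmt.Println(len(vals), vals[0], vals[6])
}
```

Under this reading, the `prometheus-0` replay series is 5 from 0m to 5m, 0 from 6m to 11m, and 5 afterwards, which is consistent with the expected value of 0 at `eval_time: 6m`.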
20 changes: 5 additions & 15 deletions mixin/alerts/sidecar.libsonnet
@@ -2,6 +2,7 @@
local thanos = self,
sidecar+:: {
selector: error 'must provide selector for Thanos Sidecar alerts',
thanosPrometheusCommonDimensions: error 'must provide commonDimensions between Thanos and Prometheus metrics for Sidecar alerts',
dimensions: std.join(', ', std.objectFields(thanos.targetGroups) + ['job', 'instance']),
},
prometheusAlerts+:: {
@@ -10,20 +11,6 @@
{
name: 'thanos-sidecar',
rules: [
{
alert: 'ThanosSidecarPrometheusDown',
annotations: {
description: 'Thanos Sidecar {{$labels.instance}}%s cannot connect to Prometheus.' % location,
summary: 'Thanos Sidecar cannot connect to Prometheus',
},
expr: |||
thanos_sidecar_prometheus_up{%(selector)s} == 0
||| % thanos.sidecar,
'for': '5m',
labels: {
severity: 'critical',
},
},
{
alert: 'ThanosSidecarBucketOperationsFailed',
annotations: {
@@ -45,7 +32,10 @@
summary: 'Thanos Sidecar is unhealthy.',
},
expr: |||
time() - max by (%(dimensions)s) (thanos_sidecar_last_heartbeat_success_time_seconds{%(selector)s}) >= 240
time() - max by (%(thanosPrometheusCommonDimensions)s, %(dimensions)s) (thanos_sidecar_last_heartbeat_success_time_seconds{%(selector)s}) >= 240
AND on (%(thanosPrometheusCommonDimensions)s) (
min by (%(thanosPrometheusCommonDimensions)s) (prometheus_tsdb_data_replay_duration_seconds) != 0
)
||| % thanos.sidecar,
'for': '5m',
labels: {
1 change: 1 addition & 0 deletions mixin/config.libsonnet
@@ -46,6 +46,7 @@
},
sidecar+:: {
selector: 'job=~".*thanos-sidecar.*"',
thanosPrometheusCommonDimensions: 'pod',
title: '%(prefix)sSidecar' % $.dashboard.prefix,
},
// TODO(kakkoyun): Fix naming convention: bucketReplicate
1 change: 0 additions & 1 deletion mixin/runbook.md
@@ -85,7 +85,6 @@

|Name|Summary|Description|Severity|Runbook|
|---|---|---|---|---|
|ThanosSidecarPrometheusDown|Thanos Sidecar cannot connect to Prometheus|Thanos Sidecar {{$labels.instance}} cannot connect to Prometheus.|critical|[https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarprometheusdown](https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarprometheusdown)|
|ThanosSidecarBucketOperationsFailed|Thanos Sidecar bucket operations are failing|Thanos Sidecar {{$labels.instance}} bucket operations are failing|critical|[https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarbucketoperationsfailed](https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarbucketoperationsfailed)|
|ThanosSidecarUnhealthy|Thanos Sidecar is unhealthy.|Thanos Sidecar {{$labels.instance}} is unhealthy for more than {{$value}} seconds.|critical|[https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy](https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy)|

2 changes: 1 addition & 1 deletion pkg/rules/rules_test.go
@@ -82,7 +82,7 @@ func testRulesAgainstExamples(t *testing.T, dir string, server rulespb.RulesServ
{
Name: "thanos-sidecar",
File: filepath.Join(dir, "alerts.yaml"),
Rules: []*rulespb.Rule{someAlert, someAlert, someAlert},
Rules: []*rulespb.Rule{someAlert, someAlert},
Interval: 60,
PartialResponseStrategy: storepb.PartialResponseStrategy_ABORT,
},