Adjust and rename ThanosSidecarUnhealthy to `ThanosSidecarNoConnectionToStartedPrometheus`; Remove `ThanosSidecarPrometheusDown` alert; Remove unused `thanos_sidecar_last_heartbeat_success_time_seconds` metrics (#4508)

* Refactor sidecar alerts

Prior to this fix, ThanosSidecarUnhealthy would fire even while
Prometheus was busy with WAL replay, producing a false-positive alert.

This PR takes the prometheus_tsdb_data_replay_duration_seconds metric from
Prometheus into account for the ThanosSidecarUnhealthy alert. To correlate
Thanos and Prometheus metrics, we need to specify the common label(s), which
can be configured through the thanosPrometheusCommonDimensions jsonnet
variable.

This PR also removes ThanosSidecarPrometheusDown, as it would fire at the same time as ThanosSidecarUnhealthy.

Fixes #3915.

Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>

* Rename ThanosSidecarUnhealthy to ThanosSidecarNoConnectionToStartedPrometheus

Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>

* Simplify ThanosSidecarNoConnectionToStartedPrometheus using thanos_sidecar_prometheus_up

Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>

* Remove unused implementation of thanos_sidecar_last_heartbeat_success_time_seconds metric

Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>

Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
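
The matching logic behind the reworked alert can be sketched in Go. This is a minimal illustration, not Thanos or Prometheus code: the `labelKey` and `sample` types and the `noConnectionToStarted` function are hypothetical names, the sketch assumes the default `namespace, pod` value for `thanosPrometheusCommonDimensions`, and it ignores the `for: 5m` hold period. It also assumes that a non-zero `prometheus_tsdb_data_replay_duration_seconds` means WAL replay has finished.

```go
package main

import "fmt"

// labelKey holds the labels listed in thanosPrometheusCommonDimensions
// (assumed here to be the default "namespace, pod").
type labelKey struct {
	Namespace, Pod string
}

// sample is a hypothetical stand-in for one element of an instant vector.
type sample struct {
	Key   labelKey
	Value float64
}

// noConnectionToStarted mimics the alert expression
//
//	thanos_sidecar_prometheus_up == 0
//	AND on (namespace, pod)
//	prometheus_tsdb_data_replay_duration_seconds != 0
//
// It returns the keys of sidecars that report Prometheus as unreachable
// while the matching Prometheus reports a non-zero replay duration.
func noConnectionToStarted(up, replay []sample) []labelKey {
	started := make(map[labelKey]bool)
	for _, s := range replay {
		if s.Value != 0 {
			started[s.Key] = true
		}
	}
	var firing []labelKey
	for _, s := range up {
		if s.Value == 0 && started[s.Key] {
			firing = append(firing, s.Key)
		}
	}
	return firing
}

func main() {
	p0 := labelKey{"production", "prometheus-0"}
	p1 := labelKey{"production", "prometheus-1"}

	// Both sidecars report that they cannot reach Prometheus.
	up := []sample{{p0, 0}, {p1, 0}}
	// prometheus-0 still shows a zero replay duration; prometheus-1 does not.
	replay := []sample{{p0, 0}, {p1, 10}}

	fmt.Println(noConnectionToStarted(up, replay))
}
```

In this run only the `prometheus-1` key is returned: the sidecar next to `prometheus-0` is also disconnected, but the zero replay duration filters it out, which is the false positive the commit message describes.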
arajkumar and bwplotka authored Sep 24, 2021
1 parent 0d524a9 commit d5351b0
Showing 10 changed files with 71 additions and 152 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -20,6 +20,10 @@ We use *breaking :warning:* to mark changes that are not backward compatible (re
- [#4679](https://github.com/thanos-io/thanos/pull/4679) Added `enable-feature` flag to enable negative offsets and @ modifier, similar to Prometheus.
- [#4696](https://github.com/thanos-io/thanos/pull/4696) Query: add cache name to tracing spans.

### Fixed

- [#4508](https://github.com/thanos-io/thanos/pull/4508) Adjust and rename `ThanosSidecarUnhealthy` to `ThanosSidecarNoConnectionToStartedPrometheus`; Remove `ThanosSidecarPrometheusDown` alert; Remove unused `thanos_sidecar_last_heartbeat_success_time_seconds` metrics.

## v0.23.0 - In Progress

### Added
6 changes: 0 additions & 6 deletions cmd/thanos/sidecar.go
@@ -138,10 +138,6 @@ func runSidecar(
Name: "thanos_sidecar_prometheus_up",
Help: "Boolean indicator whether the sidecar can reach its Prometheus peer.",
})
lastHeartbeat := promauto.With(reg).NewGauge(prometheus.GaugeOpts{
Name: "thanos_sidecar_last_heartbeat_success_time_seconds",
Help: "Timestamp of the last successful heartbeat in seconds.",
})

ctx, cancel := context.WithCancel(context.Background())
g.Add(func() error {
@@ -191,7 +187,6 @@ func runSidecar(
)
promUp.Set(1)
statusProber.Ready()
lastHeartbeat.SetToCurrentTime()
return nil
})
if err != nil {
@@ -213,7 +208,6 @@ func runSidecar(
promUp.Set(0)
} else {
promUp.Set(1)
lastHeartbeat.SetToCurrentTime()
}

return nil
24 changes: 8 additions & 16 deletions examples/alerts/alerts.md
@@ -296,16 +296,6 @@ rules:
```yaml mdox-exec="cat examples/tmp/thanos-sidecar.yaml"
name: thanos-sidecar
rules:
- alert: ThanosSidecarPrometheusDown
annotations:
description: Thanos Sidecar {{$labels.instance}} cannot connect to Prometheus.
runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarprometheusdown
summary: Thanos Sidecar cannot connect to Prometheus
expr: |
thanos_sidecar_prometheus_up{job=~".*thanos-sidecar.*"} == 0
for: 5m
labels:
severity: critical
- alert: ThanosSidecarBucketOperationsFailed
annotations:
description: Thanos Sidecar {{$labels.instance}} bucket operations are failing
@@ -316,14 +306,16 @@ rules:
for: 5m
labels:
severity: critical
- alert: ThanosSidecarUnhealthy
- alert: ThanosSidecarNoConnectionToStartedPrometheus
annotations:
description: Thanos Sidecar {{$labels.instance}} is unhealthy for more than {{$value}}
seconds.
runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy
summary: Thanos Sidecar is unhealthy.
description: Thanos Sidecar {{$labels.instance}} is unhealthy.
runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarnoconnectiontostartedprometheus
summary: Thanos Sidecar cannot access Prometheus, even though Prometheus seems
healthy and has reloaded WAL.
expr: |
time() - max by (job, instance) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~".*thanos-sidecar.*"}) >= 240
thanos_sidecar_prometheus_up{job=~".*thanos-sidecar.*"} == 0
AND on (namespace, pod)
prometheus_tsdb_data_replay_duration_seconds != 0
for: 5m
labels:
severity: critical
24 changes: 8 additions & 16 deletions examples/alerts/alerts.yaml
@@ -301,16 +301,6 @@ groups:
severity: warning
- name: thanos-sidecar
rules:
- alert: ThanosSidecarPrometheusDown
annotations:
description: Thanos Sidecar {{$labels.instance}} cannot connect to Prometheus.
runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarprometheusdown
summary: Thanos Sidecar cannot connect to Prometheus
expr: |
thanos_sidecar_prometheus_up{job=~".*thanos-sidecar.*"} == 0
for: 5m
labels:
severity: critical
- alert: ThanosSidecarBucketOperationsFailed
annotations:
description: Thanos Sidecar {{$labels.instance}} bucket operations are failing
@@ -321,14 +311,16 @@ groups:
for: 5m
labels:
severity: critical
- alert: ThanosSidecarUnhealthy
- alert: ThanosSidecarNoConnectionToStartedPrometheus
annotations:
description: Thanos Sidecar {{$labels.instance}} is unhealthy for more than
{{$value}} seconds.
runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy
summary: Thanos Sidecar is unhealthy.
description: Thanos Sidecar {{$labels.instance}} is unhealthy.
runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarnoconnectiontostartedprometheus
summary: Thanos Sidecar cannot access Prometheus, even though Prometheus seems
healthy and has reloaded WAL.
expr: |
time() - max by (job, instance) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~".*thanos-sidecar.*"}) >= 240
thanos_sidecar_prometheus_up{job=~".*thanos-sidecar.*"} == 0
AND on (namespace, pod)
prometheus_tsdb_data_replay_duration_seconds != 0
for: 5m
labels:
severity: critical
133 changes: 40 additions & 93 deletions examples/alerts/tests.yaml
@@ -7,127 +7,74 @@ evaluation_interval: 1m
tests:
- interval: 1m
input_series:
- series: 'thanos_sidecar_last_heartbeat_success_time_seconds{namespace="production", job="thanos-sidecar", instance="thanos-sidecar-0"}'
values: '5 10 43 17 11 0 0 0'
- series: 'thanos_sidecar_last_heartbeat_success_time_seconds{namespace="production", job="thanos-sidecar", instance="thanos-sidecar-1"}'
values: '4 9 42 15 10 0 0 0'
promql_expr_test:
- expr: time()
eval_time: 1m
exp_samples:
- labels: '{}'
value: 60
- expr: time()
eval_time: 2m
exp_samples:
- labels: '{}'
value: 120
- expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, instance)
eval_time: 2m
exp_samples:
- labels: '{job="thanos-sidecar", instance="thanos-sidecar-0"}'
value: 43
- labels: '{job="thanos-sidecar", instance="thanos-sidecar-1"}'
value: 42
- expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, instance)
eval_time: 10m
exp_samples:
- labels: '{job="thanos-sidecar", instance="thanos-sidecar-0"}'
value: 0
- labels: '{job="thanos-sidecar", instance="thanos-sidecar-1"}'
value: 0
- expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, instance)
eval_time: 11m
exp_samples:
- labels: '{job="thanos-sidecar", instance="thanos-sidecar-0"}'
value: 0
- labels: '{job="thanos-sidecar", instance="thanos-sidecar-1"}'
value: 0
- expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, instance)
eval_time: 10m
exp_samples:
- labels: '{job="thanos-sidecar", instance="thanos-sidecar-0"}'
value: 600
- labels: '{job="thanos-sidecar", instance="thanos-sidecar-1"}'
value: 600
- expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, instance)
eval_time: 11m
exp_samples:
- labels: '{job="thanos-sidecar", instance="thanos-sidecar-0"}'
value: 660
- labels: '{job="thanos-sidecar", instance="thanos-sidecar-1"}'
value: 660
- expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, instance) >= 600
eval_time: 12m
exp_samples:
- labels: '{job="thanos-sidecar", instance="thanos-sidecar-0"}'
value: 720
- labels: '{job="thanos-sidecar", instance="thanos-sidecar-1"}'
value: 720
- series: 'thanos_sidecar_prometheus_up{namespace="production", job="thanos-sidecar", instance="thanos-sidecar-0", pod="prometheus-0"}'
values: '1x5 0x15'
- series: 'thanos_sidecar_prometheus_up{namespace="production", job="thanos-sidecar", instance="thanos-sidecar-1", pod="prometheus-1"}'
values: '1x4 0x15'
- series: 'prometheus_tsdb_data_replay_duration_seconds{namespace="production", job="prometheus-k8s", instance="prometheus-k8s-0", pod="prometheus-0"}'
values: '4x5 0x5 5x15'
- series: 'prometheus_tsdb_data_replay_duration_seconds{namespace="production", job="prometheus-k8s", instance="prometheus-k8s-1", pod="prometheus-1"}'
values: '10x14 0x6'
alert_rule_test:
- eval_time: 1m
alertname: ThanosSidecarUnhealthy
alertname: ThanosSidecarNoConnectionToStartedPrometheus
- eval_time: 2m
alertname: ThanosSidecarUnhealthy
alertname: ThanosSidecarNoConnectionToStartedPrometheus
- eval_time: 3m
alertname: ThanosSidecarUnhealthy
alertname: ThanosSidecarNoConnectionToStartedPrometheus
- eval_time: 10m
alertname: ThanosSidecarUnhealthy
alertname: ThanosSidecarNoConnectionToStartedPrometheus
exp_alerts:
- exp_labels:
severity: critical
job: thanos-sidecar
instance: thanos-sidecar-0
exp_annotations:
description: 'Thanos Sidecar thanos-sidecar-0 is unhealthy for more than 600 seconds.'
runbook_url: 'https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy'
summary: 'Thanos Sidecar is unhealthy.'
- exp_labels:
severity: critical
job: thanos-sidecar
instance: thanos-sidecar-1
namespace: production
pod: prometheus-1
exp_annotations:
description: 'Thanos Sidecar thanos-sidecar-1 is unhealthy for more than 600 seconds.'
runbook_url: 'https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy'
summary: 'Thanos Sidecar is unhealthy.'
description: 'Thanos Sidecar thanos-sidecar-1 is unhealthy.'
runbook_url: 'https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarnoconnectiontostartedprometheus'
summary: 'Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL.'
- eval_time: 11m
alertname: ThanosSidecarUnhealthy
alertname: ThanosSidecarNoConnectionToStartedPrometheus
exp_alerts:
- exp_labels:
severity: critical
job: thanos-sidecar
instance: thanos-sidecar-0
exp_annotations:
description: 'Thanos Sidecar thanos-sidecar-0 is unhealthy for more than 660 seconds.'
runbook_url: 'https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy'
summary: 'Thanos Sidecar is unhealthy.'
- exp_labels:
severity: critical
job: thanos-sidecar
instance: thanos-sidecar-1
namespace: production
pod: prometheus-1
exp_annotations:
description: 'Thanos Sidecar thanos-sidecar-1 is unhealthy for more than 660 seconds.'
runbook_url: 'https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy'
summary: 'Thanos Sidecar is unhealthy.'
description: 'Thanos Sidecar thanos-sidecar-1 is unhealthy.'
runbook_url: 'https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarnoconnectiontostartedprometheus'
summary: 'Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL.'
- eval_time: 12m
alertname: ThanosSidecarUnhealthy
alertname: ThanosSidecarNoConnectionToStartedPrometheus
exp_alerts:
- exp_labels:
severity: critical
job: thanos-sidecar
instance: thanos-sidecar-0
instance: thanos-sidecar-1
namespace: production
pod: prometheus-1
exp_annotations:
description: 'Thanos Sidecar thanos-sidecar-0 is unhealthy for more than 720 seconds.'
runbook_url: 'https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy'
summary: 'Thanos Sidecar is unhealthy.'
description: 'Thanos Sidecar thanos-sidecar-1 is unhealthy.'
runbook_url: 'https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarnoconnectiontostartedprometheus'
summary: 'Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL.'
- eval_time: 20m
alertname: ThanosSidecarNoConnectionToStartedPrometheus
exp_alerts:
- exp_labels:
severity: critical
job: thanos-sidecar
instance: thanos-sidecar-1
instance: thanos-sidecar-0
namespace: production
pod: prometheus-0
exp_annotations:
description: 'Thanos Sidecar thanos-sidecar-1 is unhealthy for more than 720 seconds.'
runbook_url: 'https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy'
summary: 'Thanos Sidecar is unhealthy.'
description: 'Thanos Sidecar thanos-sidecar-0 is unhealthy.'
runbook_url: 'https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarnoconnectiontostartedprometheus'
summary: 'Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL.'

- interval: 1m
input_series:
- series: 'prometheus_rule_evaluations_total{namespace="production", job="thanos-ruler", instance="thanos-ruler-0"}'
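
The `values` strings in the new sidecar test use promtool's expanding notation. The sketch below is a hedged illustration in Go, not promtool's parser: `expand` is a hypothetical helper that handles only the `axn` form used in this test, on the assumption that `axn` means the value `a` followed by `n` further copies (so `n+1` samples). It then checks, minute by minute, where the alert expression would match for `thanos-sidecar-1`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// expand turns a space-separated list of 'axn' tokens into samples.
// Assumption: 'axn' means the value a followed by n further copies
// (n+1 samples); plain numbers pass through as a single sample.
// Other forms of the notation ('a+bxn', '_', 'stale') are not handled.
func expand(values string) ([]float64, error) {
	var out []float64
	for _, tok := range strings.Fields(values) {
		val, count := tok, 0
		if i := strings.IndexByte(tok, 'x'); i >= 0 {
			n, err := strconv.Atoi(tok[i+1:])
			if err != nil {
				return nil, fmt.Errorf("bad repeat count in %q: %w", tok, err)
			}
			val, count = tok[:i], n
		}
		v, err := strconv.ParseFloat(val, 64)
		if err != nil {
			return nil, fmt.Errorf("bad value in %q: %w", tok, err)
		}
		for j := 0; j <= count; j++ {
			out = append(out, v)
		}
	}
	return out, nil
}

func main() {
	// Series for thanos-sidecar-1 / prometheus-1 from the test above.
	up, err := expand("1x4 0x15")
	if err != nil {
		panic(err)
	}
	replay, err := expand("10x14 0x6")
	if err != nil {
		panic(err)
	}

	// With interval: 1m, sample i is taken at minute i.
	for minute := range up {
		if minute < len(replay) && up[minute] == 0 && replay[minute] != 0 {
			fmt.Printf("minute %d: expression matches\n", minute)
		}
	}
}
```

Under these assumptions the expression matches from minute 5 through minute 14. If the alert fires once the expression has held for the full `for: 5m`, that lines up with the test expecting `thanos-sidecar-1` to fire at `eval_time: 10m` and no longer at `20m`.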
1 change: 1 addition & 0 deletions mixin/README.md
@@ -106,6 +106,7 @@ This project is intended to be used as a library. You can extend and customize d
},
sidecar+:: {
selector: 'job=~".*thanos-sidecar.*"',
thanosPrometheusCommonDimensions: 'namespace, pod',
title: '%(prefix)sSidecar' % $.dashboard.prefix,
},
// TODO(kakkoyun): Fix naming convention: bucketReplicate
25 changes: 7 additions & 18 deletions mixin/alerts/sidecar.libsonnet
@@ -2,6 +2,7 @@
local thanos = self,
sidecar+:: {
selector: error 'must provide selector for Thanos Sidecar alerts',
thanosPrometheusCommonDimensions: error 'must provide commonDimensions between Thanos and Prometheus metrics for Sidecar alerts',
dimensions: std.join(', ', std.objectFields(thanos.targetGroups) + ['job', 'instance']),
},
prometheusAlerts+:: {
@@ -10,20 +11,6 @@
{
name: 'thanos-sidecar',
rules: [
{
alert: 'ThanosSidecarPrometheusDown',
annotations: {
description: 'Thanos Sidecar {{$labels.instance}}%s cannot connect to Prometheus.' % location,
summary: 'Thanos Sidecar cannot connect to Prometheus',
},
expr: |||
thanos_sidecar_prometheus_up{%(selector)s} == 0
||| % thanos.sidecar,
'for': '5m',
labels: {
severity: 'critical',
},
},
{
alert: 'ThanosSidecarBucketOperationsFailed',
annotations: {
@@ -39,13 +26,15 @@
},
},
{
alert: 'ThanosSidecarUnhealthy',
alert: 'ThanosSidecarNoConnectionToStartedPrometheus',
annotations: {
description: 'Thanos Sidecar {{$labels.instance}}%s is unhealthy for more than {{$value}} seconds.' % location,
summary: 'Thanos Sidecar is unhealthy.',
description: 'Thanos Sidecar {{$labels.instance}}%s is unhealthy.' % location,
summary: 'Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL.',
},
expr: |||
time() - max by (%(dimensions)s) (thanos_sidecar_last_heartbeat_success_time_seconds{%(selector)s}) >= 240
thanos_sidecar_prometheus_up{%(selector)s} == 0
AND on (%(thanosPrometheusCommonDimensions)s)
prometheus_tsdb_data_replay_duration_seconds != 0
||| % thanos.sidecar,
'for': '5m',
labels: {
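
To see how the `selector` and `thanosPrometheusCommonDimensions` fields end up in the generated rule, the jsonnet `%(name)s` substitution can be imitated in Go. This is only a sketch under stated assumptions: `sidecarConfig` and `renderExpr` are hypothetical names, plain string replacement stands in for jsonnet's formatting operator, and the values are the defaults from `mixin/config.libsonnet`.

```go
package main

import (
	"fmt"
	"strings"
)

// sidecarConfig mirrors the two mixin fields the alert expression uses.
type sidecarConfig struct {
	Selector                         string
	ThanosPrometheusCommonDimensions string
}

// exprTemplate copies the expression from sidecar.libsonnet, placeholders included.
const exprTemplate = `thanos_sidecar_prometheus_up{%(selector)s} == 0
AND on (%(thanosPrometheusCommonDimensions)s)
prometheus_tsdb_data_replay_duration_seconds != 0
`

// renderExpr substitutes the named placeholders with the config values.
func renderExpr(c sidecarConfig) string {
	r := strings.NewReplacer(
		"%(selector)s", c.Selector,
		"%(thanosPrometheusCommonDimensions)s", c.ThanosPrometheusCommonDimensions,
	)
	return r.Replace(exprTemplate)
}

func main() {
	fmt.Print(renderExpr(sidecarConfig{
		Selector:                         `job=~".*thanos-sidecar.*"`,
		ThanosPrometheusCommonDimensions: "namespace, pod",
	}))
}
```

With the default values, the rendered text corresponds to the `expr` shown in `examples/alerts/alerts.yaml`. Setups whose Thanos and Prometheus series share different labels would override `thanosPrometheusCommonDimensions` accordingly.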
1 change: 1 addition & 0 deletions mixin/config.libsonnet
@@ -46,6 +46,7 @@
},
sidecar+:: {
selector: 'job=~".*thanos-sidecar.*"',
thanosPrometheusCommonDimensions: 'namespace, pod',
title: '%(prefix)sSidecar' % $.dashboard.prefix,
},
// TODO(kakkoyun): Fix naming convention: bucketReplicate
3 changes: 1 addition & 2 deletions mixin/runbook.md
@@ -85,9 +85,8 @@

|Name|Summary|Description|Severity|Runbook|
|---|---|---|---|---|
|ThanosSidecarPrometheusDown|Thanos Sidecar cannot connect to Prometheus|Thanos Sidecar {{$labels.instance}} cannot connect to Prometheus.|critical|[https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarprometheusdown](https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarprometheusdown)|
|ThanosSidecarBucketOperationsFailed|Thanos Sidecar bucket operations are failing|Thanos Sidecar {{$labels.instance}} bucket operations are failing|critical|[https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarbucketoperationsfailed](https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarbucketoperationsfailed)|
|ThanosSidecarUnhealthy|Thanos Sidecar is unhealthy.|Thanos Sidecar {{$labels.instance}} is unhealthy for more than {{$value}} seconds.|critical|[https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy](https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy)|
|ThanosSidecarNoConnectionToStartedPrometheus|Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL.|Thanos Sidecar {{$labels.instance}} is unhealthy.|critical|[https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarnoconnectiontostartedprometheus](https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarnoconnectiontostartedprometheus)|

## thanos-store

2 changes: 1 addition & 1 deletion pkg/rules/rules_test.go
@@ -82,7 +82,7 @@ func testRulesAgainstExamples(t *testing.T, dir string, server rulespb.RulesServ
{
Name: "thanos-sidecar",
File: filepath.Join(dir, "alerts.yaml"),
Rules: []*rulespb.Rule{someAlert, someAlert, someAlert},
Rules: []*rulespb.Rule{someAlert, someAlert},
Interval: 60,
PartialResponseStrategy: storepb.PartialResponseStrategy_ABORT,
},