Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjust and rename ThanosSidecarUnhealthy to ThanosSidecarNoConnectionToStartedPrometheus; Remove ThanosSidecarPrometheusDown alert; Remove unused thanos_sidecar_last_heartbeat_success_time_seconds metrics #4508

Merged
merged 4 commits into from
Sep 24, 2021

Conversation

arajkumar
Copy link
Contributor

@arajkumar arajkumar commented Aug 2, 2021

Prior to this fix, ThanosSidecarUnhealthy would fire even when
Prometheus is busy with WAL replay. This would trigger a false positive alert.

This PR considers prometheus_tsdb_data_replay_duration_seconds metric from
Prometheus for ThanosSidecarUnhealthy alert. In order to correlate
Thanos and Prometheus metrics we need to specify common label(s) which
can be configured through thanosPrometheusCommonDimensions jsonnet
variable.

This PR also removes ThanosSidecarPrometheusDown as it would fire at the same as ThanosSidecarUnhealthy.

Rename ThanosSidecarUnhealthy to ThanosSidecarNoConnectionToStartedPrometheus.

Remove unused thanos_sidecar_last_heartbeat_success_time_seconds metric.

Fixes #3915.

Signed-off-by: Arunprasad Rajkumar arajkuma@redhat.com

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Verification

bwplotka
bwplotka previously approved these changes Aug 5, 2021
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not 100% sure if this is right approach. Sounds like we would like to set some dependency and alert only if Prometheus is healthy but sidecar is not.

I think in this case that might work, but let's make sure to be explicit. I would even vote for alert name to be explit, to something SidecarNoConnectionToStartedPrometheus, WDYT?

examples/alerts/alerts.yaml Outdated Show resolved Hide resolved
examples/alerts/alerts.yaml Outdated Show resolved Hide resolved
@arajkumar
Copy link
Contributor Author

Thanks a lot @bwplotka for this quick review.

I am not 100% sure if this is right approach. Sounds like we would like to set some dependency and alert only if Prometheus is healthy but sidecar is not.

Do you mean to express alert dependencies through inhibition rules? If so, I thought of that and @simonpasquier suggested that it might over complicate.

I think in this case that might work, but let's make sure to be explicit. I would even vote for alert name to be explit, to something SidecarNoConnectionToStartedPrometheus, WDYT?

+1, How about ThanosSidecarNotConnectedToPrometheus?

@bwplotka
Copy link
Member

bwplotka commented Aug 5, 2021

Yes it might be hard, especially when one might not alert when Prometheus is reloading (which makes sense!)

I would prefer what I propose as from title you expect this failing when it's not connected, but it will NOT fail if not connected AND if Prometheus is reloading which might surprise op. (:

@arajkumar arajkumar changed the title fix: consider prometheus wal replay for ThanosSidecarUnhealthy alert fix ThanosSidecarUnhealthy alert during prometheus WAL replay and remove ThanosSidecarUnhealthy Aug 9, 2021
@arajkumar arajkumar changed the title fix ThanosSidecarUnhealthy alert during prometheus WAL replay and remove ThanosSidecarUnhealthy Fix ThanosSidecarUnhealthy alert during prometheus WAL replay and remove ThanosSidecarUnhealthy Aug 9, 2021
@arajkumar arajkumar requested a review from bwplotka August 9, 2021 09:52
CHANGELOG.md Outdated
@@ -18,6 +18,7 @@ We use *breaking :warning:* to mark changes that are not backward compatible (re

- [#4468](https://github.com/thanos-io/thanos/pull/4468) Rule: Fix temporary rule filename composition issue.
- [#4476](https://github.com/thanos-io/thanos/pull/4476) UI: fix incorrect html escape sequence used for '>' symbol.
- [#4508](https://github.com/thanos-io/thanos/pull/4508) Fix ThanosSidecarUnhealthy alert during prometheus WAL replay and remove ThanosSidecarUnhealthy.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This entry does not make sense (fix ThanosSidecarUnhealthy and remove ThanosSidecarUnhealthy) - what we try to say here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed it was a typo :'(

examples/alerts/alerts.yaml Show resolved Hide resolved
@bwplotka
Copy link
Member

What about alert name?

What about ThanosSidecarNoConnectionToStartedPrometheus?

@arajkumar arajkumar changed the title Fix ThanosSidecarUnhealthy alert during prometheus WAL replay and remove ThanosSidecarUnhealthy Fix ThanosSidecarUnhealthy alert during prometheus WAL replay and remove ThanosSidecarPrometheusDown Aug 10, 2021
@arajkumar
Copy link
Contributor Author

ThanosSidecarNoConnectionToStartedPrometheus

@bwplotka looks good to me.

@arajkumar arajkumar changed the title Fix ThanosSidecarUnhealthy alert during prometheus WAL replay and remove ThanosSidecarPrometheusDown Refactor sidecar alerts Aug 10, 2021
bwplotka
bwplotka previously approved these changes Aug 17, 2021
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Let's fix changelog, otherwise LGTM 💪🏽

CHANGELOG.md Outdated
@@ -18,6 +18,7 @@ We use *breaking :warning:* to mark changes that are not backward compatible (re

- [#4468](https://github.com/thanos-io/thanos/pull/4468) Rule: Fix temporary rule filename composition issue.
- [#4476](https://github.com/thanos-io/thanos/pull/4476) UI: fix incorrect html escape sequence used for '>' symbol.
- [#4508](https://github.com/thanos-io/thanos/pull/4508) Refactor sidecar alerts.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not useful changelog item, is it? (:

Copy link
Contributor Author

@arajkumar arajkumar Aug 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to have, how about "Refactor sidecar alerts(ThanosSidecarUnhealthy, ThanosSidecarPrometheusDown)"?

Since this PR does 3 things, found it bit hard to put a concise title :)

  1. fix ThanosSidecarUnhealthy
  2. removes ThanosSidecarPrometheusDown
  3. rename ThanosSidecarUnhealthy to ThanosSidecarNoConnectionToStartedPrometheus

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to be specific here, otherwise it's hard for user to learn quickly if they are impacted or not.

What about

Suggested change
- [#4508](https://github.com/thanos-io/thanos/pull/4508) Refactor sidecar alerts.
- [#4508](https://github.com/thanos-io/thanos/pull/4508) Adjusted and renamed `ThanosSidecarUnhealthy` to `ThanosSidecarNoConnectionToStartedPrometheus`; Removed `ThanosSidecarPrometheusDown`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bwplotka Applied your suggestion, PTAL.

examples/alerts/alerts.yaml Show resolved Hide resolved
@arajkumar arajkumar changed the title Refactor sidecar alerts Adjust and rename ThanosSidecarUnhealthy to ThanosSidecarNoConnectionToStartedPrometheus; Remove ThanosSidecarPrometheusDown Aug 26, 2021
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 more suggestions. Also please run make docs again - tests are failing.

Thanks!

expr: |
time() - max by (job, instance) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~".*thanos-sidecar.*"}) >= 240
time() - max by (pod, job, instance) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~".*thanos-sidecar.*"}) >= 240
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for challenging this last minute, but isn't expression with thanos_sidecar_prometheus_up simpler?

Suggested change
time() - max by (pod, job, instance) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~".*thanos-sidecar.*"}) >= 240
thanos_sidecar_prometheus_up{job=~".*thanos-sidecar.*"}) == 0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question :) AFAIK, thanos_sidecar_last_heartbeat_success_time_seconds and thanos_sidecar_prometheus_up carries the same information encoded in different ways. Let me evaluate and switch to thanos_sidecar_prometheus_up if it is the case.

Copy link
Contributor Author

@arajkumar arajkumar Sep 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bwplotka If we switch to thanos_sidecar_prometheus_up , We may not able to show the duration at which sidecar is not connected in the alert description. Hope that is fine.
e.g.

Thanos Sidecar thanos-sidecar-1 is unhealthy for more than 720 seconds.

pkg/rules/rules_test.go Show resolved Hide resolved
@arajkumar
Copy link
Contributor Author

@bwplotka I don't know, for some reason make check-docs never fails on machine but always fails on CI. Looks like something related to line ending on OSX vs Linux.

bwplotka
bwplotka previously approved these changes Sep 20, 2021
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM, just we need to move the changelog item now and add one more info.

CHANGELOG.md Outdated
@@ -28,6 +28,7 @@ We use *breaking :warning:* to mark changes that are not backward compatible (re
- [#4476](https://github.com/thanos-io/thanos/pull/4476) UI: fix incorrect html escape sequence used for '>' symbol.
- [#4532](https://github.com/thanos-io/thanos/pull/4532) Mixin: Fixed "all jobs" selector in thanos mixin dashboards.
- [#4607](https://github.com/thanos-io/thanos/pull/4607) Azure: Fix Azure MSI Rate Limit
- [#4508](https://github.com/thanos-io/thanos/pull/4508) Adjust and rename `ThanosSidecarUnhealthy` to `ThanosSidecarNoConnectionToStartedPrometheus`; Remove `ThanosSidecarPrometheusDown`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has to be moved now to the top of this change log (different release)

cmd/thanos/sidecar.go Show resolved Hide resolved
@arajkumar arajkumar changed the title Adjust and rename ThanosSidecarUnhealthy to ThanosSidecarNoConnectionToStartedPrometheus; Remove ThanosSidecarPrometheusDown Adjust and rename ThanosSidecarUnhealthy to ThanosSidecarNoConnectionToStartedPrometheus; Remove ThanosSidecarPrometheusDown; Remove thanos_sidecar_last_heartbeat_success_time_seconds metrics Sep 20, 2021
@arajkumar arajkumar changed the title Adjust and rename ThanosSidecarUnhealthy to ThanosSidecarNoConnectionToStartedPrometheus; Remove ThanosSidecarPrometheusDown; Remove thanos_sidecar_last_heartbeat_success_time_seconds metrics Adjust and rename ThanosSidecarUnhealthy to ThanosSidecarNoConnectionToStartedPrometheus; Remove unused ThanosSidecarPrometheusDown; Remove thanos_sidecar_last_heartbeat_success_time_seconds metrics Sep 20, 2021
@arajkumar arajkumar changed the title Adjust and rename ThanosSidecarUnhealthy to ThanosSidecarNoConnectionToStartedPrometheus; Remove unused ThanosSidecarPrometheusDown; Remove thanos_sidecar_last_heartbeat_success_time_seconds metrics Adjust and rename ThanosSidecarUnhealthy to ThanosSidecarNoConnectionToStartedPrometheus; Remove ThanosSidecarPrometheusDown; Remove unused thanos_sidecar_last_heartbeat_success_time_seconds metrics Sep 20, 2021
@arajkumar arajkumar changed the title Adjust and rename ThanosSidecarUnhealthy to ThanosSidecarNoConnectionToStartedPrometheus; Remove ThanosSidecarPrometheusDown; Remove unused thanos_sidecar_last_heartbeat_success_time_seconds metrics Adjust and rename ThanosSidecarUnhealthy to ThanosSidecarNoConnectionToStartedPrometheus; Remove ThanosSidecarPrometheusDown alert; Remove unused thanos_sidecar_last_heartbeat_success_time_seconds metrics Sep 20, 2021
arajkumar and others added 3 commits September 21, 2021 14:34
Prior to this fix, ThanosSidecarUnhealthy would fire even when
Prometheus is busy with WAL replay. This would trigger a false positive alert.

This PR considers prometheus_tsdb_data_replay_duration_seconds metric from
Prometheus for ThanosSidecarUnhealthy alert. In order to correlate
Thanos and Prometheus metrics we need to specify common label(s) which
can be confiured through thanosPrometheusCommonDimensions jsonnet
variable.

This PR also removes ThanosSidecarPrometheusDown as it would fire at the same as ThanosSidecarUnhealthy.

Fixes #3915.

Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>
…ometheus

Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>
…decar_prometheus_up

Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>
…_time_seconds metric

Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏽 Thanks!

@bwplotka bwplotka merged commit d5351b0 into thanos-io:main Sep 24, 2021
arajkumar added a commit to arajkumar/kube-prometheus that referenced this pull request Sep 27, 2021
This commit pulls latest changes form thanos mixins and sets `thanosPrometheusCommonDimensions`
to `namespace, pod` for k8s use case.

Refer thanos-io/thanos#4508 for more details.

Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>
arajkumar added a commit to arajkumar/kube-prometheus that referenced this pull request Sep 27, 2021
This commit pulls latest changes from thanos mixins and sets `thanosPrometheusCommonDimensions`
to `namespace, pod` for k8s use case.

Refer thanos-io/thanos#4508 for more details.

Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ThanosSidecarUnhealthy and ThanosSidecarPrometheusDown alerts fire during Prometheus WAL replay
2 participants