
Proposal: Add counter metric to track alerts dropped for being outside of active/within muted time bounds #3512

Closed
tjhop opened this issue Sep 6, 2023 · 1 comment

Comments

tjhop (Contributor) commented Sep 6, 2023

When using time intervals for routes, alerts received outside of the allowed time window are dropped, with only a debug-level log message:

```go
// If the current time is inside a mute time, all alerts are removed from the pipeline.
if muted {
	level.Debug(l).Log("msg", "Notifications not sent, route is within mute time")
	return ctx, nil, nil
}
return ctx, alerts, nil

// If the current time is not inside an active time, all alerts are removed from the pipeline.
if !active {
	level.Debug(l).Log("msg", "Notifications not sent, route is not within active time")
	return ctx, nil, nil
}
return ctx, alerts, nil
```

Imagine a scenario where:

  1. there are one or more routes with time intervals set (e.g., engineering teams that only want alerts during business hours)
  2. there is a NOC or similar team available to triage alerts 24/7

Since alerts received outside of valid time intervals aren't retried and are outright dropped, it could be helpful for the NOC to have a count of how many alerts have been dropped, for tracking or meta-alerting purposes. For example, a high rate of dropped alerts could indicate a more severe issue and justify the NOC reaching out to subject matter experts, even outside of business hours.

Using the existing log lines, it is possible to, for example, create a recording rule in Loki and generate a metric from it. However, the log entries don't tell the user how many alerts were dropped.

I'm proposing that a new counter metric `alertmanager_alerts_dropped_total` be created and, whenever alerts are dropped, incremented by the number of dropped alerts (i.e., `counterVar.Add(float64(len(alerts)))`).
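
A minimal sketch of what that could look like, assuming the `prometheus/client_golang` library; the variable names, the helper, and the placement in the notification pipeline are illustrative only, not the actual Alertmanager code:

```go
package notify

import "github.com/prometheus/client_golang/prometheus"

// alertsDropped counts alerts removed from the pipeline because a route's
// time intervals suppressed notifications. (Illustrative name, not taken
// from the Alertmanager code base.)
var alertsDropped = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "alertmanager_alerts_dropped_total",
	Help: "Total number of alerts dropped for being outside of a route's time intervals.",
})

func init() {
	prometheus.MustRegister(alertsDropped)
}

// recordDropped would be called from the mute/active checks shown above,
// just before the stage returns a nil alert slice, e.g. recordDropped(len(alerts)).
func recordDropped(n int) {
	alertsDropped.Add(float64(n))
}
```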

gotjosh (Member) commented Oct 19, 2023

I'm in favour of this -- but I think the name should be `alertmanager_alerts_suppressed_total`, as this is the terminology we use when an alert is muted.

tjhop added a commit to tjhop/alertmanager that referenced this issue Oct 20, 2023
Addresses: prometheus#3512

This adds a new counter metric `alertmanager_alerts_supressed_total`
that is incremented by `len(alerts)` when an alert is suppressed for
being outside of a time_interval, ie inside of a mute_time_intervals or
outside of an active_time_intervals.

Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
tjhop added a commit to tjhop/alertmanager that referenced this issue Nov 28, 2023
gotjosh added a commit that referenced this issue Feb 13, 2024
…3565)

* feat: add counter to track alerts dropped outside of time_intervals

Addresses: #3512

This adds a new counter metric `alertmanager_alerts_supressed_total`
that is incremented by `len(alerts)` when an alert is suppressed for
being outside of a time_interval, ie inside of a mute_time_intervals or
outside of an active_time_intervals.

Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>

* test: add time interval suppression metric checks for notify

Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>

* test: fix failure message log values in notifier

Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>

* ref: address PR feedback for #3565

Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>

* fix: track suppressed notifications metric for inhibit/silence

Based on PR feedback:

https://github.com/prometheus/alertmanager/pull/3565/files#r1393068026

Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>

* fix: broken notifier tests

- Fixed the metric count check to compare the difference between the stage's input and output notifications against the suppression metric; it was previously inverted.
- Stopped using `Reset()` to compare collection counts between the multiple stages executed in `TestMuteStageWithSilences()`. The intent was to compare a clean metric collection after each stage execution, but the final stage, where all silences are lifted, creates no metric in the test, causing `prom_testutil.ToFloat64()` to panic. Changed to separate variables to check counts between stages, taking prior counts into account.

Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>

* rename metric and add constants

Signed-off-by: gotjosh <josue.abreu@gmail.com>

---------

Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Co-authored-by: gotjosh <josue.abreu@gmail.com>
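
Judging from the commit messages above ("rename metric and add constants", "track suppressed notifications metric for inhibit/silence"), the merged change appears to be a counter vector partitioned by a suppression reason. A rough sketch of that shape follows; the identifier names, metric name, and label values are assumptions, not the exact merged code:

```go
package notify

import "github.com/prometheus/client_golang/prometheus"

// Possible reason label values for suppressed notifications (assumed, not
// necessarily the constants added in the merged PR).
const (
	reasonSilence            = "silence"
	reasonInhibition         = "inhibition"
	reasonMuteTimeInterval   = "mute_time_interval"
	reasonActiveTimeInterval = "active_time_interval"
)

// notificationsSuppressed is a counter vector partitioned by suppression
// reason. The metric name is the one discussed in this issue; the merged
// name may differ.
var notificationsSuppressed = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "alertmanager_alerts_suppressed_total",
	Help: "Total number of notifications suppressed by silences, inhibition, or time intervals.",
}, []string{"reason"})

func init() {
	prometheus.MustRegister(notificationsSuppressed)
}

// A stage that suppresses alerts would then do, for example:
//	notificationsSuppressed.WithLabelValues(reasonMuteTimeInterval).Add(float64(len(alerts)))
```

In tests, a single child of such a vector can be read with `testutil.ToFloat64(notificationsSuppressed.WithLabelValues(reasonMuteTimeInterval))` from `prometheus/client_golang/prometheus/testutil`; `ToFloat64` panics when the collector yields no metric, which matches the behaviour described in the "fix: broken notifier tests" entry above.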
th0th pushed a commit to th0th/alertmanager that referenced this issue Mar 23, 2024
@tjhop tjhop closed this as completed May 19, 2024