Proposal: Add counter metric to track alerts dropped for being outside of active/within muted time bounds #3512
Comments
I'm in favour of this -- but I think the name should be …
tjhop added a commit to tjhop/alertmanager that referenced this issue on Oct 20, 2023:

> Addresses: prometheus#3512
>
> This adds a new counter metric `alertmanager_alerts_supressed_total` that is incremented by `len(alerts)` when an alert is suppressed for being outside of a time_interval, ie inside of a mute_time_intervals or outside of an active_time_intervals.
>
> Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
tjhop added a commit to tjhop/alertmanager that referenced this issue on Nov 28, 2023:

> Addresses: prometheus#3512
>
> This adds a new counter metric `alertmanager_alerts_supressed_total` that is incremented by `len(alerts)` when an alert is suppressed for being outside of a time_interval, ie inside of a mute_time_intervals or outside of an active_time_intervals.
>
> Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
gotjosh added a commit that referenced this issue on Feb 13, 2024:

> …3565)
>
> * feat: add counter to track alerts dropped outside of time_intervals
>
>   Addresses: #3512
>
>   This adds a new counter metric `alertmanager_alerts_supressed_total` that is incremented by `len(alerts)` when an alert is suppressed for being outside of a time_interval, ie inside of a mute_time_intervals or outside of an active_time_intervals.
>
>   Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
>
> * test: add time interval suppression metric checks for notify
>
>   Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
>
> * test: fix failure message log values in notifier
>
>   Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
>
> * ref: address PR feedback for #3565
>
>   Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
>
> * fix: track suppressed notifications metric for inhibit/silence
>
>   Based on PR feedback: https://github.com/prometheus/alertmanager/pull/3565/files#r1393068026
>
>   Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
>
> * fix: broken notifier tests
>
>   - fixed metric count check to properly check the diff between input/output notifications from the suppression to compare to suppression metric, was previously inverted to compare to how many notifications it suppressed.
>   - stopped using `Reset()` to compare collection counts between the multiple stages that are executed in `TestMuteStageWithSilences()`. the intent was to compare a clean metric collection after each stage execution, but the final stage where all silences are lifted results in no metric being created in the test, causing `prom_testutil.ToFloat64()` to panic. changed to separate vars to check counts between each stage, with care to consider prior counts.
>
>   Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
>
> * rename metric and add constants
>
>   Signed-off-by: gotjosh <josue.abreu@gmail.com>
>
> ---------
>
> Signed-off-by: TJ Hoplock <t.hoplock@gmail.com>
> Signed-off-by: gotjosh <josue.abreu@gmail.com>
> Co-authored-by: gotjosh <josue.abreu@gmail.com>
th0th pushed a commit to th0th/alertmanager that referenced this issue on Mar 23, 2024, carrying the same squashed commit message (prometheus#3565, prometheus#3512), additionally signed off by Gokhan Sari <gokhan@sari.me>.
When using time intervals for routes, alerts received outside of the correct time window are dropped with a log message:
alertmanager/notify/notify.go, lines 841 to 846 in 6ce841c
alertmanager/notify/notify.go, lines 878 to 884 in 6ce841c
Imagine a NOC-style scenario: since alerts received outside of valid time intervals aren't retried and are outright dropped, it could be helpful for the NOC to have a count of how many alerts have been dropped, for tracking or meta-alerting purposes. For example, a high rate of dropped alerts could indicate a more severe issue and justify the NOC reaching out to subject matter experts, even outside of business hours.
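As a sketch of the meta-alerting idea, a Prometheus alerting rule could fire on a sustained rate of drops. The metric name below is the one proposed in this issue (not necessarily what was ultimately shipped), and the threshold and labels are placeholders:

```yaml
groups:
  - name: alertmanager-meta
    rules:
      - alert: HighDroppedAlertRate
        # Metric name as proposed here; adjust to the metric actually exposed.
        expr: rate(alertmanager_alerts_dropped_total[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Alertmanager is dropping many alerts outside active time intervals"
```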
Using the existing log lines, it is possible to, for example, create a recording rule in Loki and generate a metric from it. However, the log entries don't tell the user how many alerts were dropped.
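For illustration, a Loki recording rule along these lines could turn the log entries into a metric. The label selector and the matched message substring are assumptions that would need adjusting to the actual deployment and log line, and note this only counts log lines, not dropped alerts:

```yaml
groups:
  - name: alertmanager-logs
    rules:
      - record: alertmanager:muted_notifications:rate5m
        # Hypothetical selector and match string; align with your labels/logs.
        expr: sum(rate({job="alertmanager"} |= "Notifications not sent" [5m]))
```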
I'm proposing a new counter metric, `alertmanager_alerts_dropped_total`, that is incremented by the number of alerts dropped whenever a batch is dropped (i.e., `counterVar.Add(float64(len(alerts)))`).