Fix race condition in dispatch.go #3826

grobinson-grafana · 2024-05-01T10:27:59Z

This commit fixes a race condition in dispatch.go that would cause a firing alert to be deleted from the aggregation group when instead it should have been flushed.

The root cause is a race condition that can occur when dispatch.go deletes resolved alerts from the aggregation group following a successful notification. If a firing alert with the same fingerprint is added back to the aggregation group at the same time then the firing alert can be deleted.

The issue was first reported in #3006.

How to observe the race condition

git checkout v0.27.0
Add the following sleep to dispatch.go. This makes it easier to observe the race condition because the scheduler can resume the test between ag.alerts.Get(fp) and ag.alerts.Delete(fp), during which it should replace the resolved alert with a firing alert.

diff --git a/dispatch/dispatch.go b/dispatch/dispatch.go
index 640b22ab..1eafa054 100644
--- a/dispatch/dispatch.go
+++ b/dispatch/dispatch.go
@@ -521,6 +521,7 @@ func (ag *aggrGroup) flush(notify func(...*types.Alert) bool) {
                        // again since we notified about it.
                        fp := a.Fingerprint()
                        got, err := ag.alerts.Get(fp)
+                       <-time.After(time.Millisecond)
                        if err != nil {
                                // This should never happen.
                                level.Error(ag.logger).Log("msg", "failed to get alert", "err", err, "alert", a.String())

Add and run the following test to dispatch_test.go. It should fail:

type bufferedStage struct {
	c chan struct{}
}

func (s *bufferedStage) Exec(ctx context.Context, _ log.Logger, _ ...*types.Alert) (context.Context, []*types.Alert, error) {
	s.c <- struct{}{}
	return ctx, nil, nil
}

func TestDispatcherRaceFiringAlertDeleted(t *testing.T) {
	r := Route{RouteOpts: RouteOpts{GroupWait: 0, GroupInterval: time.Second}}
	s := bufferedStage{c: make(chan struct{}, 1)}
	logger := log.NewLogfmtLogger(os.Stdout)
	marker := types.NewMarker(prometheus.NewRegistry())
	alerts, err := mem.NewAlerts(context.Background(), marker, time.Hour, nil, logger, nil)
	require.NoError(t, err)
	defer alerts.Close()
	timeout := func(d time.Duration) time.Duration { return time.Duration(0) }
	dispatcher := NewDispatcher(alerts, &r, &s, marker, timeout, nil, logger, NewDispatcherMetrics(false, prometheus.NewRegistry()))
	go dispatcher.Run()
	defer dispatcher.Stop()

	// Add a new firing alert and wait for it to be flushed.
	a1 := newAlert(model.LabelSet{"foo": "bar"})
	a1.StartsAt = time.Now().Add(-time.Second)
	a1.UpdatedAt = a1.StartsAt
	require.NoError(t, alerts.Put(a1))
	<-s.c

	// Change the firing alert to resolved and wait for it to be flushed too.
	a2 := newAlert(model.LabelSet{"foo": "bar"})
	a2.StartsAt = a1.StartsAt
	a2.UpdatedAt = a1.UpdatedAt
	a2.EndsAt = time.Now()
	require.NoError(t, alerts.Put(a2))
	<-s.c

	// Change the resolved alert back to firing. If the race condition is still
	// present, it will be deleted instead of flushed.
	a3 := newAlert(model.LabelSet{"foo": "bar"})
	a3.StartsAt = time.Now()
	a3.UpdatedAt = time.Now()
	require.NoError(t, alerts.Put(a3))
	select {
	case <-s.c:
	case <-time.After(2 * time.Second):
		t.Fatal()
	}
}

How to observe the fix

Check out this branch, add the sleep and re-run the same test. It should pass.

diff --git a/dispatch/dispatch.go b/dispatch/dispatch.go
index d78fbf68..efb0c246 100644
--- a/dispatch/dispatch.go
+++ b/dispatch/dispatch.go
@@ -517,6 +517,7 @@ func (ag *aggrGroup) flush(notify func(...*types.Alert) bool) {

        if notify(alertsSlice...) {
                for _, a := range alertsSlice {
+                       <-time.After(time.Millisecond)
                        // Only delete if the fingerprint has not been inserted again since we notified about it.
                        if err := ag.alerts.DeleteIf(a.Fingerprint(), func(b *types.Alert) bool {
                                return a.Resolved() && a.UpdatedAt == b.UpdatedAt

This commit fixes a race condition in dispatch.go that would cause a firing alert to be deleted from the aggregation group when instead it should have been flushed. The root cause is a race condition that can occur when dispatch.go deletes resolved alerts from the aggregation group following a successful notification. If a firing alert with the same fingerprint is added back to the aggregation group at the same time then the firing alert can be deleted. Signed-off-by: George Robinson <george.robinson@grafana.com>

Signed-off-by: George Robinson <george.robinson@grafana.com>

store/store.go

zecke · 2024-05-01T14:07:11Z

store/store_test.go

+	alert := &types.Alert{
+		UpdatedAt: time.Now(),
+	}
+	require.NoError(t, a.Set(alert))


Should we add another alert and verify it isn't deleted (by accident)?

zecke · 2024-05-01T14:07:55Z

LGTM. Definitely closes a time of check/time of use kind of problem. Now Set/DeleteIf/gc are protected by the same mutex.

Signed-off-by: George Robinson <george.robinson@grafana.com>

gotjosh

LGTM, but please see my comments.

Great job @grobinson-grafana !

gotjosh · 2024-05-02T10:08:36Z

dispatch/dispatch.go


 	if notify(alertsSlice...) {
 		for _, a := range alertsSlice {
-			// Only delete if the fingerprint has not been inserted


[nit] How do you feel about making the whole operation atomic? You'd only take the lock once guaranteed that all the alerts are cleared in one go. I feel this makes it easier to reason about it as there's no use case where an alert could get updated once we start checking.

diff --git a/dispatch/dispatch.go b/dispatch/dispatch.go index d78fbf68..6e44fbc2 100644 --- a/dispatch/dispatch.go +++ b/dispatch/dispatch.go @@ -516,14 +516,7 @@ func (ag *aggrGroup) flush(notify func(...*types.Alert) bool) { level.Debug(ag.logger).Log("msg", "flushing", "alerts", fmt.Sprintf("%v", alertsSlice)) if notify(alertsSlice...) { - for _, a := range alertsSlice { - // Only delete if the fingerprint has not been inserted again since we notified about it. - if err := ag.alerts.DeleteIf(a.Fingerprint(), func(b *types.Alert) bool { - return a.Resolved() && a.UpdatedAt == b.UpdatedAt - }); err != nil { - level.Error(ag.logger).Log("msg", "error on delete alert", "err", err, "alert", a.String()) - } - } + ag.alerts.DeleteResolved(alertsSlice) } } diff --git a/store/store.go b/store/store.go index bffa4eb6..4110a971 100644 --- a/store/store.go +++ b/store/store.go @@ -154,3 +154,19 @@ func (a *Alerts) Empty() bool { return len(a.c) == 0 } + +func (a *Alerts) DeleteResolved(slice types.AlertSlice) { + a.Lock() + defer a.Unlock() + + for _, as := range slice { + fp := as.Fingerprint() + alert, ok := a.c[as.Fingerprint()] + if !ok { + } + + if as.Resolved() && as.UpdatedAt == alert.UpdatedAt { + delete(a.c, fp) + } + } +}

I quite like the condition being a closure, so what if we had something like:

func (a *Alerts) DeleteFingerprints(fp []model.Fingerprint, fn func(int, *types.Alert) bool) error { ... }

I did think about adding something like DeleteResolved but with fingerprints:

func (a *Alerts) DeleteResolved(fp []model.Fingerprint) error { ... }

but I decided I didn't like the semantics of it because it brought business logic from the dispatcher [1] into the store, and I felt the store should be agnostic of that kind of business logic.

[1] For example:

The name DeleteResolved felt confusing as you would expect a store with a method called DeleteResolved to delete all resolved alerts it knows about, not the resolved alerts the caller knows about (it doesn't delete all resolved alerts, just those in the argument).

If the caller thinks the alert is firing, but the store thinks the alert is resolved, then DeleteResolved will not delete the alert. This feels weird given the method is called DeleteResolved.

A resolved alert is deleted iff UpdatedAt hasn't changed since the last flush (but why does a generic store care about flushes?).

What do you think?

That's a fair point.

I'm OK with:

func (a *Alerts) DeleteFingerprints(fp []model.Fingerprint, fn func(int, *types.Alert) bool) error {

As what I care about is the atomicity of the operation.

I think your point about logic is valid but they both share the same types anyways 🤷 so I kind feel that where the logic goes doesn't matter to me as much.

In the end I couldn't make DeleteFingerprints as clean as I wanted it to be so instead I've taken a different, although still generic approach, where I've replaced DeleteIf with DeleteIfNotModified. I then use resolvedSlice from dispatch.go as the argument.

gotjosh · 2024-05-02T10:10:02Z

store/store.go

 	delete(a.c, fp)
 	return nil
 }



If we chose to go with my approach or your approach - we'd need to remove the Delete method as it is no longer used.

I think the same holds true for Get, which is also used in store but I don't see it used anywhere else (but tests).

I don't have a strong opinion here, and I'm happy to remove Get and Delete, but I did just want to raise a counter-argument for this to see if we still think it makes sense.

I understand that Get and Delete will be un-used once we merge this PR, but does that mean we should delete them? These operations seem kind of important for a store interface, even if we don't make use of them today. It's more of a question of semantics rather than pragmatics, but I feel it's important to at least discuss it.

I'm a simple man - I see code that's not used I delete it. These operations are simple enough that they can be introduced at any time, and in my experience, "just-in-case methods" tend to cause more confusion than good.

That being said - I'm also not fussed, you can take your pick here.

OK! 🙂 Get and Delete have been deleted! 👍

Oops! It turns out I cannot remove Get because it's used in provider/mem/mem.go.

Yeah, but if you look at the places where that Get is used you'll realise it's only in tests.

It's also used on Line 207 of provider/mem/mem.go which isn't a test?

Sorry, I could have been clearer. I meant the Get on this file.

diff --git a/provider/mem/mem.go b/provider/mem/mem.go index cfc3bfc3..b84133e4 100644 --- a/provider/mem/mem.go +++ b/provider/mem/mem.go @@ -20,12 +20,10 @@ import ( "github.com/go-kit/log" "github.com/go-kit/log/level" - "github.com/prometheus/client_golang/prometheus" - "github.com/prometheus/common/model" - "github.com/prometheus/alertmanager/provider" "github.com/prometheus/alertmanager/store" "github.com/prometheus/alertmanager/types" + "github.com/prometheus/client_golang/prometheus" ) const alertChannelLength = 200 @@ -190,11 +188,6 @@ func (a *Alerts) GetPending() provider.AlertIterator { return provider.NewAlertIterator(ch, done, nil) } -// Get returns the alert for a given fingerprint. -func (a *Alerts) Get(fp model.Fingerprint) (*types.Alert, error) { - return a.alerts.Get(fp) -} - // Put adds the given alert to the set. func (a *Alerts) Put(alerts ...*types.Alert) error { for _, alert := range alerts {

Discussed this offline, we can't remove it because due to the interface this struct implements.

This commit replaces DeleteIf with DeleteIfNotModified as this allows dispatch.go to delete all resolved alerts in a single atomic operation instead of having to acquire and release the store mutex one alert at a time. It also extends the test coverage in dispatch_test.go to make sure that just the resolved alerts are deleted from the aggregation group on flush. Signed-off-by: George Robinson <george.robinson@grafana.com>

grobinson-grafana · 2024-05-02T15:43:25Z

dispatch/dispatch_test.go

-	// Resolve all alerts, they should be removed after the next batch was sent.
-	a1r, a2r, a3r := *a1, *a2, *a3
-	resolved := types.AlertSlice{&a1r, &a2r, &a3r}
+	// Resolve an alert, and it should be removed after the next batch was sent.


I wanted to add this additional test for completeness.

gotjosh

LGTM

* Fix race condition in dispatch.go This commit fixes a race condition in dispatch.go that would cause a firing alert to be deleted from the aggregation group when instead it should have been flushed. The root cause is a race condition that can occur when dispatch.go deletes resolved alerts from the aggregation group following a successful notification. If a firing alert with the same fingerprint is added back to the aggregation group at the same time then the firing alert can be deleted. --------- Signed-off-by: George Robinson <george.robinson@grafana.com>

* [CHANGE] Deprecate and remove api/v1/ #2970 * [CHANGE] Remove unused feature flags #3676 * [CHANGE] Newlines in smtp password file are now ignored #3681 * [CHANGE] Change compat metrics to counters #3686 * [CHANGE] Do not register compat metrics in amtool #3713 * [CHANGE] Remove metrics from compat package #3714 * [CHANGE] Mark muted alerts #3793 * [FEATURE] Add metric for inhibit rules #3681 * [FEATURE] Support UTF-8 label matchers #3453, #3507, #3523, #3483, #3567, #3568, #3569, #3571, #3595, #3604, #3619, #3658, #3659, #3662, #3668, 3572 * [FEATURE] Add counter to track alerts dropped outside of time_intervals #3565 * [FEATURE] Add date and tz functions to templates #3812 * [FEATURE] Add limits for silences #3852 * [FEATURE] Add time helpers for templates #3863 * [FEATURE] Add auto GOMAXPROCS #3837 * [FEATURE] Add auto GOMEMLIMIT #3895 * [FEATURE] Add Jira receiver integration #3590 * [ENHANCEMENT] Add the receiver name to notification metrics #3045 * [ENHANCEMENT] Add the route ID to uuid #3372 * [ENHANCEMENT] Add duration to the notify success message #3559 * [ENHANCEMENT] Implement webhook_url_file for discord and msteams #3555 * [ENHANCEMENT] Add debug logs for muted alerts #3558 * [ENHANCEMENT] API: Allow the Silences API to use their own 400 response #3610 * [ENHANCEMENT] Add summary to msteams notification #3616 * [ENHANCEMENT] Add context reasons to notifications failed counter #3631 * [ENHANCEMENT] Add optional native histogram support to latency metrics #3737 * [ENHANCEMENT] Enable setting ThreadId for Telegram notifications #3638 * [ENHANCEMENT] Allow webex roomID from template #3801 * [BUGFIX] Add missing integrations to notify metrics #3480 * [BUGFIX] Add missing ttl in pushhover #3474 * [BUGFIX] Fix scheme required for webhook url in amtool #3409 * [BUGFIX] Remove duplicate integration from metrics #3516 * [BUGFIX] Reflect Discord's max length message limits #3597 * [BUGFIX] Fix nil error in warn logs about incompatible matchers #3683 * [BUGFIX] Fix a small number of inconsistencies in compat package logging #3718 * [BUGFIX] Fix log line in featurecontrol #3719 * [BUGFIX] Fix panic in acceptance tests #3592 * [BUGFIX] Fix flaky test TestClusterJoinAndReconnect/TestTLSConnection #3722 * [BUGFIX] Fix crash on errors when url_file is used #3800 * [BUGFIX] Fix race condition in dispatch.go #3826 * [BUGFIX] Fix race conditions in the memory alerts store #3648 * [BUGFIX] Hide config.SecretURL when the URL is incorrect. #3887 * [BUGFIX] Fix invalid silence causes incomplete updates #3898 * [BUGFIX] Fix leaking of Silences matcherCache entries #3930 * [BUGFIX] Close SMTP submission correctly to handle errors #4006 Signed-off-by: SuperQ <superq@gmail.com>

* Release v0.28.0-rc.0 * [CHANGE] Templating errors in the SNS integration now return an error. #3531 #3879 * [FEATURE] Add a new Microsoft Teams integration based on Flows #4024 * [FEATURE] Add a new Rocket.Chat integration #3600 * [FEATURE] Add a new Jira integration #3590 #3931 * [FEATURE] Add support for `GOMEMLIMIT`, enable it via the feature flag `--enable-feature=auto-gomemlimit`. #3895 * [FEATURE] Add support for `GOMAXPROCS`, enable it via the feature flag `--enable-feature=auto-gomaxprocs`. #3837 * [FEATURE] Add support for limits of silences including the maximum number of active and pending silences, and the maximum size per silence (in bytes). You can use the flags `--silences.max-silences` and `--silences.max-silence-size-bytes` to set them accordingly #3852 #3862 #3866 #3885 #3886 #3877 * [FEATURE] Muted alerts now show whether they are suppressed or not in both the `/api/v2/alerts` endpoint and the Alertmanager UI. #3793 #3797 #3792 * [ENHANCEMENT] Add support for `content`, `username` and `avatar_url` in the Discord integration. `content` and `username` also support templating. #4007 * [ENHANCEMENT] Only invalidate the silences cache if a new silence is created or an existing silence replaced - should improve latency on both `GET api/v2/alerts` and `POST api/v2/alerts` API endpoint. #3961 * [ENHANCEMENT] Add image source label to Dockerfile. To get changelogs shown when using Renovate #4062 * [ENHANCEMENT] Build using go 1.23 #4071 * [ENHANCEMENT] Support setting a global SMTP TLS configuration. #3732 * [ENHANCEMENT] The setting `room_id` in the WebEx integration can now be templated to allow for dynamic room IDs. #3801 * [ENHANCEMENT] Enable setting `message_thread_id` for the Telegram integration. #3638 * [ENHANCEMENT] Support the `since` and `humanizeDuration` functions to templates. This means users can now format time to more human-readable text. #3863 * [ENHANCEMENT] Support the `date` and `tz` functions to templates. This means users can now format time in a specified format and also change the timezone to their specific locale. #3812 * [ENHANCEMENT] Latency metrics now support native histograms. #3737 * [BUGFIX] Fix the SMTP integration not correctly closing an SMTP submission, which may lead to unsuccessful dispatches being marked as successful. #4006 * [BUGFIX] The `ParseMode` option is now set explicitly in the Telegram integration. If we don't HTML tags had not been parsed by default. #4027 * [BUGFIX] Fix a memory leak that was caused by updates silences continuously. #3930 * [BUGFIX] Fix hiding secret URLs when the URL is incorrect. #3887 * [BUGFIX] Fix a race condition in the alerts - it was more of a hypothetical race condition that could have occurred in the alert reception pipeline. #3648 * [BUGFIX] Fix a race condition in the alert delivery pipeline that would cause a firing alert that was delivered earlier to be deleted from the aggregation group when instead it should have been delivered again. #3826 * [BUGFIX] Fix version in APIv1 deprecation notice. #3815 * [BUGFIX] Fix crash errors when using `url_file` in the Webhook integration. #3800 * [BUGFIX] fix `Route.ID()` returns conflicting IDs. #3803 * [BUGFIX] Fix deadlock on the alerts memory store. #3715 * [BUGFIX] Fix `amtool template render` when using the default values. #3725 * [BUGFIX] Fix `webhook_url_file` for both the Discord and Microsoft Teams integrations. #3728 #3745 --------- Signed-off-by: SuperQ <superq@gmail.com> Signed-off-by: gotjosh <josue.abreu@gmail.com> Co-authored-by: gotjosh <josue.abreu@gmail.com>

* [CHANGE] Templating errors in the SNS integration now return an error. #3531 #3879 * [CHANGE] Adopt log/slog, drop go-kit/log #4089 * [FEATURE] Add a new Microsoft Teams integration based on Flows #4024 * [FEATURE] Add a new Rocket.Chat integration #3600 * [FEATURE] Add a new Jira integration #3590 #3931 * [FEATURE] Add support for `GOMEMLIMIT`, enable it via the feature flag `--enable-feature=auto-gomemlimit`. #3895 * [FEATURE] Add support for `GOMAXPROCS`, enable it via the feature flag `--enable-feature=auto-gomaxprocs`. #3837 * [FEATURE] Add support for limits of silences including the maximum number of active and pending silences, and the maximum size per silence (in bytes). You can use the flags `--silences.max-silences` and `--silences.max-silence-size-bytes` to set them accordingly #3852 #3862 #3866 #3885 #3886 #3877 * [FEATURE] Muted alerts now show whether they are suppressed or not in both the `/api/v2/alerts` endpoint and the Alertmanager UI. #3793 #3797 #3792 * [ENHANCEMENT] Add support for `content`, `username` and `avatar_url` in the Discord integration. `content` and `username` also support templating. #4007 * [ENHANCEMENT] Only invalidate the silences cache if a new silence is created or an existing silence replaced - should improve latency on both `GET api/v2/alerts` and `POST api/v2/alerts` API endpoint. #3961 * [ENHANCEMENT] Add image source label to Dockerfile. To get changelogs shown when using Renovate #4062 * [ENHANCEMENT] Build using go 1.23 #4071 * [ENHANCEMENT] Support setting a global SMTP TLS configuration. #3732 * [ENHANCEMENT] The setting `room_id` in the WebEx integration can now be templated to allow for dynamic room IDs. #3801 * [ENHANCEMENT] Enable setting `message_thread_id` for the Telegram integration. #3638 * [ENHANCEMENT] Support the `since` and `humanizeDuration` functions to templates. This means users can now format time to more human-readable text. #3863 * [ENHANCEMENT] Support the `date` and `tz` functions to templates. This means users can now format time in a specified format and also change the timezone to their specific locale. #3812 * [ENHANCEMENT] Latency metrics now support native histograms. #3737 * [ENHANCEMENT] Add timeout option for webhook notifier. #4137 * [BUGFIX] Fix the SMTP integration not correctly closing an SMTP submission, which may lead to unsuccessful dispatches being marked as successful. #4006 * [BUGFIX] The `ParseMode` option is now set explicitly in the Telegram integration. If we don't HTML tags had not been parsed by default. #4027 * [BUGFIX] Fix a memory leak that was caused by updates silences continuously. #3930 * [BUGFIX] Fix hiding secret URLs when the URL is incorrect. #3887 * [BUGFIX] Fix a race condition in the alerts - it was more of a hypothetical race condition that could have occurred in the alert reception pipeline. #3648 * [BUGFIX] Fix a race condition in the alert delivery pipeline that would cause a firing alert that was delivered earlier to be deleted from the aggregation group when instead it should have been delivered again. #3826 * [BUGFIX] Fix version in APIv1 deprecation notice. #3815 * [BUGFIX] Fix crash errors when using `url_file` in the Webhook integration. #3800 * [BUGFIX] fix `Route.ID()` returns conflicting IDs. #3803 * [BUGFIX] Fix deadlock on the alerts memory store. #3715 * [BUGFIX] Fix `amtool template render` when using the default values. #3725 * [BUGFIX] Fix `webhook_url_file` for both the Discord and Microsoft Teams integrations. #3728 #3745 * [BUGFIX] Fix wechat api link #4084 * [BUGFIX] Fix build info metric #4166 Signed-off-by: SuperQ <superq@gmail.com>

* [CHANGE] Templating errors in the SNS integration now return an error. #3531 #3879 * [CHANGE] Adopt log/slog, drop go-kit/log #4089 * [FEATURE] Add a new Microsoft Teams integration based on Flows #4024 * [FEATURE] Add a new Rocket.Chat integration #3600 * [FEATURE] Add a new Jira integration #3590 #3931 * [FEATURE] Add support for `GOMEMLIMIT`, enable it via the feature flag `--enable-feature=auto-gomemlimit`. #3895 * [FEATURE] Add support for `GOMAXPROCS`, enable it via the feature flag `--enable-feature=auto-gomaxprocs`. #3837 * [FEATURE] Add support for limits of silences including the maximum number of active and pending silences, and the maximum size per silence (in bytes). You can use the flags `--silences.max-silences` and `--silences.max-silence-size-bytes` to set them accordingly #3852 #3862 #3866 #3885 #3886 #3877 * [FEATURE] Muted alerts now show whether they are suppressed or not in both the `/api/v2/alerts` endpoint and the Alertmanager UI. #3793 #3797 #3792 * [ENHANCEMENT] Add support for `content`, `username` and `avatar_url` in the Discord integration. `content` and `username` also support templating. #4007 * [ENHANCEMENT] Only invalidate the silences cache if a new silence is created or an existing silence replaced - should improve latency on both `GET api/v2/alerts` and `POST api/v2/alerts` API endpoint. #3961 * [ENHANCEMENT] Add image source label to Dockerfile. To get changelogs shown when using Renovate #4062 * [ENHANCEMENT] Build using go 1.23 #4071 * [ENHANCEMENT] Support setting a global SMTP TLS configuration. #3732 * [ENHANCEMENT] The setting `room_id` in the WebEx integration can now be templated to allow for dynamic room IDs. #3801 * [ENHANCEMENT] Enable setting `message_thread_id` for the Telegram integration. #3638 * [ENHANCEMENT] Support the `since` and `humanizeDuration` functions to templates. This means users can now format time to more human-readable text. #3863 * [ENHANCEMENT] Support the `date` and `tz` functions to templates. This means users can now format time in a specified format and also change the timezone to their specific locale. #3812 * [ENHANCEMENT] Latency metrics now support native histograms. #3737 * [ENHANCEMENT] Add full width to adaptive card for msteamsv2 #4135 * [ENHANCEMENT] Add timeout option for webhook notifier. #4137 * [ENHANCEMENT] Update config to allow showing secret values when marshaled #4158 * [ENHANCEMENT] Enable templating for Jira project and issue_type #4159 * [BUGFIX] Fix the SMTP integration not correctly closing an SMTP submission, which may lead to unsuccessful dispatches being marked as successful. #4006 * [BUGFIX] The `ParseMode` option is now set explicitly in the Telegram integration. If we don't HTML tags had not been parsed by default. #4027 * [BUGFIX] Fix a memory leak that was caused by updates silences continuously. #3930 * [BUGFIX] Fix hiding secret URLs when the URL is incorrect. #3887 * [BUGFIX] Fix a race condition in the alerts - it was more of a hypothetical race condition that could have occurred in the alert reception pipeline. #3648 * [BUGFIX] Fix a race condition in the alert delivery pipeline that would cause a firing alert that was delivered earlier to be deleted from the aggregation group when instead it should have been delivered again. #3826 * [BUGFIX] Fix version in APIv1 deprecation notice. #3815 * [BUGFIX] Fix crash errors when using `url_file` in the Webhook integration. #3800 * [BUGFIX] fix `Route.ID()` returns conflicting IDs. #3803 * [BUGFIX] Fix deadlock on the alerts memory store. #3715 * [BUGFIX] Fix `amtool template render` when using the default values. #3725 * [BUGFIX] Fix `webhook_url_file` for both the Discord and Microsoft Teams integrations. #3728 #3745 * [BUGFIX] Fix wechat api link #4084 * [BUGFIX] Fix build info metric #4166 * [BUGFIX] Fix UTF-8 not allowed in Equal field for inhibition rules #4177 Signed-off-by: SuperQ <superq@gmail.com>

* [CHANGE] Templating errors in the SNS integration now return an error. prometheus#3531 prometheus#3879 * [CHANGE] Adopt log/slog, drop go-kit/log prometheus#4089 * [FEATURE] Add a new Microsoft Teams integration based on Flows prometheus#4024 * [FEATURE] Add a new Rocket.Chat integration prometheus#3600 * [FEATURE] Add a new Jira integration prometheus#3590 prometheus#3931 * [FEATURE] Add support for `GOMEMLIMIT`, enable it via the feature flag `--enable-feature=auto-gomemlimit`. prometheus#3895 * [FEATURE] Add support for `GOMAXPROCS`, enable it via the feature flag `--enable-feature=auto-gomaxprocs`. prometheus#3837 * [FEATURE] Add support for limits of silences including the maximum number of active and pending silences, and the maximum size per silence (in bytes). You can use the flags `--silences.max-silences` and `--silences.max-silence-size-bytes` to set them accordingly prometheus#3852 prometheus#3862 prometheus#3866 prometheus#3885 prometheus#3886 prometheus#3877 * [FEATURE] Muted alerts now show whether they are suppressed or not in both the `/api/v2/alerts` endpoint and the Alertmanager UI. prometheus#3793 prometheus#3797 prometheus#3792 * [ENHANCEMENT] Add support for `content`, `username` and `avatar_url` in the Discord integration. `content` and `username` also support templating. prometheus#4007 * [ENHANCEMENT] Only invalidate the silences cache if a new silence is created or an existing silence replaced - should improve latency on both `GET api/v2/alerts` and `POST api/v2/alerts` API endpoint. prometheus#3961 * [ENHANCEMENT] Add image source label to Dockerfile. To get changelogs shown when using Renovate prometheus#4062 * [ENHANCEMENT] Build using go 1.23 prometheus#4071 * [ENHANCEMENT] Support setting a global SMTP TLS configuration. prometheus#3732 * [ENHANCEMENT] The setting `room_id` in the WebEx integration can now be templated to allow for dynamic room IDs. prometheus#3801 * [ENHANCEMENT] Enable setting `message_thread_id` for the Telegram integration. prometheus#3638 * [ENHANCEMENT] Support the `since` and `humanizeDuration` functions to templates. This means users can now format time to more human-readable text. prometheus#3863 * [ENHANCEMENT] Support the `date` and `tz` functions to templates. This means users can now format time in a specified format and also change the timezone to their specific locale. prometheus#3812 * [ENHANCEMENT] Latency metrics now support native histograms. prometheus#3737 * [ENHANCEMENT] Add full width to adaptive card for msteamsv2 prometheus#4135 * [ENHANCEMENT] Add timeout option for webhook notifier. prometheus#4137 * [ENHANCEMENT] Update config to allow showing secret values when marshaled prometheus#4158 * [ENHANCEMENT] Enable templating for Jira project and issue_type prometheus#4159 * [BUGFIX] Fix the SMTP integration not correctly closing an SMTP submission, which may lead to unsuccessful dispatches being marked as successful. prometheus#4006 * [BUGFIX] The `ParseMode` option is now set explicitly in the Telegram integration. If we don't HTML tags had not been parsed by default. prometheus#4027 * [BUGFIX] Fix a memory leak that was caused by updates silences continuously. prometheus#3930 * [BUGFIX] Fix hiding secret URLs when the URL is incorrect. prometheus#3887 * [BUGFIX] Fix a race condition in the alerts - it was more of a hypothetical race condition that could have occurred in the alert reception pipeline. prometheus#3648 * [BUGFIX] Fix a race condition in the alert delivery pipeline that would cause a firing alert that was delivered earlier to be deleted from the aggregation group when instead it should have been delivered again. prometheus#3826 * [BUGFIX] Fix version in APIv1 deprecation notice. prometheus#3815 * [BUGFIX] Fix crash errors when using `url_file` in the Webhook integration. prometheus#3800 * [BUGFIX] fix `Route.ID()` returns conflicting IDs. prometheus#3803 * [BUGFIX] Fix deadlock on the alerts memory store. prometheus#3715 * [BUGFIX] Fix `amtool template render` when using the default values. prometheus#3725 * [BUGFIX] Fix `webhook_url_file` for both the Discord and Microsoft Teams integrations. prometheus#3728 prometheus#3745 * [BUGFIX] Fix wechat api link prometheus#4084 * [BUGFIX] Fix build info metric prometheus#4166 * [BUGFIX] Fix UTF-8 not allowed in Equal field for inhibition rules prometheus#4177 Signed-off-by: SuperQ <superq@gmail.com> Signed-off-by: heartwilltell <heartwilltell@gmail.com>

* Fix race condition in dispatch.go This commit fixes a race condition in dispatch.go that would cause a firing alert to be deleted from the aggregation group when instead it should have been flushed. The root cause is a race condition that can occur when dispatch.go deletes resolved alerts from the aggregation group following a successful notification. If a firing alert with the same fingerprint is added back to the aggregation group at the same time then the firing alert can be deleted. --------- Signed-off-by: George Robinson <george.robinson@grafana.com>

Applies prometheus#3826

* Fix race condition in dispatch.go This commit fixes a race condition in dispatch.go that would cause a firing alert to be deleted from the aggregation group when instead it should have been flushed. The root cause is a race condition that can occur when dispatch.go deletes resolved alerts from the aggregation group following a successful notification. If a firing alert with the same fingerprint is added back to the aggregation group at the same time then the firing alert can be deleted. --------- Signed-off-by: George Robinson <george.robinson@grafana.com>

* Release v0.28.0-rc.0 * [CHANGE] Templating errors in the SNS integration now return an error. prometheus#3531 prometheus#3879 * [FEATURE] Add a new Microsoft Teams integration based on Flows prometheus#4024 * [FEATURE] Add a new Rocket.Chat integration prometheus#3600 * [FEATURE] Add a new Jira integration prometheus#3590 prometheus#3931 * [FEATURE] Add support for `GOMEMLIMIT`, enable it via the feature flag `--enable-feature=auto-gomemlimit`. prometheus#3895 * [FEATURE] Add support for `GOMAXPROCS`, enable it via the feature flag `--enable-feature=auto-gomaxprocs`. prometheus#3837 * [FEATURE] Add support for limits of silences including the maximum number of active and pending silences, and the maximum size per silence (in bytes). You can use the flags `--silences.max-silences` and `--silences.max-silence-size-bytes` to set them accordingly prometheus#3852 prometheus#3862 prometheus#3866 prometheus#3885 prometheus#3886 prometheus#3877 * [FEATURE] Muted alerts now show whether they are suppressed or not in both the `/api/v2/alerts` endpoint and the Alertmanager UI. prometheus#3793 prometheus#3797 prometheus#3792 * [ENHANCEMENT] Add support for `content`, `username` and `avatar_url` in the Discord integration. `content` and `username` also support templating. prometheus#4007 * [ENHANCEMENT] Only invalidate the silences cache if a new silence is created or an existing silence replaced - should improve latency on both `GET api/v2/alerts` and `POST api/v2/alerts` API endpoint. prometheus#3961 * [ENHANCEMENT] Add image source label to Dockerfile. To get changelogs shown when using Renovate prometheus#4062 * [ENHANCEMENT] Build using go 1.23 prometheus#4071 * [ENHANCEMENT] Support setting a global SMTP TLS configuration. prometheus#3732 * [ENHANCEMENT] The setting `room_id` in the WebEx integration can now be templated to allow for dynamic room IDs. prometheus#3801 * [ENHANCEMENT] Enable setting `message_thread_id` for the Telegram integration. prometheus#3638 * [ENHANCEMENT] Support the `since` and `humanizeDuration` functions to templates. This means users can now format time to more human-readable text. prometheus#3863 * [ENHANCEMENT] Support the `date` and `tz` functions to templates. This means users can now format time in a specified format and also change the timezone to their specific locale. prometheus#3812 * [ENHANCEMENT] Latency metrics now support native histograms. prometheus#3737 * [BUGFIX] Fix the SMTP integration not correctly closing an SMTP submission, which may lead to unsuccessful dispatches being marked as successful. prometheus#4006 * [BUGFIX] The `ParseMode` option is now set explicitly in the Telegram integration. If we don't HTML tags had not been parsed by default. prometheus#4027 * [BUGFIX] Fix a memory leak that was caused by updates silences continuously. prometheus#3930 * [BUGFIX] Fix hiding secret URLs when the URL is incorrect. prometheus#3887 * [BUGFIX] Fix a race condition in the alerts - it was more of a hypothetical race condition that could have occurred in the alert reception pipeline. prometheus#3648 * [BUGFIX] Fix a race condition in the alert delivery pipeline that would cause a firing alert that was delivered earlier to be deleted from the aggregation group when instead it should have been delivered again. prometheus#3826 * [BUGFIX] Fix version in APIv1 deprecation notice. prometheus#3815 * [BUGFIX] Fix crash errors when using `url_file` in the Webhook integration. prometheus#3800 * [BUGFIX] fix `Route.ID()` returns conflicting IDs. prometheus#3803 * [BUGFIX] Fix deadlock on the alerts memory store. prometheus#3715 * [BUGFIX] Fix `amtool template render` when using the default values. prometheus#3725 * [BUGFIX] Fix `webhook_url_file` for both the Discord and Microsoft Teams integrations. prometheus#3728 prometheus#3745 --------- Signed-off-by: SuperQ <superq@gmail.com> Signed-off-by: gotjosh <josue.abreu@gmail.com> Co-authored-by: gotjosh <josue.abreu@gmail.com>

grobinson-grafana added 3 commits May 1, 2024 11:22

Fix lint

ebc8131

Signed-off-by: George Robinson <george.robinson@grafana.com>

Fix spelling mistake in comment

199eaa5

Signed-off-by: George Robinson <george.robinson@grafana.com>

zecke reviewed May 1, 2024

View reviewed changes

store/store.go Show resolved Hide resolved

zecke reviewed May 1, 2024

View reviewed changes

Add test to make sure DeleteIf deletes alert with fingerprint

f365d25

Signed-off-by: George Robinson <george.robinson@grafana.com>

gotjosh approved these changes May 2, 2024

View reviewed changes

grobinson-grafana force-pushed the grobinson/fix-race branch from 0facb18 to f365d25 Compare May 2, 2024 13:24

grobinson-grafana requested a review from gotjosh May 2, 2024 15:15

grobinson-grafana force-pushed the grobinson/fix-race branch from 12d0718 to 4c29976 Compare May 2, 2024 15:36

grobinson-grafana commented May 2, 2024

View reviewed changes

gotjosh approved these changes May 7, 2024

View reviewed changes

gotjosh merged commit ca4c90e into prometheus:main May 7, 2024

grobinson-grafana mentioned this pull request May 7, 2024

Fix repeated resolved alerts #3006

Closed

grobinson-grafana deleted the grobinson/fix-race branch June 25, 2024 15:59

SuperQ mentioned this pull request Oct 16, 2024

Release v0.28.0-rc.0 #4072

Merged

SuperQ mentioned this pull request Dec 19, 2024

Release v0.28.0 #4176

Merged

gigermocas mentioned this pull request Feb 4, 2025

Update alertmanager from 0.27.0 to 0.28.0 gigermocas/prometheus-rpm#8

Merged

titolins mentioned this pull request Mar 4, 2025

cherry-pick: Fix race condition in dispatch.go (#3826) grafana/prometheus-alertmanager#98

Merged

titolins added a commit to grafana/prometheus-alertmanager that referenced this pull request Mar 5, 2025

cherry-pick: Fix race condition in dispatch.go (prometheus#3826)

7a46c3e

Applies prometheus#3826

This was referenced Mar 5, 2025

Alertmanager: upgrade alertmanager to apply dispatcher race condition fix grafana/alerting#299

Merged

Alerting: upgrade alertmanager to apply dispatcher race condition fix grafana/mimir#10814

Merged

Fix race condition in dispatch.go #3826

Fix race condition in dispatch.go #3826

Uh oh!

Conversation

grobinson-grafana commented May 1, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

How to observe the race condition

How to observe the fix

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zecke commented May 1, 2024

Uh oh!

gotjosh left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gotjosh left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

grobinson-grafana commented May 1, 2024 •

edited

Loading