fix(marker): stop state leakage from aggregation groups #4438

siavashs · 2025-06-23T15:27:11Z

This change makes aggregation groups delete resolved alerts from the marker, which avoids the leakage of ghost states described in #4402.

Fixes: #4402

types/types.go

SuperQ

Looking at the test changes, it doesn't seem like this exercises the problem behavior. Would you mind including some additional testing to make sure we don't break in the future?

This change makes aggregation groups to delete resolved alerts from marker, therefore avoiding the leakage of ghost states mentioned in prometheus#4402. Signed-off-by: Siavash Safi <siavash@cloudflare.com>

siavashs · 2025-10-18T22:30:46Z

Looking at the test changes, it doesn't seem like this exercises the problem behavior. Would you mind including some additional testing to make sure we don't break in the future?

Done, added 3 scenarios. The code is changed to make sure that we don't accidentally delete a marker due to a race condition. Please have a look and let me know if we should add more.

While adding these, I noticed that the whole Marker setup is a weak point in terms of consistency. We need to come up with a better solution at some point.

SuperQ

Nice, thanks!

grobinson-grafana · 2025-11-06T20:18:14Z

dispatch/dispatch.go

 			ag.logger.Error("error on delete alerts", "err", err)
+		} else {
+			// Delete markers for resolved alerts that are not in the store.
+			for _, alert := range resolvedSlice {


Just a heads up: this is a race condition, for the same reason we had to implement DeleteIfNotModified. There is a missing check to make sure the deleted alert is the same alert returned from ag.alerts.Get(alert.Fingerprint()).

Without this check, the following can happen: we delete the alert in DeleteIfNotModified, and then a new alert is received before we call Get. We then delete the marker of the new alert by mistake.
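The kind of check described above can be sketched as a compare-then-delete under a single lock. This is a minimal illustration, not the Alertmanager code: `group`, `alert`, `updatedAt`, `put`, `hasMarker`, and `deleteResolved` are all hypothetical names, and keeping the store and the markers behind one mutex is an assumption made only to keep the example short.

```go
package main

import (
	"fmt"
	"sync"
)

// fingerprint and alert are simplified stand-ins, not the Alertmanager types.
type fingerprint uint64

type alert struct {
	fp        fingerprint
	updatedAt int64 // hypothetical version stamp, e.g. a Unix timestamp
}

// group is a hypothetical container that guards the alert store and the
// marker state with one mutex, so the check and both deletes are atomic.
type group struct {
	mtx     sync.Mutex
	alerts  map[fingerprint]alert
	markers map[fingerprint]string
}

func newGroup() *group {
	return &group{
		alerts:  map[fingerprint]alert{},
		markers: map[fingerprint]string{},
	}
}

// put stores (or replaces) an alert and marks it as active.
func (g *group) put(a alert) {
	g.mtx.Lock()
	defer g.mtx.Unlock()
	g.alerts[a.fp] = a
	g.markers[a.fp] = "active"
}

// hasMarker reports whether a marker exists for the fingerprint.
func (g *group) hasMarker(fp fingerprint) bool {
	g.mtx.Lock()
	defer g.mtx.Unlock()
	_, ok := g.markers[fp]
	return ok
}

// deleteResolved removes the alert and its marker only if the store still
// holds the exact snapshot we decided to delete (or nothing at all). It
// reports whether the marker was removed.
func (g *group) deleteResolved(snapshot alert) bool {
	g.mtx.Lock()
	defer g.mtx.Unlock()
	if cur, ok := g.alerts[snapshot.fp]; ok && cur != snapshot {
		return false // a newer alert reused the fingerprint; keep its marker
	}
	delete(g.alerts, snapshot.fp)
	delete(g.markers, snapshot.fp)
	return true
}

func main() {
	g := newGroup()
	old := alert{fp: 1, updatedAt: 100}
	g.put(old)

	// A newer alert with the same fingerprint arrives before cleanup runs.
	g.put(alert{fp: 1, updatedAt: 200})

	fmt.Println(g.deleteResolved(old)) // false: stale snapshot, nothing deleted
	fmt.Println(g.hasMarker(1))        // true: the new alert keeps its marker
}
```

In this sketch the stale snapshot is rejected because the stored alert no longer equals it, so the marker of the newer alert survives.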

siavashs · Nov 7, 2025

As we discussed on Slack, this is safe: even if we delete the marker, the alert status would then be unprocessed:

alertmanager/types/types.go

Lines 261 to 274 in 52eb1fc

	// Status implements AlertMarker.
	func (m *MemMarker) Status(alert model.Fingerprint) AlertStatus {
		m.mtx.RLock()
		defer m.mtx.RUnlock()

		if s, found := m.alerts[alert]; found {
			return *s
		}
		return AlertStatus{
			State:       AlertStateUnprocessed,
			SilencedBy:  []string{},
			InhibitedBy: []string{},
		}
	}
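The fallback in the quoted Status method can be exercised with a small self-contained sketch. The types below (`markerSketch`, `status`, `state`) and the helpers `setActive`, `remove`, and `count` are simplified, hypothetical stand-ins and not the real `types` package. The sketch only shows that a lookup after a marker is deleted falls through to the unprocessed default.

```go
package main

import (
	"fmt"
	"sync"
)

// Simplified stand-ins for illustration; not the Alertmanager types.
type fingerprint uint64

type state string

const (
	stateUnprocessed state = "unprocessed"
	stateActive      state = "active"
)

type status struct {
	State       state
	SilencedBy  []string
	InhibitedBy []string
}

// markerSketch mimics the shape of an in-memory marker: a map guarded by
// a read-write mutex.
type markerSketch struct {
	mtx    sync.RWMutex
	alerts map[fingerprint]*status
}

func newMarkerSketch() *markerSketch {
	return &markerSketch{alerts: map[fingerprint]*status{}}
}

// setActive is a hypothetical helper that records an active status.
func (m *markerSketch) setActive(fp fingerprint) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	m.alerts[fp] = &status{State: stateActive, SilencedBy: []string{}, InhibitedBy: []string{}}
}

// remove drops the stored status for the fingerprint, if any.
func (m *markerSketch) remove(fp fingerprint) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	delete(m.alerts, fp)
}

// count returns how many statuses are currently stored.
func (m *markerSketch) count() int {
	m.mtx.RLock()
	defer m.mtx.RUnlock()
	return len(m.alerts)
}

// statusOf follows the quoted logic: return the stored status when found,
// otherwise fall back to an unprocessed status with empty slices.
func (m *markerSketch) statusOf(fp fingerprint) status {
	m.mtx.RLock()
	defer m.mtx.RUnlock()
	if s, found := m.alerts[fp]; found {
		return *s
	}
	return status{State: stateUnprocessed, SilencedBy: []string{}, InhibitedBy: []string{}}
}

func main() {
	m := newMarkerSketch()
	m.setActive(1)
	fmt.Println(m.statusOf(1).State) // active

	m.remove(1)
	fmt.Println(m.statusOf(1).State) // unprocessed
	fmt.Println(m.count())           // 0
}
```

Deleting the entry also shrinks the map, which is the property the fix relies on to stop stale state from accumulating.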

siavashs marked this pull request as draft June 23, 2025 15:30

siavashs force-pushed the marker-gc branch from 8f1e2ca to 5bf5ad2 Compare June 23, 2025 15:33

OGKevin reviewed Aug 7, 2025

types/types.go (outdated)

siavashs force-pushed the marker-gc branch 3 times, most recently from 20c2483 to cdff2d7 Compare August 27, 2025 08:43

siavashs marked this pull request as ready for review August 27, 2025 08:55

siavashs force-pushed the marker-gc branch 2 times, most recently from 101e444 to 83a127e Compare August 27, 2025 11:37

siavashs changed the title from "feat(marker): add gc" to "fix(marker): stop state leakage from aggregation groups" Aug 27, 2025

SuperQ reviewed Oct 17, 2025

siavashs force-pushed the marker-gc branch from 397f32a to 7bd0e45 Compare October 18, 2025 22:25

siavashs requested a review from SuperQ October 18, 2025 22:25

fix(marker): stop state leakage from aggregation groups

d7f2c92

siavashs force-pushed the marker-gc branch from 7bd0e45 to d7f2c92 Compare October 18, 2025 22:27

SuperQ approved these changes Oct 19, 2025

SuperQ merged commit f7a0d01 into prometheus:main Oct 19, 2025
11 checks passed

siavashs deleted the marker-gc branch October 19, 2025 15:52

grobinson-grafana reviewed Nov 6, 2025

SoloJacobs mentioned this pull request Nov 24, 2025

Release v0.30.0-rc.0 #4770

Merged
