Skip to content

Conversation

@siavashs
Copy link
Contributor

@siavashs siavashs commented Jun 23, 2025

This change makes aggregation groups to delete resolved alerts from marker, therefore avoiding the leakage of ghost states mentioned in #4402.

Fixes: #4402

@siavashs siavashs marked this pull request as draft June 23, 2025 15:30
@siavashs siavashs force-pushed the marker-gc branch 3 times, most recently from 20c2483 to cdff2d7 Compare August 27, 2025 08:43
@siavashs siavashs marked this pull request as ready for review August 27, 2025 08:55
@siavashs siavashs force-pushed the marker-gc branch 2 times, most recently from 101e444 to 83a127e Compare August 27, 2025 11:37
@siavashs siavashs changed the title feat(marker): add gc fix(marker): stop state leakage from aggregation groups Aug 27, 2025
Copy link
Member

@SuperQ SuperQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the test changes, it doesn't seem like this exercises the problem behavior. Would you mind including some additional testing to make sure we don't break in the future?

This change makes aggregation groups to delete resolved alerts from marker,
therefore avoiding the leakage of ghost states mentioned in prometheus#4402.

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
@siavashs
Copy link
Contributor Author

Looking at the test changes, it doesn't seem like this exercises the problem behavior. Would you mind including some additional testing to make sure we don't break in the future?

Done, added 3 scenarios. The code is change to make sure that we don't accidentally delete a marker due to a race condition. Please have a look and let me know if we should add more.

But while adding these I noticed the whole Marker setup to be a week point in terms of consistency, we need to come up with a better solution at some point.

Copy link
Member

@SuperQ SuperQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks!

@SuperQ SuperQ merged commit f7a0d01 into prometheus:main Oct 19, 2025
11 checks passed
@siavashs siavashs deleted the marker-gc branch October 19, 2025 15:52
ag.logger.Error("error on delete alerts", "err", err)
} else {
// Delete markers for resolved alerts that are not in the store.
for _, alert := range resolvedSlice {
Copy link
Collaborator

@grobinson-grafana grobinson-grafana Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just heads up this is a race condition for the same reason we had to implement DeleteIfNotModified. There is a missing check to make sure the deleted alert is the same alert returned from ag.alerts.Get(alert.Fingerprint()).

Without this check what can happen is we delete the alert in between DeletedIfNotModified and then before we call Get a new alert is received. What happens then is we delete the marker of the new alert by mistake.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed on Slack, this is safe since even if we delete the marker the alert status would then be unprocessed:

alertmanager/types/types.go

Lines 261 to 274 in 52eb1fc

// Status implements AlertMarker.
func (m *MemMarker) Status(alert model.Fingerprint) AlertStatus {
m.mtx.RLock()
defer m.mtx.RUnlock()
if s, found := m.alerts[alert]; found {
return *s
}
return AlertStatus{
State: AlertStateUnprocessed,
SilencedBy: []string{},
InhibitedBy: []string{},
}
}

@SoloJacobs SoloJacobs mentioned this pull request Nov 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Alert marker fills up with deleted alerts' states

4 participants