-
Notifications
You must be signed in to change notification settings - Fork 2.4k
fix(marker): stop state leakage from aggregation groups #4438
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
20c2483 to
cdff2d7
Compare
101e444 to
83a127e
Compare
SuperQ
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the test changes, it doesn't seem like this exercises the problem behavior. Would you mind including some additional testing to make sure we don't break in the future?
This change makes aggregation groups to delete resolved alerts from marker, therefore avoiding the leakage of ghost states mentioned in prometheus#4402. Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Done, added 3 scenarios. The code is change to make sure that we don't accidentally delete a marker due to a race condition. Please have a look and let me know if we should add more. But while adding these I noticed the whole Marker setup to be a week point in terms of consistency, we need to come up with a better solution at some point. |
SuperQ
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks!
| ag.logger.Error("error on delete alerts", "err", err) | ||
| } else { | ||
| // Delete markers for resolved alerts that are not in the store. | ||
| for _, alert := range resolvedSlice { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just heads up this is a race condition for the same reason we had to implement DeleteIfNotModified. There is a missing check to make sure the deleted alert is the same alert returned from ag.alerts.Get(alert.Fingerprint()).
Without this check what can happen is we delete the alert in between DeletedIfNotModified and then before we call Get a new alert is received. What happens then is we delete the marker of the new alert by mistake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed on Slack, this is safe since even if we delete the marker the alert status would then be unprocessed:
Lines 261 to 274 in 52eb1fc
| // Status implements AlertMarker. | |
| func (m *MemMarker) Status(alert model.Fingerprint) AlertStatus { | |
| m.mtx.RLock() | |
| defer m.mtx.RUnlock() | |
| if s, found := m.alerts[alert]; found { | |
| return *s | |
| } | |
| return AlertStatus{ | |
| State: AlertStateUnprocessed, | |
| SilencedBy: []string{}, | |
| InhibitedBy: []string{}, | |
| } | |
| } |
This change makes aggregation groups to delete resolved alerts from marker, therefore avoiding the leakage of ghost states mentioned in #4402.
Fixes: #4402