feat(dispatch): add alert markers per group#5047
Conversation
cece42d to
b5e2104
Compare
b5e2104 to
b4055cb
Compare
|
Reviewed the new marker package standalone. LGTM |
| } | ||
|
|
||
| // SetInhibited implements AlertMarker. | ||
| func (m *alertMarker) SetInhibited(fp model.Fingerprint, inhibitedBy []string) { |
There was a problem hiding this comment.
Just out of curiosity the previous SetIhibited used
func (m *MemMarker) SetInhibited(alert model.Fingerprint, ids ...string) {
And we changed it to []string ... Is there a reason we changed?
There was a problem hiding this comment.
Just for consistency in the interface API for both SetInhibited and SetSilenced.
| delete(m.groups, routeID+groupKey) | ||
| } | ||
|
|
||
| func (m *MemMarker) registerMetrics(r prometheus.Registerer) { |
There was a problem hiding this comment.
We should note in the release note that we removed this metric, as it would be too expensive to calculate it across all markers... ? Unless each time we change it a marker we were to add/remove the relevant number from the metric... But yeah, if not worth it/useful definitely we should say it's gone
There was a problem hiding this comment.
I put it in the PR description:
[CHANGE] Remove `alertmanager_marked_alerts`
Do we need to also update any MD files? cc @SoloJacobs
| } else { | ||
| filtered = append(filtered, a) | ||
| } | ||
| // TODO(fabxc): increment muted alerts counter if muted. |
There was a problem hiding this comment.
Are this TODO and the one above still relevant?
There was a problem hiding this comment.
I think he wanted to add/change the metrics here.
I am not sure if we need it.
I can remove these but my changes don't touch these.
724c634 to
9a5c33e
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughMigrates from a global marker to per-aggregation-group markers, removes callback-based alert-status plumbing (Options.AlertStatusFunc), introduces a marker package with context propagation, switches code from ChangesPer-group marker migration
Sequence DiagramsequenceDiagram
participant Client
participant APIv2
participant Dispatcher
participant AggrGroup
participant markerGroup as GroupMarker
participant alertMarker as AlertMarker
participant Context
Client->>APIv2: GET /api/v2/alerts
APIv2->>Dispatcher: request groups + alertFilter(ctx)
Dispatcher->>markerGroup: ensure group-level marker (marker.NewGroupMarker)
Dispatcher->>AggrGroup: create aggrGroup (allocates alertMarker = NewAlertMarker())
AggrGroup->>Context: ctx' = marker.WithContext(ctx, alertMarker)
Dispatcher->>APIv2: return groups
APIv2->>alertMarker: Status(fp) via marker.FromContext(ctx') for each alert
alertMarker-->>APIv2: alert.AlertStatus
APIv2-->>Client: OpenAPI response with per-alert statuses
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
|
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
inhibit/inhibit_test.go (1)
47-52:⚠️ Potential issue | 🟡 MinorKeep the empty
InhibitedByassertion on the active path.Adding the
AlertStateActivecheck is good, but droppingrequire.Empty(t, status.InhibitedBy, ...)weakens this helper again. The non-muted branch should still prove both that the marker was written and that no inhibition IDs were recorded.Suggested tweak
} else { require.Equal(t, alert.AlertStateActive, status.State, msgAndArgs...) + require.Empty(t, status.InhibitedBy, msgAndArgs...) } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@inhibit/inhibit_test.go` around lines 47 - 52, The non-muted branch dropped the assertion that no inhibition IDs were recorded; when wantMuted is false you should also assert status.InhibitedBy is empty. Modify the else branch that checks require.Equal(t, alert.AlertStateActive, status.State, msgAndArgs...) to additionally call require.Empty(t, status.InhibitedBy, msgAndArgs...) so the helper verifies both the active state and that no inhibition IDs were written (referencing wantMuted, status.State, status.InhibitedBy, and AlertStateActive).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@api/v2/api.go`:
- Around line 434-442: The code creates a single request-scoped tempMarker via
marker.NewAlertMarker() and passes it into api.alertFilter, which lets
setAlertStatus mutate that shared tempMarker causing status races across groups;
instead, change the logic so /alerts/groups does not reuse a single tempMarker
for the entire request: compute or fetch each group's actual marker/status
(e.g., call the group's marker lookup or instantiate a fresh marker per group)
and pass that per-group marker into setAlertStatus/api.alertFilter rather than
the global tempMarker, ensuring you stop using tempMarker.Status(fp) as a
request-wide cache; update the code paths around api.alertFilter,
setAlertStatus, and any callers that currently rely on the shared tempMarker
(also adjust the same pattern at the other noted spot around lines 469-470).
---
Duplicate comments:
In `@inhibit/inhibit_test.go`:
- Around line 47-52: The non-muted branch dropped the assertion that no
inhibition IDs were recorded; when wantMuted is false you should also assert
status.InhibitedBy is empty. Modify the else branch that checks require.Equal(t,
alert.AlertStateActive, status.State, msgAndArgs...) to additionally call
require.Empty(t, status.InhibitedBy, msgAndArgs...) so the helper verifies both
the active state and that no inhibition IDs were written (referencing wantMuted,
status.State, status.InhibitedBy, and AlertStateActive).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 7fbf439a-752e-4283-977c-8c9f86a7d849
📒 Files selected for processing (2)
api/v2/api.goinhibit/inhibit_test.go
| c.emit(ch, float64(suppressed), labelValues...) | ||
| labelValues[0] = string(alert.AlertStateUnprocessed) | ||
| c.emit(ch, float64(unprocessed), labelValues...) | ||
| return true |
There was a problem hiding this comment.
Here do we want to exclude groups that have 0 in each state? or not?
There was a problem hiding this comment.
I guess not, because these are gauge metrics and can drop to zero, so I assume having zero values in between changes is desired compared to missing metrics.
What do you think?
| // Set alert's current status based on its label set. | ||
| api.setAlertStatus(ctx, a.Labels) | ||
| // The inhibitor and silencer write to m via the context. | ||
| ctx = marker.WithContext(ctx, m) |
There was a problem hiding this comment.
Do you want to create this context outside the function itself? This way we would avoid creating a new context each time the returned func is called
There was a problem hiding this comment.
No, we need to keep it as is since each call to the returned function is a new operation that we want to trace as a child span.
This specific call returns a derived context which is pointing tot he parent.
| status = &groupStatus{} | ||
| m.groups[key] = status | ||
| } | ||
| status.mutedBy = timeIntervalNames |
There was a problem hiding this comment.
If timeIntervalNames is nil or empty, shoudl we actually delete from the map, here?
There was a problem hiding this comment.
I am not sure, but this is just moving existing logic and is not new code.
| "pending", len(allIDs)-len(activeIDs), | ||
| ) | ||
| // TODO: remove this sort once the marker is removed. | ||
| sort.Strings(activeIDs) |
There was a problem hiding this comment.
Are we sure we shouldn't still sort for the API to be consistent and the results comparable? not necessarily here, but somewhere...
There was a problem hiding this comment.
I am not sure, but does it matter if the IDs are in a specific order or do we guarantee this?
| status = &groupStatus{} | ||
| m.groups[key] = status | ||
| } | ||
| status.mutedBy = timeIntervalNames |
There was a problem hiding this comment.
In other places you clone the slice, but not here... Any particular reason?
There was a problem hiding this comment.
This is just moving existing logic and is not new code.
The PR is not focused on Group Markers used by time intervals so I would leave any changes related to those for later.
a28f0b4 to
db964e5
Compare
This change adds alert markers to the aggregation groups in dispatcher. Alert markers replace the global marker and are used to track the state of alerts in each aggregation group. This change touches many components of the alertmanager. Per Group alert markers are passed to the notifiers and then inhibitor and silencer using context. The API has no breaking changes: - /alerts uses a temporary marker to track the state of alerts - /alerts/groups returns the group markers Update metrics to use group markers. The `alertmanager_alerts` metric is moved to dispatcher. The `alertmanager_marked_alerts` metric is removed. By default it behaves the same as before, by aggregating all alerts in the groups. Enabling `group-key-in-metrics` flag will cause the metrics to be grouped by `group_key`. Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
db964e5 to
c146730
Compare
Compute each alert's status with a fresh AlertMarker at query time instead of reading the group's actual marker or sharing a single request-scoped marker across all groups. This addresses two issues: 1. A shared tempMarker across groups let setAlertStatus writes from one group overwrite another, since the same fingerprint can appear in multiple aggregation groups. 2. Reading the group's actual marker (alertGroup.AlertStatuses) makes the response depend on whether the notification pipeline has run between calls. The endpoint should return the same output when no new alerts, silences, or inhibits have been added. The filter callback also uses a fresh per-alert marker when none is provided, so prediction for one alert does not leak into another. Signed-off-by: Siavash Safi <siavash@cloudflare.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
marker/marker.go (1)
26-33: ⚡ Quick winDefine slice ownership/mutation semantics in the interface contract.
[]stringis accepted and returned on goroutine-safe interfaces, but ownership rules are not specified. Please document whether implementations copy inputs and whether returned slices are immutable/copies, so callers cannot accidentally introduce aliasing races.💡 Suggested doc update
type AlertMarker interface { // SetInhibited sets the inhibitedBy for the given fingerprint. - // If inhibitedBy is empty, it clears the inhibitedBy. + // Implementations must not retain caller-owned slice backing arrays. + // If inhibitedBy is empty, it clears the inhibitedBy. SetInhibited(fingerprint model.Fingerprint, inhibitedBy []string) // SetSilenced sets the silencedBy for the given fingerprint. - // If silencedBy is empty, it clears the silencedBy. + // Implementations must not retain caller-owned slice backing arrays. + // If silencedBy is empty, it clears the silencedBy. SetSilenced(fingerprint model.Fingerprint, silencedBy []string) @@ type GroupMarker interface { // Muted returns true if the group is muted, otherwise false. If the group // is muted then it also returns the names of the time intervals that muted - // it. + // it. Callers must treat the returned slice as immutable. Muted(routeID, groupKey string) ([]string, bool)Also applies to: 49-57
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@marker/marker.go` around lines 26 - 33, Update the interface documentation to define slice ownership/mutation semantics: for SetInhibited and SetSilenced (and any other methods in the same interface around lines 49-57), state explicitly that implementations MUST copy the input []string before storing it (so callers retain ownership) and that any method returning []string MUST return a newly allocated copy (so callers may mutate without causing aliasing/races); mention this contract in the comments for SetInhibited and SetSilenced and in the corresponding getter methods to make the goroutine-safety guarantee explicit.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@dispatch/metric.go`:
- Around line 40-45: newNoopDispatcherMetrics currently constructs a
DispatcherMetrics with several Prometheus metrics but omits initializing
aggrGroupCreationRetries and aggrGroupCreationGivenUp, which can cause
nil-interface panics when .Inc() is later called; update
newNoopDispatcherMetrics to initialize those two fields
(aggrGroupCreationRetries and aggrGroupCreationGivenUp) with non-nil
prometheus.Counter values (e.g., prometheus.NewCounter with empty CounterOpts,
matching how aggrGroupLimitReached is created) so all metric fields on
DispatcherMetrics are always non-nil.
---
Nitpick comments:
In `@marker/marker.go`:
- Around line 26-33: Update the interface documentation to define slice
ownership/mutation semantics: for SetInhibited and SetSilenced (and any other
methods in the same interface around lines 49-57), state explicitly that
implementations MUST copy the input []string before storing it (so callers
retain ownership) and that any method returning []string MUST return a newly
allocated copy (so callers may mutate without causing aliasing/races); mention
this contract in the comments for SetInhibited and SetSilenced and in the
corresponding getter methods to make the goroutine-safety guarantee explicit.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a1e23e96-57fb-4c00-b1f1-72beead7e651
📒 Files selected for processing (31)
alert/state.goapi/api.goapi/v2/api.goapi/v2/api_test.goapi/v2/compat.gocmd/alertmanager/main.godispatch/dispatch.godispatch/dispatch_bench_test.godispatch/dispatch_test.godispatch/metric.gofeaturecontrol/featurecontrol.goinhibit/inhibit.goinhibit/inhibit_bench_test.goinhibit/inhibit_test.gomarker/alert.gomarker/alert_test.gomarker/context.gomarker/group.gomarker/group_test.gomarker/marker.gomarker/status.gonotify/mute.gonotify/mute_test.gonotify/notify.goprovider/mem/mem.goprovider/mem/mem_test.gosilence/silence.gosilence/silence_bench_test.gosilence/silence_test.gotypes/types.gotypes/types_test.go
💤 Files with no reviewable changes (4)
- api/api.go
- types/types_test.go
- types/types.go
- provider/mem/mem.go
🚧 Files skipped from review as they are similar to previous changes (23)
- marker/group.go
- marker/group_test.go
- alert/state.go
- inhibit/inhibit_bench_test.go
- featurecontrol/featurecontrol.go
- marker/context.go
- dispatch/dispatch_bench_test.go
- api/v2/compat.go
- cmd/alertmanager/main.go
- silence/silence_test.go
- inhibit/inhibit.go
- marker/alert_test.go
- notify/mute.go
- silence/silence.go
- api/v2/api_test.go
- marker/alert.go
- notify/mute_test.go
- silence/silence_bench_test.go
- inhibit/inhibit_test.go
- notify/notify.go
- api/v2/api.go
- dispatch/dispatch.go
- dispatch/dispatch_test.go
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Replace the global alert marker with per-aggregation-group alert markers in the dispatcher. Per-group markers track alert state in each group and are passed to the notifier, inhibitor, and silencer through context. A new marker package centralises AlertMarker and GroupMarker, replacing the marker types previously living in the types package (MemMarker, AlertState*, AlertStatus, AlertMarker, GroupMarker). The silencer and inhibitor no longer hold a marker reference; they read it from the context and defer writes until the call returns. The API has no breaking changes: - /alerts predicts each alert's status at query time using a temporary marker. - /alerts/groups predicts each alert's status per (alert, group) with a fresh marker, so the response is stable across calls regardless of whether the notification pipeline has run in between (assuming no new alerts, silences, or inhibits) and prediction for one alert does not leak into another. Metrics are updated to use the group markers: - `alertmanager_alerts` moves to the dispatcher and is emitted by a custom collector that walks the per-group markers. - `alertmanager_marked_alerts` is removed (it would be too expensive to compute across all per-group markers). - A new `group-key-in-metrics` feature flag adds a `group_key` label to `alertmanager_alerts`; by default the metric still aggregates across all groups. - A no-op DispatcherMetrics is used when no registerer is provided so the dispatcher never dereferences nil counters. Signed-off-by: Siavash Safi <siavash@cloudflare.com>
This change adds alert markers to the aggregation groups in dispatcher.
Alert markers replace the global marker and are used to track the state of alerts in each aggregation group.
This change touches many components of the alertmanager.
Per Group alert markers are passed to the notifiers and then inhibitor and silencer using context.
The API has no breaking changes:
The metrics are also updated to use group markers.
The
alertmanager_alertsmetric is moved to dispatcher.The
alertmanager_marked_alertsmetric is removed.By default it behaves the same as before, by aggregating all alerts in the groups.
Enabling
group-key-in-metricsflag will cause the metrics to be grouped bygroup_key.Pull Request Checklist
Please check all the applicable boxes.
Which user-facing changes does this PR introduce?
Summary by CodeRabbit
New Features
Bug Fixes
Refactor