As mentioned in #4392, the alerts channel that the dispatcher subscribes to can fill up.
We measured this with the patch in #4364, and the channel can indeed hit 100% capacity.
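For intuition, here is a minimal, self-contained toy model of the saturation pattern (not Alertmanager code; the buffer size and rates are made up): a bounded channel with a steady producer and a consumer that stalls periodically, as the dispatcher does for maintenance. The buffer spikes toward 100% during each stall:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	alerts := make(chan int, 200) // bounded buffer, like the dispatcher's alerts channel

	// Producer: a steady stream of incoming alerts (~2000/s).
	go func() {
		for i := 0; ; i++ {
			alerts <- i // blocks (back-pressure) once the buffer is full
			time.Sleep(500 * time.Microsecond)
		}
	}()

	// Consumer: normally keeps up, but stalls for 300ms every second,
	// simulating the maintenance pause.
	go func() {
		pause := time.NewTicker(time.Second)
		for {
			select {
			case <-pause.C:
				time.Sleep(300 * time.Millisecond) // simulated maintenance
			case <-alerts:
			}
		}
	}()

	// Sample buffer utilization: it spikes toward 100% during each stall.
	for i := 0; i < 10; i++ {
		time.Sleep(250 * time.Millisecond)
		fmt.Printf("alerts channel at %3d%% capacity\n", 100*len(alerts)/cap(alerts))
	}
}
```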
But the root cause of this bottleneck is not the channel capacity itself; the capacity only acts as a buffer.
The root cause is the maintenance done in the Dispatcher:
alertmanager/dispatch/dispatch.go, lines 187 to 201 in f6b942c:

```go
func (d *Dispatcher) doMaintenance() {
	d.mtx.Lock()
	defer d.mtx.Unlock()
	for _, groups := range d.aggrGroupsPerRoute {
		for _, ag := range groups {
			if ag.empty() {
				ag.stop()
				d.marker.DeleteByGroupKey(ag.routeID, ag.GroupKey())
				delete(groups, ag.fingerprint())
				d.aggrGroupsNum--
				d.metrics.aggrGroups.Dec()
			}
		}
	}
}
```
which happens every 30s:
alertmanager/dispatch/dispatch.go, line 149 in f6b942c:

```go
maintenance := time.NewTicker(30 * time.Second)
```
When Alertmanager receives a high alert volume, the Dispatcher will dynamically create a lot of AggregationGroups, depending on the configuration.
But we observe that the dispatcher's alert-processing rate is very spiky, since every 30s it pauses to do maintenance and stops processing alerts.
And the more AGs there are, the longer maintenance takes, since it has to loop over tens of thousands of AGs (AG explosion is described in #4503).
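To make the starvation mechanism concrete, here is a stripped-down, self-contained sketch of the pattern (the dispatcher and group types below are stand-ins, not the upstream code): alert intake and maintenance share one select loop, so the alerts branch is not serviced for as long as the maintenance walk takes, and the walk cost grows linearly with the number of groups:

```go
package main

import (
	"fmt"
	"time"
)

// dispatcher is a stand-in for the real Dispatcher: one goroutine
// multiplexes alert intake and periodic maintenance.
type dispatcher struct {
	alerts chan string
	groups map[string][]string // groupKey -> pending alerts, a stand-in for AGs
}

func (d *dispatcher) run() {
	maintenance := time.NewTicker(30 * time.Second)
	defer maintenance.Stop()
	for {
		select {
		case a := <-d.alerts:
			// Stand-in for processAlert: route the alert to its group.
			d.groups[a] = append(d.groups[a], a)
		case <-maintenance.C:
			// While this branch runs, the alerts branch above is starved.
			// The walk is O(number of groups); with tens of thousands of
			// AGs the pause shows up as a dip in the processing rate.
			start := time.Now()
			for key, pending := range d.groups {
				if len(pending) == 0 { // stand-in for ag.empty()
					delete(d.groups, key)
				}
			}
			fmt.Println("maintenance pause:", time.Since(start))
		}
	}
}

func main() {
	d := &dispatcher{alerts: make(chan string, 200), groups: map[string][]string{}}
	go d.run()
	time.Sleep(65 * time.Second) // long enough for two maintenance ticks
}
```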
We can try to make maintenance happen less often, because, as the graph shows, maintenance does not reduce the number of AGs significantly; so the interval can be made configurable and/or set to the same value as the data maintenance interval (default 15m).
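A minimal sketch of what a configurable interval could look like (the option type, field, and method names below are hypothetical, invented for illustration, not an existing Alertmanager API):

```go
package dispatch

import "time"

// DispatcherOptions is a hypothetical configuration knob for this sketch.
type DispatcherOptions struct {
	// MaintenanceInterval controls how often empty aggregation groups
	// are garbage-collected; zero means "use the default".
	MaintenanceInterval time.Duration
}

func (o *DispatcherOptions) maintenanceInterval() time.Duration {
	if o == nil || o.MaintenanceInterval <= 0 {
		// Proposed default: align with the data maintenance interval
		// (15m) instead of the hard-coded 30s.
		return 15 * time.Minute
	}
	return o.MaintenanceInterval
}
```

The run loop's ticker would then become `maintenance := time.NewTicker(opts.maintenanceInterval())`, and the current behavior would stay reachable by setting the interval back to 30s.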
Ultimately we need a less congested AG management system.