Dispatcher maintenance is a bottleneck in alert processing pipeline #4540

@siavashs


As mentioned in #4392, the alerts channel that the dispatcher subscribes to can fill up.
We measured this with the patch in #4364, and it can indeed hit 100% capacity:

Image

But the root cause of this bottleneck is not the channel capacity itself; the capacity only acts as a buffer.
The root cause is the maintenance done in the Dispatcher:

func (d *Dispatcher) doMaintenance() {
	d.mtx.Lock()
	defer d.mtx.Unlock()
	for _, groups := range d.aggrGroupsPerRoute {
		for _, ag := range groups {
			if ag.empty() {
				ag.stop()
				d.marker.DeleteByGroupKey(ag.routeID, ag.GroupKey())
				delete(groups, ag.fingerprint())
				d.aggrGroupsNum--
				d.metrics.aggrGroups.Dec()
			}
		}
	}
}

which happens every 30s:

maintenance := time.NewTicker(30 * time.Second)

When Alertmanager receives a high alert volume, Dispatcher will dynamically create a lot of AggregationGroups depending on configuration:

Image

But we observe that the rate of alert processing by the dispatcher is very spiky: every 30s it pauses to do maintenance and stops processing alerts.

Also, the more AGs there are, the longer the maintenance takes, as it has to loop over tens of thousands of AGs (the AG explosion is described in #4503).
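To make the stall concrete, here is a minimal toy model, not the actual Dispatcher code. It assumes a single goroutine serves both the alert channel and the maintenance tick, so nothing is drained while maintenance runs. The names `runLoop`, `fillWhileBlocked` and `simulate`, and the buffer size of 8, are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// runLoop is a toy stand-in for a dispatcher loop: one goroutine serves both
// the alert channel and the maintenance tick, so the two can never overlap.
func runLoop(alerts <-chan string, tick <-chan struct{}, process func(string), maintain func()) {
	for {
		select {
		case a, ok := <-alerts:
			if !ok {
				return // channel closed and fully drained
			}
			process(a)
		case <-tick:
			maintain() // no alert is received until this returns
		}
	}
}

// fillWhileBlocked sends alerts without blocking until the buffer is full and
// returns how many sends were accepted.
func fillWhileBlocked(alerts chan<- string) int {
	n := 0
	for {
		select {
		case alerts <- fmt.Sprintf("alert-%d", n):
			n++
		default:
			return n // buffer is at 100% capacity
		}
	}
}

// simulate runs a single maintenance pass. It reports how many alerts piled
// up in the buffer during that pass and how many were processed afterwards.
func simulate(capacity int) (queued, processed int) {
	alerts := make(chan string, capacity)
	tick := make(chan struct{}, 1)
	done := make(chan struct{})

	tick <- struct{}{} // exactly one maintenance tick is pending
	go func() {
		defer close(done)
		runLoop(alerts, tick,
			func(string) { processed++ },
			func() {
				// Producers keep sending while maintenance holds the loop.
				queued = fillWhileBlocked(alerts)
				close(alerts) // end the toy run once the backlog is drained
			})
	}()
	<-done
	return queued, processed
}

func main() {
	queued, processed := simulate(8)
	fmt.Printf("queued during maintenance: %d, processed after: %d\n", queued, processed)
}
```

In this model the buffer always reaches full capacity during the maintenance pass, whatever its size. That matches the observation above: a larger channel only delays the point at which senders start to block.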

We can try to make maintenance happen less often because, as the graph shows, it does not reduce the number of AGs significantly. The interval could be made configurable and/or reuse the data maintenance interval (default 15m).
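A configurable interval could look roughly like the sketch below. This is only an illustration: the `Options` type, its `MaintenanceInterval` field and the `maintenanceInterval` helper are hypothetical and not part of Alertmanager's current API. The 30s default mirrors the hard-coded ticker shown above.

```go
package main

import (
	"fmt"
	"time"
)

// defaultMaintenanceInterval mirrors the hard-coded 30s ticker from the issue.
const defaultMaintenanceInterval = 30 * time.Second

// Options is a hypothetical settings struct, used here for illustration only.
type Options struct {
	// MaintenanceInterval is how often empty aggregation groups are swept.
	// Zero or negative values fall back to the default.
	MaintenanceInterval time.Duration
}

// maintenanceInterval returns the configured interval, or the default when
// the field is unset or invalid.
func (o Options) maintenanceInterval() time.Duration {
	if o.MaintenanceInterval <= 0 {
		return defaultMaintenanceInterval
	}
	return o.MaintenanceInterval
}

func main() {
	unset := Options{}
	relaxed := Options{MaintenanceInterval: 15 * time.Minute}

	fmt.Println(unset.maintenanceInterval())   // 30s
	fmt.Println(relaxed.maintenanceInterval()) // 15m0s

	// The ticker would then be built from the option instead of a constant.
	maintenance := time.NewTicker(relaxed.maintenanceInterval())
	defer maintenance.Stop()
}
```

Falling back to the current 30s when the option is unset would keep existing deployments unchanged, while operators with many AGs could opt into a longer interval.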

Ultimately we need a less congested AG management system.
