As mentioned in #4392, the alerts channel that the dispatcher subscribes to can fill up.
We measured this with the patch in #4364, and the channel can indeed hit 100% capacity.
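For intuition, here is a minimal, self-contained toy model of the saturation pattern (not Alertmanager code; the buffer size and rates are made up): a bounded channel with a steady producer and a consumer that stalls periodically, as the dispatcher does for maintenance. The buffer spikes toward 100% during each stall:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	alerts := make(chan int, 200) // bounded buffer, like the dispatcher's alerts channel

	// Producer: a steady stream of incoming alerts (~2000/s).
	go func() {
		for i := 0; ; i++ {
			alerts <- i // blocks (back-pressure) once the buffer is full
			time.Sleep(500 * time.Microsecond)
		}
	}()

	// Consumer: normally keeps up, but stalls for 300ms every second,
	// simulating the maintenance pause.
	go func() {
		pause := time.NewTicker(time.Second)
		for {
			select {
			case <-pause.C:
				time.Sleep(300 * time.Millisecond) // simulated maintenance
			case <-alerts:
			}
		}
	}()

	// Sample buffer utilization: it spikes toward 100% during each stall.
	for i := 0; i < 10; i++ {
		time.Sleep(250 * time.Millisecond)
		fmt.Printf("alerts channel at %3d%% capacity\n", 100*len(alerts)/cap(alerts))
	}
}
```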
But the root cause of this bottleneck is not the channel capacity itself; the capacity only acts as a buffer.
The root cause is the maintenance done in the Dispatcher:
alertmanager/dispatch/dispatch.go, lines 187 to 201 in f6b942c:

```go
func (d *Dispatcher) doMaintenance() {
	d.mtx.Lock()
	defer d.mtx.Unlock()
	for _, groups := range d.aggrGroupsPerRoute {
		for _, ag := range groups {
			if ag.empty() {
				ag.stop()
				d.marker.DeleteByGroupKey(ag.routeID, ag.GroupKey())
				delete(groups, ag.fingerprint())
				d.aggrGroupsNum--
				d.metrics.aggrGroups.Dec()
			}
		}
	}
}
```
which happens every 30s:
alertmanager/dispatch/dispatch.go, line 149 in f6b942c:

```go
maintenance := time.NewTicker(30 * time.Second)
```
When Alertmanager receives a high alert volume, the Dispatcher will dynamically create a lot of AggregationGroups, depending on the configuration.
But we observe that the dispatcher's alert-processing rate is very spiky, since every 30s it pauses to do maintenance and stops processing alerts.
And the more AGs there are, the longer maintenance takes, since it has to loop over tens of thousands of AGs (AG explosion is described in #4503).
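To make the starvation mechanism concrete, here is a stripped-down, self-contained sketch of the pattern (the dispatcher and group types below are stand-ins, not the upstream code): alert intake and maintenance share one select loop, so the alerts branch is not serviced for as long as the maintenance walk takes, and the walk cost grows linearly with the number of groups:

```go
package main

import (
	"fmt"
	"time"
)

// dispatcher is a stand-in for the real Dispatcher: one goroutine
// multiplexes alert intake and periodic maintenance.
type dispatcher struct {
	alerts chan string
	groups map[string][]string // groupKey -> pending alerts, a stand-in for AGs
}

func (d *dispatcher) run() {
	maintenance := time.NewTicker(30 * time.Second)
	defer maintenance.Stop()
	for {
		select {
		case a := <-d.alerts:
			// Stand-in for processAlert: route the alert to its group.
			d.groups[a] = append(d.groups[a], a)
		case <-maintenance.C:
			// While this branch runs, the alerts branch above is starved.
			// The walk is O(number of groups); with tens of thousands of
			// AGs the pause shows up as a dip in the processing rate.
			start := time.Now()
			for key, pending := range d.groups {
				if len(pending) == 0 { // stand-in for ag.empty()
					delete(d.groups, key)
				}
			}
			fmt.Println("maintenance pause:", time.Since(start))
		}
	}
}

func main() {
	d := &dispatcher{alerts: make(chan string, 200), groups: map[string][]string{}}
	go d.run()
	time.Sleep(65 * time.Second) // long enough for two maintenance ticks
}
```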
We can try to make maintenance happen less often, because, as the graph shows, maintenance does not reduce the number of AGs significantly; so the interval can be made configurable and/or set to the same value as the data maintenance interval (default 15m).
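A minimal sketch of what a configurable interval could look like (the option type, field, and method names below are hypothetical, invented for illustration, not an existing Alertmanager API):

```go
package dispatch

import "time"

// DispatcherOptions is a hypothetical configuration knob for this sketch.
type DispatcherOptions struct {
	// MaintenanceInterval controls how often empty aggregation groups
	// are garbage-collected; zero means "use the default".
	MaintenanceInterval time.Duration
}

func (o *DispatcherOptions) maintenanceInterval() time.Duration {
	if o == nil || o.MaintenanceInterval <= 0 {
		// Proposed default: align with the data maintenance interval
		// (15m) instead of the hard-coded 30s.
		return 15 * time.Minute
	}
	return o.MaintenanceInterval
}
```

The run loop's ticker would then become `maintenance := time.NewTicker(opts.maintenanceInterval())`, and the current behavior would stay reachable by setting the interval back to 30s.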
Ultimately we need a less congested AG management system.