Aggregation Groups result in too many go routines #4503

@siavashs

Aggregation groups each run in a dedicated goroutine. This means that Alertmanager can create thousands of goroutines, depending on the number of unique alerts it receives and the grouping config:

go ag.run(func(ctx context.Context, alerts ...*types.Alert) bool {

These goroutines are usually blocked in a select statement waiting for the next timer tick, which seems very wasteful:

for {
	select {
	case now := <-ag.next.C:

Some graphs from our deployment:

[Two graphs from the deployment; images not preserved]

A better approach could be to create one goroutine per receiver instead, since aggregation groups share receivers. This should reduce the number of goroutines significantly.
Aggregation groups could then rely on a receiver-based trigger to flush notifications.


Aggregation groups also support a global limit:

// If the group does not exist, create it. But check the limit first.
if limit := d.limits.MaxNumberOfAggregationGroups(); limit > 0 && d.aggrGroupsNum >= limit {
	d.metrics.aggrGroupLimitReached.Inc()
	d.logger.Error("Too many aggregation groups, cannot create new group for alert", "groups", d.aggrGroupsNum, "limit", limit, "alert", alert.Name())
	return
}

However, this limit is currently not used, because a nil limiter is passed to the dispatcher:
disp = dispatch.NewDispatcher(alerts, routes, pipeline, marker, timeoutFunc, nil, logger, dispMetrics)

I think that, to protect Alertmanager under load, we should implement a per-receiver limiter.
If alerts sent to a receiver are not acted upon, it can be assumed that Alertmanager can limit the number of aggregation groups created for that receiver.
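
As a starting point for discussion, a per-receiver limiter could be as small as the sketch below. `receiverLimiter`, `acquire` and `release` are hypothetical names that do not exist in Alertmanager. The check mirrors the global one quoted above: a limit of 0 disables it, and a new group is rejected once the count reaches the limit.

```go
package main

import (
	"fmt"
	"sync"
)

// receiverLimiter is a hypothetical type that caps the number of live
// aggregation groups per receiver name.
type receiverLimiter struct {
	mu     sync.Mutex
	limit  int            // maximum groups per receiver; 0 disables the limit
	counts map[string]int // live groups per receiver
}

func newReceiverLimiter(limit int) *receiverLimiter {
	return &receiverLimiter{limit: limit, counts: make(map[string]int)}
}

// acquire reports whether a new group may be created for the receiver and,
// if so, counts it.
func (l *receiverLimiter) acquire(receiver string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.limit > 0 && l.counts[receiver] >= l.limit {
		return false
	}
	l.counts[receiver]++
	return true
}

// release should be called when a group of the receiver is deleted.
func (l *receiverLimiter) release(receiver string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.counts[receiver] > 0 {
		l.counts[receiver]--
	}
}

func main() {
	l := newReceiverLimiter(2)
	for i := 0; i < 3; i++ {
		fmt.Println("slack:", l.acquire("slack"))
	}
	// A different receiver is unaffected by the first one hitting its limit.
	fmt.Println("pagerduty:", l.acquire("pagerduty"))
}
```

With a limit of 2, the third `acquire("slack")` is rejected while `pagerduty` is still accepted, so one noisy receiver cannot use up the budget of the others.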
