
Alertmanager cluster peers have spiky routines #4605


Description

@siavashs

We have observed that the Alertmanager cluster peer at position 0 (which is the first to send notifications) has an almost flat goroutine count.
The other two peers, at the following positions, show much spikier goroutine counts, reaching up to 3K goroutines (+30%).
In the graph below, the yellow instance at position zero has a flat goroutine graph; after the positions are swapped by renaming the peers, the blue instance sits at position zero, and the spiky-goroutine symptom moves between the peers along with the position.

[Graph: per-peer goroutine counts before and after swapping peer positions]

So the fanout stage creates all of these goroutines here:

```go
// Exec attempts to execute all stages concurrently and discards the results.
// It returns its input alerts and a types.MultiError if one or more stages fail.
func (fs FanoutStage) Exec(ctx context.Context, l *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
	var (
		wg sync.WaitGroup
		me types.MultiError
	)
	wg.Add(len(fs))
	for _, s := range fs {
		go func(s Stage) {
			if _, _, err := s.Exec(ctx, l, alerts...); err != nil {
				me.Add(err)
			}
			wg.Done()
		}(s)
	}
	wg.Wait()
	if me.Len() > 0 {
		return ctx, alerts, &me
	}
	return ctx, alerts, nil
}
```
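
While those goroutines sit blocked in the wait stage, they all exist at the same time, which is what shows up as the spiky goroutine count. A minimal standalone sketch of the same create-then-block pattern (not Alertmanager code; the pipeline count and delay are made-up stand-ins for "many pipelines" and "position × peer timeout"):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

func main() {
	// Hypothetical numbers; real values depend on the deployment.
	const pipelines = 1000
	delay := 2 * time.Second

	var wg sync.WaitGroup
	wg.Add(pipelines)
	for i := 0; i < pipelines; i++ {
		go func() {
			defer wg.Done()
			// Each goroutine is created first and only then waits,
			// so all of them exist simultaneously for the full delay.
			time.Sleep(delay)
		}()
	}

	time.Sleep(100 * time.Millisecond)
	fmt.Println("goroutines while waiting:", runtime.NumGoroutine())
	wg.Wait()
}
```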

The wait stage then blocks each of those goroutines for peer timeout × peer position here:

```go
// Exec implements the Stage interface.
func (ws *WaitStage) Exec(ctx context.Context, _ *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
	select {
	case <-time.After(ws.wait()):
	case <-ctx.Done():
		return ctx, nil, ctx.Err()
	}
	return ctx, alerts, nil
}
```
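
The delay itself comes from the `ws.wait()` function, which grows with the peer's position: the peer at position 0 waits 0s, while later peers hold their goroutines for one peer timeout per position. A sketch of how that delay function is wired up (the wiring lives in cmd/alertmanager/main.go; the names below are recalled from the upstream code and should be treated as approximate):

```go
import (
	"time"

	"github.com/prometheus/alertmanager/cluster"
)

// clusterWait returns a function producing the per-peer delay:
// zero for the peer at position 0, one peer timeout per position after that.
func clusterWait(p *cluster.Peer, timeout time.Duration) func() time.Duration {
	return func() time.Duration {
		// position 0 -> 0s, position 1 -> timeout, position 2 -> 2*timeout, ...
		return time.Duration(p.Position()) * timeout
	}
}
```

So a higher position means both a longer block per pipeline and more concurrently blocked goroutines, which matches the spikes appearing only on the non-zero peers.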

This seems very unoptimised and wasteful; we could potentially redesign these stages to wait before creating the goroutines, instead of creating them, having them wait, and then destroying them.
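
One possible shape for that redesign, as a sketch only (`DelayedFanoutStage` is hypothetical, not an existing or proposed Alertmanager type): perform the position-based wait in the single calling goroutine, and only then fan out:

```go
// DelayedFanoutStage folds the position-based wait into the fanout itself,
// so the calling goroutine sleeps once and the per-stage goroutines are only
// created afterwards. Stage, types.Alert and types.MultiError are the
// existing Alertmanager types.
type DelayedFanoutStage struct {
	wait   func() time.Duration
	stages []Stage
}

func (ds DelayedFanoutStage) Exec(ctx context.Context, l *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
	// Wait first, in the one goroutine that is already running.
	select {
	case <-time.After(ds.wait()):
	case <-ctx.Done():
		return ctx, nil, ctx.Err()
	}

	// Only now spawn a goroutine per stage, as FanoutStage does today.
	var (
		wg sync.WaitGroup
		me types.MultiError
	)
	wg.Add(len(ds.stages))
	for _, s := range ds.stages {
		go func(s Stage) {
			defer wg.Done()
			if _, _, err := s.Exec(ctx, l, alerts...); err != nil {
				me.Add(err)
			}
		}(s)
	}
	wg.Wait()

	if me.Len() > 0 {
		return ctx, alerts, &me
	}
	return ctx, alerts, nil
}
```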
