
Alertmanager cluster peers have spiky routines #4605


Description

@siavashs

We have observed that the Alertmanager cluster peer at position 0 (which is the first to send notifications) has an almost flat goroutine count.
The other two peers, at the following positions, show much spikier goroutine counts, reaching up to 3K goroutines (+30%).
In the graph below, the yellow instance at position zero has a flat goroutine graph; after the positions are swapped by renaming the peers, the blue instance sits at position zero, and the spiky-goroutine symptom moves between the peers along with the position.

[Graph: per-peer goroutine counts before and after swapping peer positions]

So the fanout stage creates all of these goroutines here:

```go
// Exec attempts to execute all stages concurrently and discards the results.
// It returns its input alerts and a types.MultiError if one or more stages fail.
func (fs FanoutStage) Exec(ctx context.Context, l *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
	var (
		wg sync.WaitGroup
		me types.MultiError
	)
	wg.Add(len(fs))
	for _, s := range fs {
		go func(s Stage) {
			if _, _, err := s.Exec(ctx, l, alerts...); err != nil {
				me.Add(err)
			}
			wg.Done()
		}(s)
	}
	wg.Wait()
	if me.Len() > 0 {
		return ctx, alerts, &me
	}
	return ctx, alerts, nil
}
```
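
While those goroutines sit blocked in the wait stage, they all exist at the same time, which is what shows up as the spiky goroutine count. A minimal standalone sketch of the same create-then-block pattern (not Alertmanager code; the pipeline count and delay are made-up stand-ins for "many pipelines" and "position × peer timeout"):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

func main() {
	// Hypothetical numbers; real values depend on the deployment.
	const pipelines = 1000
	delay := 2 * time.Second

	var wg sync.WaitGroup
	wg.Add(pipelines)
	for i := 0; i < pipelines; i++ {
		go func() {
			defer wg.Done()
			// Each goroutine is created first and only then waits,
			// so all of them exist simultaneously for the full delay.
			time.Sleep(delay)
		}()
	}

	time.Sleep(100 * time.Millisecond)
	fmt.Println("goroutines while waiting:", runtime.NumGoroutine())
	wg.Wait()
}
```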

The wait stage then blocks each of those goroutines for peer timeout × peer position here:

```go
// Exec implements the Stage interface.
func (ws *WaitStage) Exec(ctx context.Context, _ *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
	select {
	case <-time.After(ws.wait()):
	case <-ctx.Done():
		return ctx, nil, ctx.Err()
	}
	return ctx, alerts, nil
}
```
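
The delay itself comes from the `ws.wait()` function, which grows with the peer's position: the peer at position 0 waits 0s, while later peers hold their goroutines for one peer timeout per position. A sketch of how that delay function is wired up (the wiring lives in cmd/alertmanager/main.go; the names below are recalled from the upstream code and should be treated as approximate):

```go
import (
	"time"

	"github.com/prometheus/alertmanager/cluster"
)

// clusterWait returns a function producing the per-peer delay:
// zero for the peer at position 0, one peer timeout per position after that.
func clusterWait(p *cluster.Peer, timeout time.Duration) func() time.Duration {
	return func() time.Duration {
		// position 0 -> 0s, position 1 -> timeout, position 2 -> 2*timeout, ...
		return time.Duration(p.Position()) * timeout
	}
}
```

So a higher position means both a longer block per pipeline and more concurrently blocked goroutines, which matches the spikes appearing only on the non-zero peers.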

This seems very unoptimised and wasteful; we could potentially redesign these stages to wait before creating the goroutines, instead of creating them, having them wait, and then destroying them.
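
One possible shape for that redesign, as a sketch only (`DelayedFanoutStage` is hypothetical, not an existing or proposed Alertmanager type): perform the position-based wait in the single calling goroutine, and only then fan out:

```go
// DelayedFanoutStage folds the position-based wait into the fanout itself,
// so the calling goroutine sleeps once and the per-stage goroutines are only
// created afterwards. Stage, types.Alert and types.MultiError are the
// existing Alertmanager types.
type DelayedFanoutStage struct {
	wait   func() time.Duration
	stages []Stage
}

func (ds DelayedFanoutStage) Exec(ctx context.Context, l *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
	// Wait first, in the one goroutine that is already running.
	select {
	case <-time.After(ds.wait()):
	case <-ctx.Done():
		return ctx, nil, ctx.Err()
	}

	// Only now spawn a goroutine per stage, as FanoutStage does today.
	var (
		wg sync.WaitGroup
		me types.MultiError
	)
	wg.Add(len(ds.stages))
	for _, s := range ds.stages {
		go func(s Stage) {
			defer wg.Done()
			if _, _, err := s.Exec(ctx, l, alerts...); err != nil {
				me.Add(err)
			}
		}(s)
	}
	wg.Wait()

	if me.Len() > 0 {
		return ctx, alerts, &me
	}
	return ctx, alerts, nil
}
```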
