Description
We have observed that the Alertmanager cluster peer at position 0 (the first to send notifications) has an almost flat goroutines metric, while the other two peers at later positions show spikier goroutine counts of up to 3K (+30%).
In the graph below, the yellow instance at position zero has a flat goroutines graph; once the positions are swapped by renaming the peers, the blue instance takes position zero and the spiky goroutine symptom moves between the peers along with the position.
So the fanout stage creates all of these goroutines here (lines 491 to 514 in edb9a4d):
```go
// Exec attempts to execute all stages concurrently and discards the results.
// It returns its input alerts and a types.MultiError if one or more stages fail.
func (fs FanoutStage) Exec(ctx context.Context, l *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
	var (
		wg sync.WaitGroup
		me types.MultiError
	)
	wg.Add(len(fs))

	for _, s := range fs {
		go func(s Stage) {
			if _, _, err := s.Exec(ctx, l, alerts...); err != nil {
				me.Add(err)
			}
			wg.Done()
		}(s)
	}
	wg.Wait()

	if me.Len() > 0 {
		return ctx, alerts, &me
	}
	return ctx, alerts, nil
}
```
and then the wait stage makes each of those goroutines wait for peer timeout * position (lines 599 to 607 in edb9a4d):
```go
// Exec implements the Stage interface.
func (ws *WaitStage) Exec(ctx context.Context, _ *slog.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
	select {
	case <-time.After(ws.wait()):
	case <-ctx.Done():
		return ctx, nil, ctx.Err()
	}
	return ctx, alerts, nil
}
```
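For reference, the wait duration appears to be derived from the peer's position in the cluster, something like the simplified sketch below (the names and signature are illustrative, not the actual helper):

```go
// Illustrative sketch: peer 0 waits 0, peer 1 waits one peer timeout,
// peer 2 waits two, and so on. This would explain why only the peer at
// position 0 shows a flat goroutine count: its wait returns immediately.
func clusterWait(position func() int, peerTimeout time.Duration) func() time.Duration {
	return func() time.Duration {
		return time.Duration(position()) * peerTimeout
	}
}
```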
This seems very unoptimised and wasteful; we could potentially redesign these stages to wait before creating the goroutines, instead of creating them, letting them wait, and then destroying them.
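A rough sketch of what such a redesign could look like (purely illustrative, not existing Alertmanager code): perform the position-based wait once, in the calling goroutine, and only fan out to the per-integration stages after the wait has elapsed, so no goroutines are held just to sleep:

```go
// Illustrative only: wait first, then fan out. The sleeping happens in the
// single calling goroutine instead of in one goroutine per stage.
func (fs FanoutStage) execAfterWait(ctx context.Context, l *slog.Logger, wait time.Duration, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
	select {
	case <-time.After(wait): // single wait for the whole fanout
	case <-ctx.Done():
		return ctx, nil, ctx.Err()
	}
	// Only now spawn the per-stage goroutines.
	return fs.Exec(ctx, l, alerts...)
}
```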