@siavashs
Contributor

@siavashs siavashs commented Oct 27, 2025

This change significantly reduces the number of sleeping goroutines that are created per aggregation group and wait for a timer tick.

Instead, it uses time.AfterFunc to schedule the next call to flush.

Closes #4503

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
@rajagopalanand
Contributor

Do you have any profiles captured that show the before/after effects of this change?

Member

@SuperQ SuperQ left a comment

Nice

@SuperQ
Member

SuperQ commented Oct 27, 2025

Yes, it would be nice to post a pprof profile and/or metrics to show the results of this change.

@siavashs
Contributor Author

siavashs commented Oct 27, 2025

Here are some metrics. In both cases I ran the same config for Prometheus and Alertmanager, which results in 1500 unique alerts and aggregation groups:

From main:

# HELP alertmanager_dispatcher_aggregation_groups Number of active aggregation groups
# TYPE alertmanager_dispatcher_aggregation_groups gauge
alertmanager_dispatcher_aggregation_groups 1500

# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 1532

From this branch:

# HELP alertmanager_dispatcher_aggregation_groups Number of active aggregation groups
# TYPE alertmanager_dispatcher_aggregation_groups gauge
alertmanager_dispatcher_aggregation_groups 1500

# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 32

Looking at /debug/pprof/goroutine?debug=1:

From main:

goroutine profile: total 1529
1500 @ 0x100e0e160 0x100dec7cc 0x1016c2480 0x100e16a04
#	0x1016c247f	github.com/prometheus/alertmanager/dispatch.(*aggrGroup).run+0x3ff	alertmanager/dispatch/dispatch.go:446
...

On this branch, no dispatch.(*aggrGroup).run goroutines appear in the profile.

(Note that when a flush happens we still see a lot of goroutines, but those are from notify, which we will fix in #4633.)
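
For anyone who wants to reproduce the effect outside Alertmanager, the following standalone sketch compares the two waiting strategies using runtime.NumGoroutine. It is an assumption-laden toy, not a benchmark of the dispatcher: `sleepingGoroutines` and `scheduledCallbacks` are hypothetical helpers, and the one-hour timers exist only to keep everything idle while the count is taken.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// sleepingGoroutines starts n goroutines that each block on their own timer,
// and reports how many goroutines that added to the process.
func sleepingGoroutines(n int) (added int, stop func()) {
	before := runtime.NumGoroutine()
	done := make(chan struct{})
	var started sync.WaitGroup
	started.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			t := time.NewTimer(time.Hour)
			defer t.Stop()
			started.Done()
			select {
			case <-t.C:
			case <-done:
			}
		}()
	}
	started.Wait()
	return runtime.NumGoroutine() - before, func() { close(done) }
}

// scheduledCallbacks registers n pending time.AfterFunc callbacks,
// and reports how many goroutines that added to the process.
func scheduledCallbacks(n int) (added int, stop func()) {
	before := runtime.NumGoroutine()
	timers := make([]*time.Timer, n)
	for i := range timers {
		timers[i] = time.AfterFunc(time.Hour, func() {})
	}
	return runtime.NumGoroutine() - before, func() {
		for _, t := range timers {
			t.Stop()
		}
	}
}

func main() {
	const groups = 1500

	viaAfterFunc, stopTimers := scheduledCallbacks(groups)
	defer stopTimers()
	viaGoroutines, stopGoroutines := sleepingGoroutines(groups)
	defer stopGoroutines()

	fmt.Println("goroutines added by time.AfterFunc timers:", viaAfterFunc)
	fmt.Println("goroutines added by blocked goroutines:   ", viaGoroutines)
}
```

A pending time.AfterFunc timer is tracked by the runtime and only gets a goroutine when it fires, so the first number should stay near zero while the second should track the number of groups.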

Development

Successfully merging this pull request may close these issues.

Aggregation Groups result in too many go routines

3 participants