fix: improve inhibition performance #4607

siavashs · 2025-10-13T15:21:11Z

This change adds a new index per inhibition rule which:

1. extracts the subset of source alert labelset which are in equals
2. calculates the fingerprint of the above
3. maps the calculated fingerprint to the source alert fingerprint
4. performs the same calculation for target alerts
5. uses the above index to find the equal source alerts quickly

This significantly improves the inhibition performance, since there is no need to loop over all source alerts and the equal labels.

The equals index items are garbage collected by callback from scache.
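
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `LabelSet`, `Fingerprint`, `equalIndex` and all of its methods are hypothetical simplified stand-ins, the FNV hash is only a placeholder for the real fingerprint function, and `removeSource` stands in for whatever the `scache` garbage collection callback does.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// LabelSet and Fingerprint are simplified stand-ins for the Prometheus
// model types (an assumption of this sketch).
type LabelSet map[string]string
type Fingerprint uint64

// fingerprint hashes all labels in sorted order so the result is stable.
func fingerprint(ls LabelSet) Fingerprint {
	names := make([]string, 0, len(ls))
	for name := range ls {
		names = append(names, name)
	}
	sort.Strings(names)
	h := fnv.New64a()
	for _, name := range names {
		h.Write([]byte(name))
		h.Write([]byte{0xff}) // separator between name and value
		h.Write([]byte(ls[name]))
		h.Write([]byte{0xff})
	}
	return Fingerprint(h.Sum64())
}

// equalFingerprint fingerprints only the labels listed in equal.
// A label missing from ls is treated as the empty value here.
func equalFingerprint(ls LabelSet, equal []string) Fingerprint {
	sub := make(LabelSet, len(equal))
	for _, name := range equal {
		sub[name] = ls[name]
	}
	return fingerprint(sub)
}

// equalIndex maps an equal-labels fingerprint to a source alert fingerprint.
type equalIndex struct {
	equal   []string
	entries map[Fingerprint]Fingerprint
}

func newEqualIndex(equal []string) *equalIndex {
	return &equalIndex{equal: equal, entries: make(map[Fingerprint]Fingerprint)}
}

// addSource indexes a source alert under its equal-labels fingerprint.
func (ix *equalIndex) addSource(ls LabelSet) {
	ix.entries[equalFingerprint(ls, ix.equal)] = fingerprint(ls)
}

// lookup finds the source alert sharing the target's equal labels
// with a single map access instead of a loop over all source alerts.
func (ix *equalIndex) lookup(target LabelSet) (Fingerprint, bool) {
	fp, ok := ix.entries[equalFingerprint(target, ix.equal)]
	return fp, ok
}

// removeSource drops the entry only if it still points at this alert,
// which is what a garbage collection callback would need to check.
func (ix *equalIndex) removeSource(ls LabelSet) {
	key := equalFingerprint(ls, ix.equal)
	if fp, ok := ix.entries[key]; ok && fp == fingerprint(ls) {
		delete(ix.entries, key)
	}
}

func main() {
	ix := newEqualIndex([]string{"a"})
	ix.addSource(LabelSet{"a": "b", "c": "f"})
	_, found := ix.lookup(LabelSet{"a": "b", "job": "x"})
	fmt.Println(found) // prints true
}
```

The lookup cost no longer depends on the number of source alerts, which is consistent with the flat timings in the `N_inhibiting_alerts` benchmarks below.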

Fixes #4606

Benchmarks

goos: darwin
goarch: arm64
pkg: github.com/prometheus/alertmanager/inhibit
cpu: Apple M3 Pro
                                                      │ benchmark-inhibit-base.txt │     benchmark-inhibit-fixed.txt     │
                                                      │           sec/op           │    sec/op     vs base               │
Mutes/1_inhibition_rule,_1_inhibiting_alert-12                         518.5n ± 1%   485.3n ± 11%   -6.40% (p=0.024 n=7)
Mutes/10_inhibition_rules,_1_inhibiting_alert-12                       520.7n ± 1%   484.7n ±  2%   -6.91% (p=0.001 n=7)
Mutes/100_inhibition_rules,_1_inhibiting_alert-12                      537.4n ± 0%   513.7n ±  9%        ~ (p=0.165 n=7)
Mutes/1000_inhibition_rules,_1_inhibiting_alert-12                     672.6n ± 5%   604.0n ±  3%  -10.20% (p=0.001 n=7)
Mutes/10000_inhibition_rules,_1_inhibiting_alert-12                    742.7n ± 4%   702.6n ±  6%   -5.40% (p=0.001 n=7)
Mutes/1_inhibition_rule,_10_inhibiting_alerts-12                       655.3n ± 1%   547.4n ±  7%  -16.47% (p=0.001 n=7)
Mutes/1_inhibition_rule,_100_inhibiting_alerts-12                     1292.0n ± 2%   558.3n ±  5%  -56.79% (p=0.001 n=7)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts-12                    8585.0n ± 1%   586.4n ±  2%  -93.17% (p=0.001 n=7)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts-12                  68664.0n ± 1%   577.4n ±  2%  -99.16% (p=0.001 n=7)
Mutes/100_inhibition_rules,_1000_inhibiting_alerts-12                 7814.0n ± 2%   550.1n ±  2%  -92.96% (p=0.001 n=7)
Mutes/10_inhibition_rules,_last_rule_matches-12                       1008.0n ± 1%   863.5n ±  4%  -14.34% (p=0.001 n=7)
Mutes/100_inhibition_rules,_last_rule_matches-12                       5.600µ ± 1%   3.856µ ±  7%  -31.14% (p=0.001 n=7)
Mutes/1000_inhibition_rules,_last_rule_matches-12                      52.00µ ± 3%   34.47µ ±  6%  -33.71% (p=0.001 n=7)
Mutes/10000_inhibition_rules,_last_rule_matches-12                     527.9µ ± 3%   338.5µ ±  3%  -35.88% (p=0.001 n=7)
geomean                                                                3.514µ        1.402µ        -60.10%

                                                      │ benchmark-inhibit-base.txt │    benchmark-inhibit-fixed.txt    │
                                                      │            B/op            │    B/op     vs base               │
Mutes/1_inhibition_rule,_1_inhibiting_alert-12                          496.0 ± 0%   488.0 ± 0%   -1.61% (p=0.001 n=7)
Mutes/10_inhibition_rules,_1_inhibiting_alert-12                        496.0 ± 0%   488.0 ± 0%   -1.61% (p=0.001 n=7)
Mutes/100_inhibition_rules,_1_inhibiting_alert-12                       496.0 ± 0%   488.0 ± 0%   -1.61% (p=0.001 n=7)
Mutes/1000_inhibition_rules,_1_inhibiting_alert-12                      496.0 ± 0%   488.0 ± 0%   -1.61% (p=0.001 n=7)
Mutes/10000_inhibition_rules,_1_inhibiting_alert-12                     496.0 ± 0%   488.0 ± 0%   -1.61% (p=0.001 n=7)
Mutes/1_inhibition_rule,_10_inhibiting_alerts-12                        568.0 ± 0%   488.0 ± 0%  -14.08% (p=0.001 n=7)
Mutes/1_inhibition_rule,_100_inhibiting_alerts-12                      1384.0 ± 0%   488.0 ± 0%  -64.74% (p=0.001 n=7)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts-12                     8683.0 ± 0%   488.0 ± 0%  -94.38% (p=0.001 n=7)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts-12                   82424.0 ± 0%   488.0 ± 0%  -99.41% (p=0.001 n=7)
Mutes/100_inhibition_rules,_1000_inhibiting_alerts-12                  8680.0 ± 0%   488.0 ± 0%  -94.38% (p=0.001 n=7)
Mutes/10_inhibition_rules,_last_rule_matches-12                         480.0 ± 0%   472.0 ± 0%   -1.67% (p=0.001 n=7)
Mutes/100_inhibition_rules,_last_rule_matches-12                        480.0 ± 0%   472.0 ± 0%   -1.67% (p=0.001 n=7)
Mutes/1000_inhibition_rules,_last_rule_matches-12                       480.0 ± 0%   472.0 ± 0%   -1.67% (p=0.001 n=7)
Mutes/10000_inhibition_rules,_last_rule_matches-12                      481.0 ± 0%   472.0 ± 0%   -1.87% (p=0.001 n=7)
geomean                                                               1.131Ki        483.4       -58.26%

                                                      │ benchmark-inhibit-base.txt │   benchmark-inhibit-fixed.txt    │
                                                      │         allocs/op          │ allocs/op   vs base              │
Mutes/1_inhibition_rule,_1_inhibiting_alert-12                          11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
Mutes/10_inhibition_rules,_1_inhibiting_alert-12                        11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
Mutes/100_inhibition_rules,_1_inhibiting_alert-12                       11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
Mutes/1000_inhibition_rules,_1_inhibiting_alert-12                      11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
Mutes/10000_inhibition_rules,_1_inhibiting_alert-12                     11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
Mutes/1_inhibition_rule,_10_inhibiting_alerts-12                        11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
Mutes/1_inhibition_rule,_100_inhibiting_alerts-12                       11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts-12                      11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts-12                     11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
Mutes/100_inhibition_rules,_1000_inhibiting_alerts-12                   11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
Mutes/10_inhibition_rules,_last_rule_matches-12                         11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
Mutes/100_inhibition_rules,_last_rule_matches-12                        11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
Mutes/1000_inhibition_rules,_last_rule_matches-12                       11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
Mutes/10000_inhibition_rules,_last_rule_matches-12                      11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
geomean                                                                 11.00        10.00       -9.09%

siavashs · 2025-10-13T16:02:50Z

There is a flaky test which I narrowed to this case:

		{
			name: "matching and unresolved",
			initial: map[model.Fingerprint]*types.Alert{
				1: {
					Alert: model.Alert{
						Labels:   model.LabelSet{"a": "b", "c": "d"},
						StartsAt: now.Add(-time.Minute),
						EndsAt:   now.Add(-time.Second),
					},
				},
				2: {
					Alert: model.Alert{
						Labels:   model.LabelSet{"a": "b", "c": "f"},
						StartsAt: now.Add(-time.Minute),
						EndsAt:   now.Add(time.Hour),
					},
				},
			},
			equal:  model.LabelNames{"a"},
			input:  model.LabelSet{"a": "b"},
			result: true,
		},

The test is flaky because the initial alerts are stored in a map, and Go does not guarantee map iteration order, so the alerts are not inserted in a fixed order and 1 can sometimes overwrite 2.
While this could happen in practice, the test does not replicate the real behaviour of Alertmanager here: only fresh alerts are written into the inhibitor, and they are constantly updated by heartbeats.

The test can be made consistent by simply switching to a slice but I wanted to point out the above here just in case I'm missing something.
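
To illustrate why a slice makes the test consistent, here is a small sketch under simplified assumptions: `alert`, `equalKey` and `buildIndex` are hypothetical names, and the string key stands in for the fingerprint of the equal labels. With a slice, the insertion order is fixed, so the last alert for a given key always wins.

```go
package main

import "fmt"

// alert is a hypothetical, stripped-down source alert.
type alert struct {
	id       int
	equalKey string // stands in for the equal-labels fingerprint
	resolved bool
}

// buildIndex inserts alerts in slice order; a later alert with the
// same key overwrites an earlier one, as in the flaky test.
func buildIndex(alerts []alert) map[string]alert {
	idx := make(map[string]alert)
	for _, a := range alerts {
		idx[a.equalKey] = a
	}
	return idx
}

func main() {
	alerts := []alert{
		{id: 1, equalKey: "a=b", resolved: true},
		{id: 2, equalKey: "a=b", resolved: false},
	}
	idx := buildIndex(alerts)
	fmt.Println(idx["a=b"].id) // prints 2 on every run
}
```

If the same two alerts were ranged over from a map instead, either one could be inserted last, which matches the intermittent failure described above.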

siavashs · 2025-10-13T16:58:33Z

Here is some data from our internal custom metrics in Alertmanager Inhibitor.

As you can see, the inhibitor works much faster now in all cases:

target alert matching an inhibit rule and getting muted, best case, depends on inhibit rule index and count of source alerts
target alert matching an inhibit rule but not getting muted, worst case, depends on number of inhibit rules and count of source alerts
all alerts not muted (matched or not matched), worse and worst cases, depends on number of inhibit rules and count of source alerts

Custom metrics need some fixes, and can be included in this PR or a separate one.

siavashs · 2025-10-15T08:26:04Z

One idea to fix both the flaky test and the underlying logic is to always compare a new source alert's `EndsAt` with that of the existing index entry, and keep whichever is going to be active for longer.
It is an edge case, but it ensures better behaviour from the inhibitor if a user has multiple Prometheus instances with mismatching alerting configuration.

This change adds a new index per inhibition rule which:

1. extracts the subset of source alert labelset which are in equals
2. calculates the fingerprint of the above
3. maps the calculated fingerprint to the source alert fingerprint
4. performs the same calculation for target alerts
5. uses the above index to find the equal source alerts quickly

This significantly improves the inhibition performance, since there is no need to loop over all source alerts and the equal labels.

The equals index items are garbage collected by callback from `scache`.

Signed-off-by: Siavash Safi <siavash@cloudflare.com>

siavashs · 2025-10-15T09:03:12Z

Fixed the flaky test in the last change.

SuperQ

LGTM, thanks for fixing the flaky test.

siavashs force-pushed the fix/improve-inhibition-performance branch 3 times, most recently from a91f2cf to f2a5647, October 13, 2025 15:59

siavashs force-pushed the fix/improve-inhibition-performance branch from f2a5647 to e60f4f0, October 15, 2025 08:55

SuperQ requested a review from gotjosh October 17, 2025 15:45

SuperQ approved these changes Oct 17, 2025

SuperQ merged commit 8479a85 into prometheus:main Oct 17, 2025
11 checks passed

siavashs deleted the fix/improve-inhibition-performance branch October 18, 2025 09:25

This was referenced Oct 19, 2025

fix(dispatch): reduce locking contention #4552

Draft

Signficantly improve inhibitor performance via new cache datastructure #4134

Open
