@siavashs
Contributor

@siavashs siavashs commented Oct 13, 2025

This change adds a new index per inhibition rule, which:

  1. extracts the subset of the source alert's label set that is listed in equals
  2. calculates the fingerprint of that subset
  3. maps the calculated fingerprint to the source alert's fingerprint
  4. performs the same calculation for target alerts
  5. uses the index to quickly find the source alerts with equal labels

This significantly improves inhibition performance, since there is no longer any need to loop over all source alerts and compare their equal labels.

The equals index entries are garbage collected via a callback from scache.
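
A minimal sketch of the five steps above, using only the standard library. The names `LabelSet`, `equalFingerprint`, `equalIndex`, `addSource`, `removeSource` and `lookup` are hypothetical and chosen for illustration; they are not the identifiers used in this PR, and the FNV hash is a stand-in for Alertmanager's own fingerprinting.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// LabelSet is a simplified stand-in for model.LabelSet (assumption).
type LabelSet map[string]string

// equalFingerprint hashes only the labels named in equal, in sorted order.
// Hypothetical helper, not Alertmanager's fingerprint function.
func equalFingerprint(ls LabelSet, equal []string) uint64 {
	names := append([]string(nil), equal...)
	sort.Strings(names)
	h := fnv.New64a()
	for _, n := range names {
		h.Write([]byte(n))
		h.Write([]byte{0xff}) // separator between name and value
		h.Write([]byte(ls[n])) // a missing label hashes like an empty value
		h.Write([]byte{0xff})
	}
	return h.Sum64()
}

// equalIndex maps an equal-labels fingerprint to one source alert fingerprint.
type equalIndex struct {
	equal   []string
	entries map[uint64]uint64
}

func newEqualIndex(equal []string) *equalIndex {
	return &equalIndex{equal: equal, entries: make(map[uint64]uint64)}
}

// addSource indexes a source alert under the fingerprint of its equal labels.
func (i *equalIndex) addSource(alertFP uint64, labels LabelSet) {
	i.entries[equalFingerprint(labels, i.equal)] = alertFP
}

// removeSource drops the entry if it still points at alertFP; this is the
// kind of work a garbage-collection callback could do.
func (i *equalIndex) removeSource(alertFP uint64, labels LabelSet) {
	key := equalFingerprint(labels, i.equal)
	if fp, ok := i.entries[key]; ok && fp == alertFP {
		delete(i.entries, key)
	}
}

// lookup finds the source alert whose equal labels match the target's.
func (i *equalIndex) lookup(target LabelSet) (uint64, bool) {
	fp, ok := i.entries[equalFingerprint(target, i.equal)]
	return fp, ok
}

func main() {
	idx := newEqualIndex([]string{"a"})
	idx.addSource(2, LabelSet{"a": "b", "c": "f"})
	fp, ok := idx.lookup(LabelSet{"a": "b"})
	fmt.Println(fp, ok)
}
```

The lookup costs one hash and one map access per rule regardless of how many source alerts exist, which is consistent with the flat timings in the benchmarks for growing numbers of inhibiting alerts.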

Fixes #4606

Benchmarks

goos: darwin
goarch: arm64
pkg: github.com/prometheus/alertmanager/inhibit
cpu: Apple M3 Pro
                                                      │ benchmark-inhibit-base.txt │     benchmark-inhibit-fixed.txt      │
                                                      │           sec/op           │    sec/op     vs base                │
Mutes/1_inhibition_rule,_1_inhibiting_alert-12                         518.5n ± 1%   485.3n ± 11%   -6.40% (p=0.024 n=7)
Mutes/10_inhibition_rules,_1_inhibiting_alert-12                       520.7n ± 1%   484.7n ±  2%   -6.91% (p=0.001 n=7)
Mutes/100_inhibition_rules,_1_inhibiting_alert-12                      537.4n ± 0%   513.7n ±  9%        ~ (p=0.165 n=7)
Mutes/1000_inhibition_rules,_1_inhibiting_alert-12                     672.6n ± 5%   604.0n ±  3%  -10.20% (p=0.001 n=7)
Mutes/10000_inhibition_rules,_1_inhibiting_alert-12                    742.7n ± 4%   702.6n ±  6%   -5.40% (p=0.001 n=7)
Mutes/1_inhibition_rule,_10_inhibiting_alerts-12                       655.3n ± 1%   547.4n ±  7%  -16.47% (p=0.001 n=7)
Mutes/1_inhibition_rule,_100_inhibiting_alerts-12                     1292.0n ± 2%   558.3n ±  5%  -56.79% (p=0.001 n=7)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts-12                    8585.0n ± 1%   586.4n ±  2%  -93.17% (p=0.001 n=7)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts-12                  68664.0n ± 1%   577.4n ±  2%  -99.16% (p=0.001 n=7)
Mutes/100_inhibition_rules,_1000_inhibiting_alerts-12                 7814.0n ± 2%   550.1n ±  2%  -92.96% (p=0.001 n=7)
Mutes/10_inhibition_rules,_last_rule_matches-12                       1008.0n ± 1%   863.5n ±  4%  -14.34% (p=0.001 n=7)
Mutes/100_inhibition_rules,_last_rule_matches-12                       5.600µ ± 1%   3.856µ ±  7%  -31.14% (p=0.001 n=7)
Mutes/1000_inhibition_rules,_last_rule_matches-12                      52.00µ ± 3%   34.47µ ±  6%  -33.71% (p=0.001 n=7)
Mutes/10000_inhibition_rules,_last_rule_matches-12                     527.9µ ± 3%   338.5µ ±  3%  -35.88% (p=0.001 n=7)
geomean                                                                3.514µ        1.402µ        -60.10%

                                                      │ benchmark-inhibit-base.txt │     benchmark-inhibit-fixed.txt     │
                                                      │            B/op            │    B/op     vs base                 │
Mutes/1_inhibition_rule,_1_inhibiting_alert-12                          496.0 ± 0%   488.0 ± 0%   -1.61% (p=0.001 n=7)
Mutes/10_inhibition_rules,_1_inhibiting_alert-12                        496.0 ± 0%   488.0 ± 0%   -1.61% (p=0.001 n=7)
Mutes/100_inhibition_rules,_1_inhibiting_alert-12                       496.0 ± 0%   488.0 ± 0%   -1.61% (p=0.001 n=7)
Mutes/1000_inhibition_rules,_1_inhibiting_alert-12                      496.0 ± 0%   488.0 ± 0%   -1.61% (p=0.001 n=7)
Mutes/10000_inhibition_rules,_1_inhibiting_alert-12                     496.0 ± 0%   488.0 ± 0%   -1.61% (p=0.001 n=7)
Mutes/1_inhibition_rule,_10_inhibiting_alerts-12                        568.0 ± 0%   488.0 ± 0%  -14.08% (p=0.001 n=7)
Mutes/1_inhibition_rule,_100_inhibiting_alerts-12                      1384.0 ± 0%   488.0 ± 0%  -64.74% (p=0.001 n=7)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts-12                     8683.0 ± 0%   488.0 ± 0%  -94.38% (p=0.001 n=7)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts-12                   82424.0 ± 0%   488.0 ± 0%  -99.41% (p=0.001 n=7)
Mutes/100_inhibition_rules,_1000_inhibiting_alerts-12                  8680.0 ± 0%   488.0 ± 0%  -94.38% (p=0.001 n=7)
Mutes/10_inhibition_rules,_last_rule_matches-12                         480.0 ± 0%   472.0 ± 0%   -1.67% (p=0.001 n=7)
Mutes/100_inhibition_rules,_last_rule_matches-12                        480.0 ± 0%   472.0 ± 0%   -1.67% (p=0.001 n=7)
Mutes/1000_inhibition_rules,_last_rule_matches-12                       480.0 ± 0%   472.0 ± 0%   -1.67% (p=0.001 n=7)
Mutes/10000_inhibition_rules,_last_rule_matches-12                      481.0 ± 0%   472.0 ± 0%   -1.87% (p=0.001 n=7)
geomean                                                               1.131Ki        483.4       -58.26%

                                                      │ benchmark-inhibit-base.txt │    benchmark-inhibit-fixed.txt     │
                                                      │         allocs/op          │ allocs/op   vs base                │
Mutes/1_inhibition_rule,_1_inhibiting_alert-12                          11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
Mutes/10_inhibition_rules,_1_inhibiting_alert-12                        11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
Mutes/100_inhibition_rules,_1_inhibiting_alert-12                       11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
Mutes/1000_inhibition_rules,_1_inhibiting_alert-12                      11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
Mutes/10000_inhibition_rules,_1_inhibiting_alert-12                     11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
Mutes/1_inhibition_rule,_10_inhibiting_alerts-12                        11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
Mutes/1_inhibition_rule,_100_inhibiting_alerts-12                       11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts-12                      11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts-12                     11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
Mutes/100_inhibition_rules,_1000_inhibiting_alerts-12                   11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
Mutes/10_inhibition_rules,_last_rule_matches-12                         11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
Mutes/100_inhibition_rules,_last_rule_matches-12                        11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
Mutes/1000_inhibition_rules,_last_rule_matches-12                       11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
Mutes/10000_inhibition_rules,_last_rule_matches-12                      11.00 ± 0%   10.00 ± 0%  -9.09% (p=0.001 n=7)
geomean                                                                 11.00        10.00       -9.09%

@siavashs siavashs force-pushed the fix/improve-inhibition-performance branch 3 times, most recently from a91f2cf to f2a5647 Compare October 13, 2025 15:59
@siavashs
Contributor Author

siavashs commented Oct 13, 2025

There is a flaky test which I narrowed to this case:

		{
			name: "matching and unresolved",
			initial: map[model.Fingerprint]*types.Alert{
				1: {
					Alert: model.Alert{
						Labels:   model.LabelSet{"a": "b", "c": "d"},
						StartsAt: now.Add(-time.Minute),
						EndsAt:   now.Add(-time.Second),
					},
				},
				2: {
					Alert: model.Alert{
						Labels:   model.LabelSet{"a": "b", "c": "f"},
						StartsAt: now.Add(-time.Minute),
						EndsAt:   now.Add(time.Hour),
					},
				},
			},
			equal:  model.LabelNames{"a"},
			input:  model.LabelSet{"a": "b"},
			result: true,
		},

The test is flaky because the initial alerts are stored in a map and Go does not guarantee map iteration order, so alert 1 (already resolved) is sometimes inserted after alert 2 (still active) and overwrites its index entry.
While this could happen in practice, the test does not replicate the actual behaviour of Alertmanager, where only fresh alerts are written into the inhibitor and are constantly updated by heartbeats.

The test can be made consistent by simply switching to a slice, but I wanted to point this out in case I'm missing something.
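
To illustrate the ordering problem, here is a small sketch under simplifying assumptions: `alert`, `lastWriteWins`, `active` and `fixture` are hypothetical names, and the single-slot model only approximates an index in which both alerts share the same equal-labels key.

```go
package main

import (
	"fmt"
	"time"
)

// alert is a simplified stand-in for types.Alert (assumption).
type alert struct {
	fp     uint64
	endsAt time.Time
}

// lastWriteWins models an index slot shared by alerts with the same equal
// labels: each insert replaces the previous occupant.
func lastWriteWins(inserts []alert) alert {
	var slot alert
	for _, a := range inserts {
		slot = a
	}
	return slot
}

// active reports whether the alert is still firing at time now.
func active(a alert, now time.Time) bool { return a.endsAt.After(now) }

// fixture mirrors the two alerts from the test case above.
func fixture() (now time.Time, resolved, firing alert) {
	now = time.Now()
	resolved = alert{fp: 1, endsAt: now.Add(-time.Second)}
	firing = alert{fp: 2, endsAt: now.Add(time.Hour)}
	return now, resolved, firing
}

func main() {
	now, resolved, firing := fixture()

	// Ranging over a map may visit 1 then 2, or 2 then 1.
	fromMap := map[uint64]alert{1: resolved, 2: firing}
	var order []alert
	for _, a := range fromMap {
		order = append(order, a)
	}
	fmt.Println("map order, inhibits:", active(lastWriteWins(order), now))

	// A slice fixes the insertion order, so the outcome is stable.
	fixed := []alert{resolved, firing}
	fmt.Println("slice order, inhibits:", active(lastWriteWins(fixed), now))
}
```

The first line printed can differ between runs, while the second always reports that the target is inhibited.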

@siavashs
Contributor Author

siavashs commented Oct 13, 2025

Here is some data from our internal custom metrics in the Alertmanager inhibitor.
[image]
As you can see, the inhibitor is now much faster in all cases:

  • target alert matching an inhibit rule and getting muted: best case, depends on the inhibit rule index and the count of source alerts
  • target alert matching an inhibit rule but not getting muted: worst case, depends on the number of inhibit rules and the count of source alerts
  • all alerts not muted (matched or not matched): worse and worst cases, depends on the number of inhibit rules and the count of source alerts

The custom metrics need some fixes, and can be included in this PR or a separate one.

@siavashs
Contributor Author

One idea to fix both the flaky test and the logic would be to always compare a new source alert's EndsAt with that of the existing index entry, and keep whichever alert is going to be active for longer.
It is an edge case, but it ensures better behaviour by the inhibitor if a user has multiple Prometheus instances with mismatching alerting configuration.
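
A rough sketch of that idea, assuming one slot per equal-labels key. `alert`, `slotIndex`, `set` and `fixture` are hypothetical names, and this is not necessarily how the final change implements it.

```go
package main

import (
	"fmt"
	"time"
)

// alert is a simplified stand-in for types.Alert (assumption).
type alert struct {
	fp     uint64
	endsAt time.Time
}

// slotIndex keeps one source alert per equal-labels key (hypothetical type).
type slotIndex map[uint64]alert

// set stores a unless the existing entry ends later, so the alert that
// stays active longer wins regardless of insertion order.
func (s slotIndex) set(key uint64, a alert) {
	if cur, ok := s[key]; ok && cur.endsAt.After(a.endsAt) {
		return
	}
	s[key] = a
}

// fixture returns one resolved and one still-firing alert.
func fixture() (resolved, firing alert) {
	now := time.Now()
	resolved = alert{fp: 1, endsAt: now.Add(-time.Second)}
	firing = alert{fp: 2, endsAt: now.Add(time.Hour)}
	return resolved, firing
}

func main() {
	resolved, firing := fixture()
	const key = 42 // stands in for the equal-labels fingerprint

	a, b := slotIndex{}, slotIndex{}
	a.set(key, resolved)
	a.set(key, firing)
	b.set(key, firing)
	b.set(key, resolved)

	// Both insertion orders end up holding alert 2.
	fmt.Println(a[key].fp, b[key].fp)
}
```

One caveat of this simplified rule: an update that shortens the EndsAt of the alert already in the slot would be ignored, so a real implementation would likely need to treat updates to the same alert separately.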

This change adds a new index per inhibition rule which:
1. extracts the subset of source alert labelset which are in equals
2. calculates the fingerprint of the above
3. maps the calculated fingerprint to the source alert fingerprint
4. performs the same calculation for target alerts
5. uses the above index to find the equal source alerts quickly

This significantly improves the inhibition performance, since there
is no need to loop over all source alerts and the equal labels.

The equals index items are garbage collected by callback from `scache`.

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
@siavashs siavashs force-pushed the fix/improve-inhibition-performance branch from f2a5647 to e60f4f0 Compare October 15, 2025 08:55
@siavashs
Contributor Author

Fixed the flaky test in the last change.

@SuperQ SuperQ requested a review from gotjosh October 17, 2025 15:45
Member

@SuperQ SuperQ left a comment

LGTM, thanks for fixing the flaky test.

@SuperQ SuperQ merged commit 8479a85 into prometheus:main Oct 17, 2025
11 checks passed
@siavashs siavashs deleted the fix/improve-inhibition-performance branch October 18, 2025 09:25
Development

Successfully merging this pull request may close these issues.

Slow inhibition

2 participants