[PERF] Inhibitor: Add inverted index for O(k) rule lookup instead of O(N) linear scan.#5021
[PERF] Inhibitor: Add inverted index for O(k) rule lookup instead of O(N) linear scan.#5021Mwea wants to merge 4 commits intoprometheus:mainfrom
Conversation
Add BenchmarkMutesScaling with three cases to measure how Mutes() performance scales with rule count: - different_targets: Each rule has unique target matcher, only one matches the alert (best case for selective lookup) - same_target: All rules have same target matcher, all must be checked - no_match: Alert matches no rule's target, all must be checked These benchmarks establish baseline performance for potential optimizations to the inhibition rule matching logic. Signed-off-by: Titouan Chary <titouan.chary@aiven.io>
Extract the core inhibition checking logic into a separate checkInhibit method. This separates concerns: - Mutes(): handles tracing span lifecycle and marker updates - checkInhibit(): contains the rule iteration logic This refactoring prepares for future optimizations to the rule matching logic without changing the public API or tracing behavior. No functional changes. Signed-off-by: Titouan Chary <titouan.chary@aiven.io>
Add ruleIndex to Inhibitor for O(k) rule lookup instead of O(N) linear scan, where k = number of labels and N = number of inhibit rules. When the index IS effective (O(1) lookup): - Rules have different equality target matchers (e.g., cluster=X) - Alert labels allow direct lookup into the index - Example: 1000 rules each targeting different clusters, checking an alert for cluster=999 → only examines 1 rule instead of 1000 When the index is NOT effective (falls back to O(N) scan): - All rules share the same target matchers (e.g., all target dst=0) - Rules use regex or not-equal matchers (cannot be indexed) - High-overlap matchers excluded from indexing (>50% of rules) Implementation details: - Index rules by exact match target matchers at construction time - Use callback pattern (forEachCandidate) to avoid slice allocation - Pool visited map to reduce GC pressure - Skip deduplication for single-matcher rules Benchmark results (BenchmarkMutesScaling, 1000 rules): different_targets: 32µs → 2.1µs (15x faster, index effective) no_match: 15µs → 1.0µs (15x faster, index effective) same_target: 218µs → 209µs (no change, index not effective) Signed-off-by: Titouan Chary <titouan.chary@aiven.io>
Replace hardcoded constants with RuleIndexOptions struct to allow
testing different threshold values.
Benchmark results for MinRulesForIndex (ns/op):
rules | linear | indexed
1 | 17 | 17
2 | 29 | 85
5 | 68 | 84
10 | 135 | 94
Crossover at ~7 rules. Default of 2 enables indexing early since
high-overlap detection handles pathological cases.
Benchmark results for MaxMatcherOverlapRatio (ns/op):
ratio | time
0.10 | 183
0.50 | 186
0.60 | 552
1.00 | 571
Clear cliff between 0.5 and 0.6 with 3x degradation. Default of 0.5
is optimal - highest value before performance degrades.
Signed-off-by: Titouan Chary <titouan.chary@aiven.io>
ultrotter
left a comment
There was a problem hiding this comment.
Approved, with some minor comments/nits
| ih := NewInhibitor(s, rules, m, promslog.NewNopLogger()) | ||
| defer ih.Stop() | ||
| go ih.Run() | ||
| <-time.After(time.Second) |
There was a problem hiding this comment.
Should we do ih.WaitForLoading() here instead of effectively a sleep?
| } | ||
|
|
||
| // RuleIndexOptions configures the rule index behavior. | ||
| type RuleIndexOptions struct { |
There was a problem hiding this comment.
For this and the one below DefaultRuleIndexOptions, should we consider private as well? Or do we have a reason to export?
| } | ||
|
|
||
| func TestForEachCandidate_EarlyTermination(t *testing.T) { | ||
| rules := []*InhibitRule{ |
There was a problem hiding this comment.
Should this test have more than 2 rules to also test the early termination via index? Or is it not necessary?
Thanks ! I will tackle this tomorrow. |
We already have some caches in the silencer to improve performace, plus you have the situation that the rules would not be static based on the config, right? This would mean extra locking to make it work which may or may not create an issue? Let's focus on the current patches, then we'll see what we can do if we see that you have scalability issues there! :) |
Pull Request Checklist
Please check all the applicable boxes.
project
[PERF] Inhibitor: Add inverted index for O(k) rule lookup instead of O(N) linear scan.
This PR aAdd an inverted index for inhibit rule target matcher lookup to achieve O(k) rule selection instead of O(N) linear scan, where k = number of labels on the alert and N = number of inhibit rules.
Mutes()to extract core checking logic intocheckInhibit()ruleIndexwith configurable thresholds for index constructionMotivation
When alertmanager has many inhibit rules (e.g., hundreds or thousands), the current implementation checks every rule for every alert, resulting in O(N) complexity. In environments with rules targeting different label values (e.g., per-cluster or per-service rules), most of this work is wasted.
Benchmark Results
Summary
N = number of rules, k = number of labels on alert
Details
When the index is effective (O(1) lookup):
When the index falls back to O(N) scan: