[PERF] Inhibitor: Add inverted index for O(k) rule lookup instead of O(N) linear scan. #5021

Open
Mwea wants to merge 4 commits into prometheus:main from Mwea:perf/invertex-index-inhibit

Mwea commented Feb 20, 2026

Pull Request Checklist

Please check all the applicable boxes.

  • Please list all open issue(s) discussed with maintainers related to this change
    • N/A
  • Is this a new Receiver integration?
    • N/A
  • Is this a bugfix?
    • N/A
  • Is this a new feature?
    • I have added tests that test the new feature's functionality
  • Does this change affect performance?
    • I have provided benchmarks comparison that shows performance is improved or is not degraded
    • I have added new benchmarks if required or requested by maintainers
  • Is this a breaking change?
    • My changes do not break the existing cluster messages
    • My changes do not break the existing api
  • I have added/updated the required documentation
  • I have signed-off my commits
  • I will follow best practices for contributing to this project

[PERF] Inhibitor: Add inverted index for O(k) rule lookup instead of O(N) linear scan.

This PR adds an inverted index over inhibit rule target matchers, so that rule selection takes O(k) instead of an O(N) linear scan, where k = number of labels on the alert and N = number of inhibit rules.

  • Add benchmarks to measure scaling behavior with different rule distributions
  • Refactor Mutes() to extract core checking logic into checkInhibit()
  • Implement ruleIndex with configurable thresholds for index construction
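
To make the idea concrete, here is a minimal sketch of an inverted index over equality target matchers. It is not the PR's code: the `rule`, `labelValue`, and `ruleIndex` types and the `candidates` method are hypothetical stand-ins, and the sketch ignores regex and not-equal matchers, which would still need a linear scan. The returned rules are only candidates; each one would still need the full matcher check.

```go
package main

import "fmt"

// rule is a hypothetical stand-in for an inhibit rule. It keeps only the
// equality target matchers (label name -> required value).
type rule struct {
	name   string
	target map[string]string
}

// labelValue is the index key: one label name/value pair.
type labelValue struct{ name, value string }

// ruleIndex maps each equality target matcher to the rules that use it.
type ruleIndex struct {
	byMatcher map[labelValue][]*rule
}

// newRuleIndex builds the index once, at construction time.
func newRuleIndex(rules []*rule) *ruleIndex {
	idx := &ruleIndex{byMatcher: make(map[labelValue][]*rule)}
	for _, r := range rules {
		for n, v := range r.target {
			k := labelValue{n, v}
			idx.byMatcher[k] = append(idx.byMatcher[k], r)
		}
	}
	return idx
}

// candidates does one map lookup per alert label (k lookups in total) and
// returns the deduplicated rules that share at least one target matcher
// with the alert.
func (idx *ruleIndex) candidates(labels map[string]string) []*rule {
	seen := make(map[*rule]struct{})
	var out []*rule
	for n, v := range labels {
		for _, r := range idx.byMatcher[labelValue{n, v}] {
			if _, dup := seen[r]; !dup {
				seen[r] = struct{}{}
				out = append(out, r)
			}
		}
	}
	return out
}

func main() {
	rules := make([]*rule, 0, 1000)
	for i := 0; i < 1000; i++ {
		rules = append(rules, &rule{
			name:   fmt.Sprintf("rule-%d", i),
			target: map[string]string{"cluster": fmt.Sprintf("%d", i)},
		})
	}
	idx := newRuleIndex(rules)
	got := idx.candidates(map[string]string{"cluster": "999", "severity": "warning"})
	fmt.Println(len(got), got[0].name) // prints "1 rule-999"
}
```

With 1000 rules that each target a different cluster, the lookup for an alert with two labels touches two map keys and yields a single candidate rule.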

Motivation

When alertmanager has many inhibit rules (e.g., hundreds or thousands), the current implementation checks every rule for every alert, resulting in O(N) complexity. In environments with rules targeting different label values (e.g., per-cluster or per-service rules), most of this work is wasted.

Benchmark Results

Summary

| Scenario | Before | After | Improvement | Complexity (before) | Complexity (after) |
|---|---|---|---|---|---|
| different_targets | 28.2µs | 1.6µs | 18x faster | O(N) | O(k) |
| no_match | 11.5µs | 0.8µs | 15x faster | O(N) | O(k) |
| same_target | 143µs | 136µs | ~same | O(N) | O(N) fallback |

N = number of rules, k = number of labels on alert

Details

| Benchmark | Baseline (ns/op) | HEAD (ns/op) | Delta |
|---|---|---|---|
| BenchmarkMutesScaling/different_targets/rules=10 | 1708 | 1524 | -10.8% |
| BenchmarkMutesScaling/different_targets/rules=100 | 4023 | 1523 | -62.1% |
| BenchmarkMutesScaling/different_targets/rules=1000 | 28231 | 1565 | -94.5% |
| BenchmarkMutesScaling/same_target/rules=10 | 2439 | 2396 | -1.8% |
| BenchmarkMutesScaling/same_target/rules=100 | 14970 | 14380 | -3.9% |
| BenchmarkMutesScaling/same_target/rules=1000 | 143224 | 135880 | -5.1% |
| BenchmarkMutesScaling/no_match/rules=10 | 760 | 744 | -2.1% |
| BenchmarkMutesScaling/no_match/rules=100 | 1756 | 739 | -57.9% |
| BenchmarkMutesScaling/no_match/rules=1000 | 11492 | 778 | -93.2% |

When the index is effective (one O(1) map lookup per alert label, O(k) overall):

  • Rules have different equality target matchers (e.g., cluster=X)
  • Alert labels allow direct lookup into the index

When the index falls back to O(N) scan:

  • All rules share the same target matcher (high overlap)
  • Rules use regex matchers only

Mwea added 4 commits February 20, 2026 13:10
Add BenchmarkMutesScaling with three cases to measure how Mutes()
performance scales with rule count:

- different_targets: Each rule has unique target matcher, only one
  matches the alert (best case for selective lookup)
- same_target: All rules have same target matcher, all must be checked
- no_match: Alert matches no rule's target, all must be checked

These benchmarks establish baseline performance for potential
optimizations to the inhibition rule matching logic.

Signed-off-by: Titouan Chary <titouan.chary@aiven.io>

Extract the core inhibition checking logic into a separate checkInhibit
method. This separates concerns:

- Mutes(): handles tracing span lifecycle and marker updates
- checkInhibit(): contains the rule iteration logic

This refactoring prepares for future optimizations to the rule matching
logic without changing the public API or tracing behavior.

No functional changes.

Signed-off-by: Titouan Chary <titouan.chary@aiven.io>

Add ruleIndex to Inhibitor for O(k) rule lookup instead of O(N) linear
scan, where k = number of labels and N = number of inhibit rules.

When the index IS effective (O(1) lookup):
- Rules have different equality target matchers (e.g., cluster=X)
- Alert labels allow direct lookup into the index
- Example: 1000 rules each targeting different clusters, checking
  an alert for cluster=999 → only examines 1 rule instead of 1000

When the index is NOT effective (falls back to O(N) scan):
- All rules share the same target matchers (e.g., all target dst=0)
- Rules use regex or not-equal matchers (cannot be indexed)
- High-overlap matchers excluded from indexing (>50% of rules)

Implementation details:
- Index rules by exact match target matchers at construction time
- Use callback pattern (forEachCandidate) to avoid slice allocation
- Pool visited map to reduce GC pressure
- Skip deduplication for single-matcher rules

Benchmark results (BenchmarkMutesScaling, 1000 rules):

  different_targets: 32µs → 2.1µs  (15x faster, index effective)
  no_match:          15µs → 1.0µs  (15x faster, index effective)
  same_target:      218µs → 209µs  (no change, index not effective)

Signed-off-by: Titouan Chary <titouan.chary@aiven.io>

Replace hardcoded constants with RuleIndexOptions struct to allow
testing different threshold values.

Benchmark results for MinRulesForIndex (ns/op):

  rules | linear | indexed
     1  |     17 |      17
     2  |     29 |      85
     5  |     68 |      84
    10  |    135 |      94

Crossover at ~7 rules. Default of 2 enables indexing early since
high-overlap detection handles pathological cases.

Benchmark results for MaxMatcherOverlapRatio (ns/op):

  ratio | time
   0.10 |  183
   0.50 |  186
   0.60 |  552
   1.00 |  571

Clear cliff between 0.5 and 0.6 with 3x degradation. Default of 0.5
is optimal - highest value before performance degrades.

Signed-off-by: Titouan Chary <titouan.chary@aiven.io>
Mwea marked this pull request as ready for review February 20, 2026 16:25
Contributor

@ultrotter ultrotter left a comment

Approved, with some minor comments/nits

ih := NewInhibitor(s, rules, m, promslog.NewNopLogger())
defer ih.Stop()
go ih.Run()
<-time.After(time.Second)

Should we do ih.WaitForLoading() here instead of effectively a sleep?
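
The suggestion amounts to blocking on a readiness signal instead of a fixed delay. A minimal sketch of that pattern follows, using a hypothetical `loader` type rather than Alertmanager's Inhibitor; whether `ih.WaitForLoading()` has this shape is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// loader is a hypothetical component that needs an initial load before it
// can be queried. It is not Alertmanager's Inhibitor.
type loader struct {
	loaded chan struct{}
}

func newLoader() *loader { return &loader{loaded: make(chan struct{})} }

// Run performs the (simulated) initial load, then signals readiness by
// closing the channel.
func (l *loader) Run() {
	time.Sleep(10 * time.Millisecond) // stand-in for real loading work
	close(l.loaded)
}

// WaitForLoading blocks until Run has finished the initial load.
func (l *loader) WaitForLoading() { <-l.loaded }

func main() {
	l := newLoader()
	go l.Run()
	start := time.Now()
	l.WaitForLoading() // returns as soon as loading is done, no fixed sleep
	fmt.Println("ready after", time.Since(start).Round(time.Millisecond))
}
```

A test or benchmark written this way waits exactly as long as loading takes, which avoids both flakiness on slow machines and a wasted second on fast ones.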

}

// RuleIndexOptions configures the rule index behavior.
type RuleIndexOptions struct {

For this and the one below DefaultRuleIndexOptions, should we consider private as well? Or do we have a reason to export?

}

func TestForEachCandidate_EarlyTermination(t *testing.T) {
rules := []*InhibitRule{

Should this test have more than 2 rules to also test the early termination via index? Or is it not necessary?
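
For illustration, early termination over more than two candidates could be checked along these lines. `visitCandidates` is a hypothetical stand-in for the PR's `forEachCandidate`, whose real signature is not shown in this thread; the sketch assumes the callback returns false to stop iteration.

```go
package main

import "fmt"

// visitCandidates calls fn for each candidate in order and stops as soon as
// fn returns false. It returns how many candidates were visited.
// Hypothetical stand-in for forEachCandidate.
func visitCandidates(candidates []string, fn func(string) bool) int {
	visited := 0
	for _, c := range candidates {
		visited++
		if !fn(c) {
			break
		}
	}
	return visited
}

func main() {
	candidates := []string{"a", "b", "c", "d", "e"}
	// Stop at the third candidate; "d" and "e" must never be visited.
	n := visitCandidates(candidates, func(c string) bool { return c != "c" })
	fmt.Println(n) // prints "3"
}
```

With five candidates and a stop at the third, the assertion on the visit count distinguishes real early termination from a loop that simply ran out of rules.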

Author

Mwea commented Feb 26, 2026

> Approved, with some minor comments/nits

Thanks! I will tackle this tomorrow.
I had some thoughts on proposing a similar solution for the Silencer, but I don't know whether it would bring as much benefit as it does here. Any thoughts, @ultrotter?

@ultrotter
Contributor

> > Approved, with some minor comments/nits
>
> Thanks! I will tackle this tomorrow. I had some thoughts on proposing a similar solution for the Silencer, but I don't know whether it would bring as much benefit as it does here. Any thoughts, @ultrotter?

We already have some caches in the Silencer to improve performance. Plus, there the rules would not be static based on the config, right? That would mean extra locking to make it work, which may or may not create an issue. Let's focus on the current patches, then we'll see what we can do if you run into scalability issues there! :)
