Slow inhibition #4606

@siavashs

Alertmanager struggles under a high volume of source alerts.
Using custom metrics, we were able to track this down to the following method:

func (r *InhibitRule) hasEqual(lset model.LabelSet, excludeTwoSidedMatch bool) (model.Fingerprint, bool) {
	now := time.Now()
Outer:
	for _, a := range r.scache.List() {
		// The cache might be stale and contain resolved alerts.
		if a.ResolvedAt(now) {
			continue
		}
		for n := range r.Equal {
			if a.Labels[n] != lset[n] {
				continue Outer
			}
		}
		if excludeTwoSidedMatch && r.TargetMatchers.Matches(a.Labels) {
			continue Outer
		}
		return a.Fingerprint(), true
	}
	return model.Fingerprint(0), false
}

An alert must be checked against:

  • every inhibit rule
    • every alert in inhibit rule source cache
      • every label from equals

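The worst-case cost is the product of these three counts (rules × cached source alerts × equal labels), so the number of checks needed to decide whether to inhibit a single alert grows quickly as rules and source alerts are added.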
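The nested checks listed above can be modelled with a small, self-contained sketch. This is not Alertmanager code: `labelSet`, `rule`, `buildRules` and `countChecks` are hypothetical names, plain maps stand in for `model.LabelSet`, and the resolved-alert and two-sided-match checks are left out. It only counts label comparisons in the worst case, where every cached source alert agrees on all equal labels except the last one.

```go
package main

import "fmt"

// labelSet is a simplified stand-in for model.LabelSet (assumption for this sketch).
type labelSet map[string]string

// rule is a hypothetical, stripped-down inhibit rule: the label names that
// must be equal, plus the cached source alerts.
type rule struct {
	equal  []string
	cached []labelSet
}

// countChecks walks every rule, every cached source alert and every equal
// label, mirroring the nested loops described above. It returns the number of
// label comparisons performed and whether a matching source alert was found.
func countChecks(rules []rule, target labelSet) (checks int, matched bool) {
	for _, r := range rules {
	nextAlert:
		for _, a := range r.cached {
			for _, n := range r.equal {
				checks++
				if a[n] != target[n] {
					continue nextAlert
				}
			}
			return checks, true
		}
	}
	return checks, false
}

// buildRules creates nRules rules, each with nAlerts cached source alerts that
// differ from the target used in main only in the last equal label ("pod").
func buildRules(nRules, nAlerts int) []rule {
	rules := make([]rule, nRules)
	for i := range rules {
		r := rule{equal: []string{"cluster", "namespace", "pod"}}
		for j := 0; j < nAlerts; j++ {
			r.cached = append(r.cached, labelSet{
				"cluster": "c1", "namespace": "n1", "pod": "other",
			})
		}
		rules[i] = r
	}
	return rules
}

func main() {
	target := labelSet{"cluster": "c1", "namespace": "n1", "pod": "p1"}
	checks, matched := countChecks(buildRules(40, 500), target)
	fmt.Printf("checks=%d matched=%v\n", checks, matched) // checks=60000 matched=false
}
```

With 40 rules, 500 cached alerts per rule and 3 equal labels, an alert that ends up not being inhibited costs 40 × 500 × 3 = 60000 comparisons in this model. In practice a mismatch on an earlier label exits the inner loop sooner, so the product is an upper bound.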

Here are the custom metrics measuring how much time it takes for an Alertmanager with ~30-40 inhibit rules and hundreds of alerts in the source caches of those rules to check a potential source alert for inhibition and NOT inhibit the alert:

[Image: graph of time spent per instance on inhibition checks that did not inhibit the alert]

The query used is:

sum by (instance) (
    increase(alertmanager_inhibitor_rule_mutes_duration_seconds_sum{rule="none"}[5m])
)

And here are the results of the existing benchmarks (including only the relevant cases with multiple rules):

goos: darwin
goarch: arm64
pkg: github.com/prometheus/alertmanager/inhibit
cpu: Apple M3 Pro
BenchmarkMutes/100_inhibition_rules,_1000_inhibiting_alerts-12  	  156823	      7628 ns/op	    8600 B/op	       8 allocs/op
BenchmarkMutes/100_inhibition_rules,_last_rule_matches-12       	  160782	      7367 ns/op	     416 B/op	       8 allocs/op
BenchmarkMutes/1000_inhibition_rules,_last_rule_matches-12      	   16780	     68020 ns/op	     416 B/op	       8 allocs/op
BenchmarkMutes/10000_inhibition_rules,_last_rule_matches-12     	    1761	    676124 ns/op	     417 B/op	       8 allocs/op
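As a quick reading aid for the three "last rule matches" cases above, the following sketch divides each ns/op figure by the number of rules. The helper name `nsPerRule` is made up for this illustration; the numbers are copied from the benchmark output.

```go
package main

import "fmt"

// nsPerRule divides a benchmark's ns/op figure by the number of inhibition rules.
func nsPerRule(nsPerOp float64, rules int) float64 {
	return nsPerOp / float64(rules)
}

func main() {
	// Figures copied from the "last rule matches" benchmark cases above.
	results := []struct {
		rules   int
		nsPerOp float64
	}{
		{100, 7367},
		{1000, 68020},
		{10000, 676124},
	}
	for _, r := range results {
		fmt.Printf("%5d rules: %.1f ns per rule\n", r.rules, nsPerRule(r.nsPerOp, r.rules))
	}
}
```

The per-rule cost stays at roughly 68 to 74 ns across all three cases, which suggests that the time to check one alert scales linearly with the number of inhibit rules in these benchmarks.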

We have developed a small patch at Cloudflare that significantly improves the performance of inhibition; a PR will be submitted for it.
