Description
Alertmanager struggles under a high volume of source alerts.
We were able to track this down to this method using custom metrics:
alertmanager/inhibit/inhibit.go
Lines 232 to 251 in 989d79b
    func (r *InhibitRule) hasEqual(lset model.LabelSet, excludeTwoSidedMatch bool) (model.Fingerprint, bool) {
        now := time.Now()
    Outer:
        for _, a := range r.scache.List() {
            // The cache might be stale and contain resolved alerts.
            if a.ResolvedAt(now) {
                continue
            }
            for n := range r.Equal {
                if a.Labels[n] != lset[n] {
                    continue Outer
                }
            }
            if excludeTwoSidedMatch && r.TargetMatchers.Matches(a.Labels) {
                continue Outer
            }
            return a.Fingerprint(), true
        }
        return model.Fingerprint(0), false
    }
An alert must be checked against:
- every inhibit rule
  - every alert in that inhibit rule's source cache
    - every label from `equal`
The number of checks needed to inhibit an alert therefore grows multiplicatively (inhibit rules × cached source alerts × `equal` labels), which becomes very expensive at high alert volume.
Here are the custom metrics measuring how much time it takes an Alertmanager with ~30-40 inhibit rules, and hundreds of alerts in the source caches of those rules, to check an alert for inhibition and end up NOT inhibiting it:
The query used is:
sum by (instance) (
increase(alertmanager_inhibitor_rule_mutes_duration_seconds_sum{rule="none"}[5m])
)
And here are the results of the existing benchmarks (only the relevant cases with multiple rules are included):
goos: darwin
goarch: arm64
pkg: github.com/prometheus/alertmanager/inhibit
cpu: Apple M3 Pro
BenchmarkMutes/100_inhibition_rules,_1000_inhibiting_alerts-12      156823      7628 ns/op    8600 B/op    8 allocs/op
BenchmarkMutes/100_inhibition_rules,_last_rule_matches-12           160782      7367 ns/op     416 B/op    8 allocs/op
BenchmarkMutes/1000_inhibition_rules,_last_rule_matches-12           16780     68020 ns/op     416 B/op    8 allocs/op
BenchmarkMutes/10000_inhibition_rules,_last_rule_matches-12           1761    676124 ns/op     417 B/op    8 allocs/op

We have developed a small patch at Cloudflare which significantly improves the performance of inhibition; a PR will be submitted for that.