Description
Currently, the Prometheus rule manager only restores the `for` state of rule groups after a restart. This is fine for Prometheus. However, in Cortex, rule groups can move from one ruler instance (r1) to another (r2) due to resharding. If r2 is already evaluating rule groups for that tenant, its manager will not restore the `for` state for the moved groups, and alerts end up in an incorrect state. For example, an alert can go from `FIRING` back to `PENDING`.
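The restore decision described above can be illustrated with a toy model. This is a simplified sketch, not the actual Prometheus or Cortex code: the `Manager` class, its `restored` flag, and its `update` method are hypothetical names, and the model assumes that a manager restores `for` state only for groups loaded in its very first update.

```python
class Manager:
    """Toy per-tenant rule manager (hypothetical, for illustration only)."""

    def __init__(self):
        self.restored = False  # flips to True after the first load
        self.groups = {}       # group name -> was its `for` state restored?

    def update(self, group_names):
        """Load the given groups; return the names whose `for` state is restored."""
        restore_now = not self.restored  # assumption: restore only on first load
        restored_groups = []
        new_groups = {}
        for name in group_names:
            if name in self.groups:
                # Group was already running here: in-memory state carries over.
                new_groups[name] = self.groups[name]
            else:
                new_groups[name] = restore_now
                if restore_now:
                    restored_groups.append(name)
        self.groups = new_groups
        self.restored = True
        return restored_groups


# r2 is already evaluating group "a" for the tenant.
r2 = Manager()
first = r2.update(["a"])        # fresh manager, so "a" is restored

# Resharding moves group "b" from r1 to r2.
second = r2.update(["a", "b"])  # "b" is new here, but nothing is restored
```

In this model, `second` is empty: group `b` starts on r2 without its previous `for` state, which matches the behavior reported here.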
To Reproduce
- Create rules for a tenant with shard size > 1. For ease of testing, all ruler instances were running rules for the tenant.
- Wait for the alerting rule to go into `FIRING`.
- Restart the instance that was evaluating the alerting rule. This assumes the ruler takes long enough to restart that another ruler evaluates the alerting rule at least once.
- The alerting rule goes to `PENDING`.
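Why losing the `for` state causes this transition can be shown with a minimal sketch. It assumes an alert is `FIRING` once it has been active for at least the rule's `for` duration and `PENDING` before that. The `alert_state` helper, the 5-minute duration, and the timestamps are illustrative choices, not values from this report.

```python
from datetime import datetime, timedelta


def alert_state(active_at, now, hold):
    """Return the alert state given when the alert first became active."""
    return "FIRING" if now - active_at >= hold else "PENDING"


hold = timedelta(minutes=5)           # illustrative `for: 5m`
t0 = datetime(2024, 1, 1, 12, 0, 0)   # arbitrary time the alert became active
now = t0 + timedelta(minutes=10)      # time of the evaluation on the new ruler

# r1 tracked the alert since t0, so it is past the `for` duration.
state_on_r1 = alert_state(t0, now, hold)

# r2 takes over without restoring: it sees the alert as newly active.
state_on_r2 = alert_state(now, now, hold)

# If the original activation time were restored, the state would be preserved.
state_if_restored = alert_state(t0, now, hold)
```

Here `state_on_r1` and `state_if_restored` are `FIRING`, while `state_on_r2` is `PENDING`, since the activation time has been reset to the current evaluation.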
Expected behavior
- The alerting rule should stay in the `FIRING` state.
Additional Context
There is a PR open against Prometheus to address this issue. Until that PR is approved, it is difficult to fix this in Cortex.