Ruler does not consistently restore for state #6465

Closed
@rajagopalanand

Description

Currently, the Prometheus rule manager restores the `for` state of rule groups only after a restart. This is fine for Prometheus. In Cortex, however, rule groups can move from one ruler instance (r1) to another (r2) due to resharding. If r2 is already evaluating rule groups for that tenant, the manager does not restore the `for` state, and alerts end up in an incorrect state. For example, an alert can go from FIRING back to PENDING.
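The effect can be illustrated with a minimal sketch. This is a simplified model, not the Prometheus or Cortex implementation: `alertingRule`, `eval`, `restore`, and `simulateHandoff` are hypothetical names, and the 5-minute `for` duration is an assumed value. It shows that a rule instance starting without the previous activation time restarts its `for` timer, so an alert that was FIRING on r1 is reported as PENDING on r2.

```go
package main

import (
	"fmt"
	"time"
)

// AlertState is a simplified alert state for this sketch.
type AlertState string

const (
	Inactive AlertState = "INACTIVE"
	Pending  AlertState = "PENDING"
	Firing   AlertState = "FIRING"
)

// alertingRule is a hypothetical, simplified model of an alerting rule
// with a `for` clause. It is not the Prometheus type.
type alertingRule struct {
	holdFor  time.Duration // the `for` duration (assumed value in this sketch)
	activeAt time.Time     // when the condition first became true; zero if unknown
}

// eval returns the alert state at time now for the given condition result.
func (r *alertingRule) eval(now time.Time, conditionTrue bool) AlertState {
	if !conditionTrue {
		r.activeAt = time.Time{}
		return Inactive
	}
	if r.activeAt.IsZero() {
		// No known activation time: the `for` timer starts from now.
		r.activeAt = now
	}
	if now.Sub(r.activeAt) >= r.holdFor {
		return Firing
	}
	return Pending
}

// restore seeds the activation time, standing in for `for` state restoration.
func (r *alertingRule) restore(activeAt time.Time) {
	r.activeAt = activeAt
}

// simulateHandoff evaluates a rule on r1 until it fires, then moves it to r2.
// It returns the state last seen on r1 and the first state seen on r2.
func simulateHandoff(restoreForState bool) (before, after AlertState) {
	start := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)

	r1 := &alertingRule{holdFor: 5 * time.Minute}
	r1.eval(start, true)
	before = r1.eval(start.Add(10*time.Minute), true)

	// Resharding: r2 takes over the rule with a fresh in-memory instance.
	r2 := &alertingRule{holdFor: 5 * time.Minute}
	if restoreForState {
		r2.restore(r1.activeAt)
	}
	after = r2.eval(start.Add(11*time.Minute), true)
	return before, after
}

func main() {
	before, after := simulateHandoff(false)
	fmt.Println("without restore:", before, "->", after)

	before, after = simulateHandoff(true)
	fmt.Println("with restore:   ", before, "->", after)
}
```

In this model the only difference between the two runs is whether r2 learns the original activation time before its first evaluation, which is the step the issue reports as being skipped when r2 is already running rule groups for the tenant.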

To Reproduce

  1. Create rules for a tenant with a shard size > 1. For ease of testing, all ruler instances were evaluating rules for the tenant.
  2. Wait for an alerting rule to go into FIRING.
  3. Restart the ruler instance that was evaluating the alerting rule. This assumes the ruler takes long enough to restart that another ruler evaluates the alerting rule at least once.
  4. The alerting rule goes to PENDING.

Expected behavior

  • The alerting rule should stay in the FIRING state.

Additional Context

There is a PR open against Prometheus to address this issue. Until that PR is approved, this issue is difficult to fix.

Metadata

Labels

component/rules, help wanted
