Ruler does not consistently restore for state #6465

Open
@rajagopalanand

Description

Currently, the Prometheus rule manager restores the for state of rule groups only after a restart. This is fine for Prometheus. In Cortex, however, a rule group can move from one ruler instance (r1) to another (r2) due to resharding. If r2 is already evaluating rule groups for that tenant, the manager does not restore the for state, and alerts end up in an incorrect state. For example, an alert can go from FIRING back to PENDING.

To Reproduce

  1. Create rules for a tenant with shard size > 1. For ease of testing, all the ruler instances were running rules for the tenant.
  2. Wait for the alerting rule to go into FIRING.
  3. Restart the instance that was evaluating the alerting rule. This assumes the ruler takes long enough to restart that another ruler evaluates the alerting rule at least once.
  4. The alerting rule goes to PENDING.

Expected behavior

  • The alert rule should stay in FIRING state

Additional Context

There is an open Prometheus PR that addresses this issue. Until that PR is approved, it is difficult to fix this in Cortex.
