Better mechanism to detect impact in terms of the number of rule groups when rulers become unhealthy. #5866

Closed
@emanlodovice

Is your feature request related to a problem? Please describe.
Currently, one way to count the rule groups for a given tenant is to count the unique rule_group label values on any of the per-rule-group metrics, such as cortex_prometheus_rule_group_rules. This gives an accurate per-tenant count as long as all rulers are up and running. When rulers become unhealthy, however, we stop receiving metrics from them, so the count of unique rule_group labels is no longer accurate. Because no metric reports the exact number of rule groups per tenant in storage, it is very difficult to determine how many rule groups are affected when a ruler becomes unhealthy, or when rulers fail to load specific rule groups (for example, during resharding).
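To illustrate the undercount, here is a minimal Go sketch of the counting approach described above. It is a model of the problem, not Cortex code: the `series` type, the `Ruler` field, and `countGroupsPerTenant` are hypothetical names, and it assumes the tenant is identified by a `user` label on the metric.

```go
package main

// series models one exported sample of a per-rule-group metric such as
// cortex_prometheus_rule_group_rules, reduced to the labels that matter here.
// The type and its fields are hypothetical, for illustration only.
type series struct {
	Ruler     string // ruler instance that exported the series
	User      string // tenant ID (assumed to be the "user" label)
	RuleGroup string // value of the rule_group label
}

// countGroupsPerTenant counts distinct rule_group values per tenant,
// skipping series from rulers that are not in the healthy set. This mirrors
// what a query over scraped metrics can see: unhealthy rulers export nothing.
func countGroupsPerTenant(all []series, healthy map[string]bool) map[string]int {
	seen := make(map[string]map[string]struct{})
	for _, s := range all {
		if !healthy[s.Ruler] {
			continue // no metrics are available from an unhealthy ruler
		}
		if seen[s.User] == nil {
			seen[s.User] = make(map[string]struct{})
		}
		seen[s.User][s.RuleGroup] = struct{}{}
	}
	counts := make(map[string]int, len(seen))
	for user, groups := range seen {
		counts[user] = len(groups)
	}
	return counts
}
```

If a tenant has three rule groups spread over two rulers and the ruler holding one of them goes down, this function reports two groups for that tenant, and nothing in the result indicates that a group is missing.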

Describe the solution you'd like
Create a new metric that reports the number of rule groups per tenant in storage. Every ruler can emit this metric for each tenant whose sub-ring includes that ruler, so the metric is not lost when some rulers go down. The per-tenant rule group count is already available during the sync rules operation: https://github.com/cortexproject/cortex/blob/master/pkg/ruler/ruler.go#L688
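A rough sketch of how the proposed count could be used, assuming the sync step hands us the rule groups read from storage keyed by tenant. The function names and the map shapes are hypothetical and do not reflect the actual ruler code. In Cortex the stored count would be exported as a gauge, which is left out here to keep the example dependency-free.

```go
package main

// storedGroupCounts returns, for each tenant, how many rule groups exist in
// storage. The input shape (tenant ID -> rule group names) is an assumption
// about what the sync step has available.
func storedGroupCounts(groupsByTenant map[string][]string) map[string]int {
	counts := make(map[string]int, len(groupsByTenant))
	for tenant, groups := range groupsByTenant {
		counts[tenant] = len(groups)
	}
	return counts
}

// missingGroups compares the stored count with the count observed from the
// per-rule-group metrics and returns, per tenant, how many rule groups are
// not being reported by any ruler. Tenants with nothing missing are omitted.
func missingGroups(stored, evaluated map[string]int) map[string]int {
	missing := make(map[string]int)
	for tenant, want := range stored {
		// A tenant absent from evaluated yields 0, so all its groups count.
		if diff := want - evaluated[tenant]; diff > 0 {
			missing[tenant] = diff
		}
	}
	return missing
}
```

With the stored count available from any healthy ruler in the tenant's sub-ring, the impact of an outage becomes the difference between the two numbers rather than something that has to be inferred.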


Labels: component/rules (Bits & bobs todo with rules and alerts: the ruler, config service etc.)
