Better mechanism to detect impact in terms of the number of rule groups when rulers become unhealthy.

**Is your feature request related to a problem? Please describe.**
Currently, one way to count the rule groups for a given tenant is to count the unique `rule_group` labels on any per-rule-group metric, such as `cortex_prometheus_rule_group_rules`. This gives an accurate count per tenant while all rulers are up and running. When rulers become unhealthy, however, their metrics are missing, so the count of unique `rule_group` labels is no longer accurate. Because no metric exposes the exact number of rule groups per tenant in storage, it is very difficult to determine how many rule groups are affected when a ruler becomes unhealthy (or when rulers fail to load specific rule groups, for example during resharding).
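To make the undercount concrete, here is a minimal sketch in plain Go. It does not use Cortex or Prometheus code: the `series` type, the ruler and tenant names, and `countGroups` are all hypothetical stand-ins for "count the unique `rule_group` labels over whatever series are currently being scraped".

```go
package main

import "fmt"

// series is a hypothetical, simplified view of one scraped sample of a
// per-rule-group metric such as cortex_prometheus_rule_group_rules.
type series struct {
	ruler     string // ruler instance that exposed the sample
	tenant    string // tenant the rule group belongs to
	ruleGroup string // value of the rule_group label
}

// countGroups returns the number of distinct rule_group values per tenant,
// looking only at samples from rulers marked healthy. Samples from unhealthy
// rulers are skipped, as they would be if the scrape failed.
func countGroups(samples []series, healthy map[string]bool) map[string]int {
	seen := map[string]map[string]struct{}{}
	for _, s := range samples {
		if !healthy[s.ruler] {
			continue
		}
		if seen[s.tenant] == nil {
			seen[s.tenant] = map[string]struct{}{}
		}
		seen[s.tenant][s.ruleGroup] = struct{}{}
	}
	out := make(map[string]int, len(seen))
	for tenant, groups := range seen {
		out[tenant] = len(groups)
	}
	return out
}

func main() {
	samples := []series{
		{ruler: "ruler-1", tenant: "tenant-1", ruleGroup: "a"},
		{ruler: "ruler-1", tenant: "tenant-1", ruleGroup: "b"},
		{ruler: "ruler-2", tenant: "tenant-1", ruleGroup: "c"},
	}

	allUp := countGroups(samples, map[string]bool{"ruler-1": true, "ruler-2": true})
	oneDown := countGroups(samples, map[string]bool{"ruler-1": true})

	fmt.Println(allUp["tenant-1"])   // 3
	fmt.Println(oneDown["tenant-1"]) // 2
}
```

With `ruler-2` down, the count drops from 3 to 2, and nothing in the remaining series says that a third group ever existed.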

**Describe the solution you'd like**
Create a new metric for the number of rule groups per tenant in storage. Every ruler can emit this metric for each tenant whose sub-ring includes that ruler, so the metric is not lost when some rulers go down. The per-tenant rule group count is already available during the sync rules operation: https://github.com/cortexproject/cortex/blob/master/pkg/ruler/ruler.go#L688
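A rough sketch of the idea, again in plain Go rather than actual Cortex code. It assumes that each ruler sees a tenant's full list of rule groups from storage during sync, before narrowing down to the groups it owns. The `report` type, `storeCounts`, and `mergeReports` are hypothetical names; in a real implementation the counts would presumably be exported as a gauge labelled by tenant.

```go
package main

import "fmt"

// report is a hypothetical stand-in for the proposed metric as exposed by a
// single ruler: tenant -> number of rule groups found in storage.
type report struct {
	ruler  string
	counts map[string]int
}

// storeCounts turns the rule groups listed from storage during a sync
// (tenant -> group names) into the per-tenant totals a ruler would emit.
func storeCounts(loaded map[string][]string) map[string]int {
	out := make(map[string]int, len(loaded))
	for tenant, groups := range loaded {
		out[tenant] = len(groups)
	}
	return out
}

// mergeReports combines the reports of healthy rulers by taking the maximum
// per tenant. Every ruler in a tenant's sub-ring reports the same storage
// total, so a single surviving report is enough to recover it.
func mergeReports(reports []report, healthy map[string]bool) map[string]int {
	out := map[string]int{}
	for _, r := range reports {
		if !healthy[r.ruler] {
			continue
		}
		for tenant, n := range r.counts {
			if n > out[tenant] {
				out[tenant] = n
			}
		}
	}
	return out
}

func main() {
	loaded := map[string][]string{"tenant-1": {"a", "b", "c"}}
	reports := []report{
		{ruler: "ruler-1", counts: storeCounts(loaded)},
		{ruler: "ruler-2", counts: storeCounts(loaded)},
	}

	// ruler-2 is down, but the storage total is still visible via ruler-1.
	total := mergeReports(reports, map[string]bool{"ruler-1": true})
	fmt.Println(total["tenant-1"]) // 3
}
```

Under these assumptions, the impact for a tenant would be the storage total minus the number of unique `rule_group` labels still being reported by healthy rulers.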


Better mechanism to detect impact in terms of the number of rule groups when rulers become unhealthy. #5866
