ruler: ignore rulers in non-operation states when getting and syncing rules #11569

Merged · 12 commits merged into main from fl/subring-get-rules on Jun 3, 2025

Conversation

@fionaliao fionaliao (Contributor) commented May 28, 2025

What this PR does

Currently, forEachRulerInTheRing() fails if there are rulers in the ring in states other than the states included in the operation argument. This is because forEachRulerInTheRing() calls GetReplicationSetForOperation(), which treats any ruler in a non-operation state as unhealthy. Since the ruler ring uses RF = 1, the number of healthy instances must match the total number of instances in the ring, so GetReplicationSetForOperation() returns an error that is propagated to forEachRulerInTheRing().

However, it's possible for all rules to be processed (evaluated or synced) correctly even when there are "unhealthy" instances, because rules are only processed on healthy instances and are redistributed across whichever rulers are healthy. The current behaviour therefore causes some endpoints to return errors more often than necessary.

As a concrete example, /api/v1/rules and /api/v1/alerts will return errors if there are any JOINING or LEAVING rulers in the ring. Rulers can be in these states during rollouts, scale-ups and scale-downs. These endpoints use GetRules(), which calls forEachRulerInTheRing() with RuleEvalRingOp. RuleEvalRingOp only considers ACTIVE rulers as healthy.
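
For context, a ring operation like RuleEvalRingOp is built from the set of instance states it treats as healthy. A minimal sketch using dskit (the exact definition in Mimir may differ, e.g. it may pass a shouldExtendReplicaSet function instead of nil):

package ruler

import "github.com/grafana/dskit/ring"

// Sketch: the rule-evaluation operation only treats ACTIVE rulers as healthy, so
// with RF = 1 a single JOINING or LEAVING ruler makes GetReplicationSetForOperation()
// return an error.
var RuleEvalRingOp = ring.NewOp([]ring.InstanceState{ring.ACTIVE}, nil)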

Instead of erroring when there are non-ACTIVE rulers in the ring, it's safe to ignore non-ACTIVE rulers in this case:

  • A ruler does not start evaluating rules until it has transitioned from JOINING to ACTIVE, and ACTIVE rulers ignore JOINING rulers when deciding which rules to evaluate. So JOINING rulers can be ignored when querying for rules.
  • When a ruler is stopped, it stops all rule evaluations before it transitions to the LEAVING state - as per this code, the rules manager is stopped and then the subservices (including the lifecycler) are stopped. This means that a ruler in the LEAVING state can't be evaluating rules, so it can safely be ignored when getting rules.
  • It can take other rulers a while to detect that a ruler is LEAVING or has left the ring, so there can be a pause in evaluation for some rules during this time. This happens with the current implementation anyway; filtering for only ACTIVE instances won't make it worse.

The ring is only eventually consistent (e.g. when using memberlist), so rulers could transition between states and rules could be redistributed while rules are being fetched (e.g. a JOINING ruler transitions to ACTIVE and rules are redistributed). It's therefore possible to get a partial set of results if rulers in other states are ignored. But similar cases can already happen today, for example if a ruler is not present in the ring when we get the list of rulers to fetch rules from, but joins the ring and becomes active while the rules are being fetched.

To implement this, I've introduced a new ring function, GetSubringForOperationStates(), in grafana/dskit#699. It is used to filter out the instances in non-operation states (for the GetRules() case, this filters out all non-ACTIVE instances) before GetReplicationSetForOperation() is called. The effect is that GetReplicationSetForOperation() will only error if there are instances in the operation states that have timed-out heartbeats.
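
Roughly, the ring handling changes as sketched below (not the exact Mimir diff; the helper name replicationSetForOp is mine, and I'm assuming GetSubringForOperationStates() returns a ring.ReadRing, based on its description in grafana/dskit#699):

// Hypothetical helper showing only the ring handling; the real
// forEachRulerInTheRing() also calls a function against each selected ruler.
func replicationSetForOp(r ring.ReadRing, op ring.Operation) (ring.ReplicationSet, error) {
	// New in this PR: drop instances whose state is not one of the operation's
	// states (e.g. JOINING/LEAVING for RuleEvalRingOp) so they aren't counted
	// as unhealthy below.
	r = r.GetSubringForOperationStates(op)

	// With RF = 1 this now errors only if an instance that is in an operation
	// state has a timed-out heartbeat.
	return r.GetReplicationSetForOperation(op)
}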

I also considered using GetAllHealthy() rather than introducing a new function. However, GetAllHealthy() silently drops rulers that are in the requested states but have timed-out heartbeats, rather than erroring. While rulers in the JOINING and LEAVING states are expected from time to time due to rollouts and scaling, ACTIVE rulers should not end up with heartbeat timeouts if everything is working as expected, so we still want an error in that case.

Another alternative was to add retries to forEachRulerInTheRing() - we don't expect the ring to always have rulers in the non-operation states. However, in the case of GetRules(), rulers can take a long time to transition from JOINING to ACTIVE if there are many rules to load, so retries would either be ineffective or have high latency.

As well as GetRules(), forEachRulerInTheRing() is used by notifySyncRules(), which notifies all the JOINING and ACTIVE rulers to sync rules. Previously this would fail, before notifying any ruler, if any rulers were in the LEAVING state. It's safe not to notify LEAVING rulers as they do not pick up new rules. The new behaviour ends up being an improvement, as the other rulers will still be notified when there are LEAVING rulers.
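
The sync operation can be sketched the same way (the variable name and exact definition are assumptions on my part; the PR only states that ACTIVE and JOINING rulers are notified and LEAVING rulers are skipped):

// Sketch: rulers notified on a rules sync. JOINING rulers are included because they
// will evaluate rules once ACTIVE; LEAVING rulers are excluded because they no
// longer pick up new rules.
var RuleSyncRingOp = ring.NewOp([]ring.InstanceState{ring.ACTIVE, ring.JOINING}, nil)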

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@fionaliao fionaliao force-pushed the fl/subring-get-rules branch 2 times, most recently from b12cbb4 to c3e85f9, on June 2, 2025 15:27
@fionaliao fionaliao changed the title from "ruler: ignore rulers in non-operation states in forEachRulerInTheRing" to "ruler: ignore rulers in non-operation states when getting and syncing rules" on Jun 2, 2025
@fionaliao fionaliao marked this pull request as ready for review June 2, 2025 17:18
@fionaliao fionaliao requested review from a team as code owners June 2, 2025 17:18
Comment on lines +577 to +602
"Shuffle Shard Size 2 with 1 joining ruler": {
shuffleShardSize: 2,
tokensByRuler: map[string][]uint32{
"ruler1": append(
[]uint32{userToken("user1", 0) + 1},
append(
generateTokenForGroups(1, rules[0], rules[1]),
generateTokenForGroups(3, rules[2])...)...,
),
"ruler2": append( // ruler2 is in JOINING state so will be ignored when distributing rules.
[]uint32{userToken("user1", 1) + 1, userToken("user2", 0) + 1},
generateTokenForGroups(1, rules[2])...,
),
"ruler3": append(
[]uint32{userToken("user2", 1) + 1, userToken("user3", 0) + 1},
generateTokenForGroups(1, rules[3], rules[4], rules[5], rules[6], rules[7], rules[8])...,
),
// While ruler4 registers some tokens for all users, it doesn't actually evaluate any rules.
// ruler1 and ruler2 are selected for user1's and user2's subrings, even though ruler2 is in the JOINING
// state (this means all of their rules are evaluated on ruler1).
// ruler4 does form part of user3's subring, but user3's single rule is evaluated on ruler3.
"ruler4": append(
[]uint32{userToken("user1", 2) + 1, userToken("user2", 2) + 1, userToken("user3", 1) + 1},
generateTokenForGroups(2, rules[2])...,
),
},
@fionaliao fionaliao (Contributor, Author) commented Jun 2, 2025

Wanted to check if the result of this test case is expected - when getting the shuffle-sharded subring for a user, non-ACTIVE rulers are not filtered out. Rules do end up being evaluated on the ACTIVE rulers in the subring though.

The issue I see is that it's possible during a rollout or scaling for all the rulers in a shuffle-sharded subring to be in a non-ACTIVE state, in which case none of that user's rules will be evaluated.

This was happening before my PR as well. It could be fixed (if it is an issue) by moving GetSubringForOperationStates(op) before calling ShuffleShard() on the ring, so the shuffle-sharded subring is built only from instances in the requested states. We'd need to be careful to update the logic in all the places that get the shuffle-sharded subring.
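
A rough sketch of that hypothetical ordering (not part of this PR; userID and shardSize stand in for however the tenant's shard is derived):

// Hypothetical: filter by the operation's states first, then shuffle shard, so the
// tenant's subring is built only from rulers already in the requested states.
func userSubring(r ring.ReadRing, op ring.Operation, userID string, shardSize int) ring.ReadRing {
	return r.GetSubringForOperationStates(op).ShuffleShard(userID, shardSize)
}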

Collaborator

I was looking exactly at this test and then I noticed your comment. I'm glad we both put our 👀 on this.

I agree this behaviour is unrelated to your PR. Is it desired? It depends. Generally speaking, this is the expected behaviour of shuffle sharding in ingesters (where shuffle sharding was first introduced): the goal is to not reshard series among different ingesters during a rolling update. I also think it's a desired property for store-gateways shuffle sharding.

Is it desired for rulers? I don't know right now (I need to think about it more). Syncing rules takes time. Given rulers unregister from the ring during a restart, changing this logic would realistically just cover the time it takes for the initial sync (when the ruler is in the JOINING state). Moving rule groups to the remaining ACTIVE ruler in the shard (if any), or to another ruler outside the original tenant's shard, may not make much difference. The only real difference may be when the shard size = 1 (if the shard is composed of just the JOINING ruler, then no rules would be evaluated while it's JOINING). However, syncing the rules on other rulers will take time too, so I'm not sure changing this logic would be a significant improvement.

@pracucci pracucci self-requested a review June 3, 2025 04:41
@pracucci pracucci (Collaborator) left a comment

I love 1 line logic changes. Thanks!

@fionaliao fionaliao force-pushed the fl/subring-get-rules branch from a889411 to 3739b62 on June 3, 2025 09:06
@fionaliao fionaliao merged commit 1f1a18a into main Jun 3, 2025
30 checks passed
@fionaliao fionaliao deleted the fl/subring-get-rules branch June 3, 2025 09:32