ruler: ignore rulers in non-operation states when getting and syncing rules #11569

Merged · 12 commits merged into main from fl/subring-get-rules on Jun 3, 2025

Conversation

@fionaliao fionaliao (Contributor) commented May 28, 2025

What this PR does

Currently, forEachRulerInTheRing() fails if there are rulers in the ring in states other than the states included in the operation argument. This is because forEachRulerInTheRing() calls GetReplicationSetForOperation(), which treats any ruler in a non-operation state as unhealthy. Since the ruler ring uses RF = 1, the number of healthy instances must match the total number of instances in the ring, so GetReplicationSetForOperation() returns an error that is propagated to forEachRulerInTheRing().

However, it's possible for all rules to be processed (evaluated or synced) correctly even when there are "unhealthy" instances, because rules are only processed on healthy instances and are redistributed across whichever rulers are healthy. The current behaviour therefore causes some endpoints to return errors more often than necessary.

As a concrete example, /api/v1/rules and /api/v1/alerts will return errors if there are any JOINING or LEAVING rulers in the ring. Rulers can be in these states during rollouts, scale-ups and scale-downs. These endpoints use GetRules(), which calls forEachRulerInTheRing() with RuleEvalRingOp. RuleEvalRingOp only considers ACTIVE rulers as healthy.
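
For context, a ring operation like RuleEvalRingOp is built from the set of instance states it treats as healthy. A minimal sketch using dskit (the exact definition in Mimir may differ, e.g. it may pass a shouldExtendReplicaSet function instead of nil):

package ruler

import "github.com/grafana/dskit/ring"

// Sketch: the rule-evaluation operation only treats ACTIVE rulers as healthy, so
// with RF = 1 a single JOINING or LEAVING ruler makes GetReplicationSetForOperation()
// return an error.
var RuleEvalRingOp = ring.NewOp([]ring.InstanceState{ring.ACTIVE}, nil)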

Instead of erroring when there are non-ACTIVE rulers in the ring, it's safe to ignore non-ACTIVE rulers in this case:

  • A ruler does not start evaluating rules until it has transitioned from JOINING to ACTIVE, and ACTIVE rulers ignore JOINING rulers when deciding which rules to evaluate. So JOINING rulers can be ignored when querying for rules.
  • When a ruler is stopped, it stops all rule evaluations before it transitions to the LEAVING state - as per this code, the rules manager is stopped and then the subservices (including the lifecycler) are stopped. This means that a ruler in the LEAVING state can't be evaluating rules, so it can safely be ignored when getting rules.
  • It can take other rulers a while to detect that a ruler is LEAVING or has left the ring, so there can be a pause in evaluation for some rules during this time. This happens with the current implementation anyway; filtering for only ACTIVE instances won't make it worse.

The ring is only eventually consistent (e.g. when using memberlist), so rulers could transition between states and rules could be redistributed while rules are being fetched (e.g. a JOINING ruler transitions to ACTIVE and rules are redistributed). It's therefore possible to get a partial set of results if rulers in other states are ignored. But similar cases can already happen today, for example if a ruler is not present in the ring when we get the list of rulers to fetch rules from, but joins the ring and becomes active while the rules are being fetched.

To implement this, I've introduced a new ring function, GetSubringForOperationStates(), in grafana/dskit#699. It is used to filter out the instances in non-operation states (for the GetRules() case, this filters out all non-ACTIVE instances) before GetReplicationSetForOperation() is called. The effect is that GetReplicationSetForOperation() will only error if there are instances in the operation states that have timed-out heartbeats.
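
Roughly, the ring handling changes as sketched below (not the exact Mimir diff; the helper name replicationSetForOp is mine, and I'm assuming GetSubringForOperationStates() returns a ring.ReadRing, based on its description in grafana/dskit#699):

// Hypothetical helper showing only the ring handling; the real
// forEachRulerInTheRing() also calls a function against each selected ruler.
func replicationSetForOp(r ring.ReadRing, op ring.Operation) (ring.ReplicationSet, error) {
	// New in this PR: drop instances whose state is not one of the operation's
	// states (e.g. JOINING/LEAVING for RuleEvalRingOp) so they aren't counted
	// as unhealthy below.
	r = r.GetSubringForOperationStates(op)

	// With RF = 1 this now errors only if an instance that is in an operation
	// state has a timed-out heartbeat.
	return r.GetReplicationSetForOperation(op)
}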

I also considered using GetAllHealthy() rather than introducing a new function. However, GetAllHealthy() silently drops rulers that are in the requested states but have timed-out heartbeats, rather than erroring. While rulers in the JOINING and LEAVING states are expected from time to time due to rollouts and scaling, ACTIVE rulers should not end up with heartbeat timeouts if everything is working as expected, so we still want an error in that case.

Another alternative was to add retries to forEachRulerInTheRing() - we don't expect the ring to always have rulers in the non-operation states. However, in the case of GetRules(), rulers can take a long time to transition from JOINING to ACTIVE if there are many rules to load, so retries would either be ineffective or have high latency.

As well as GetRules(), forEachRulerInTheRing() is used by notifySyncRules(), which notifies all the JOINING and ACTIVE rulers to sync rules. Previously this would fail, before notifying any ruler, if any rulers were in the LEAVING state. It's safe not to notify LEAVING rulers as they do not pick up new rules. The new behaviour ends up being an improvement, as the other rulers will still be notified when there are LEAVING rulers.
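
The sync operation can be sketched the same way (the variable name and exact definition are assumptions on my part; the PR only states that ACTIVE and JOINING rulers are notified and LEAVING rulers are skipped):

// Sketch: rulers notified on a rules sync. JOINING rulers are included because they
// will evaluate rules once ACTIVE; LEAVING rulers are excluded because they no
// longer pick up new rules.
var RuleSyncRingOp = ring.NewOp([]ring.InstanceState{ring.ACTIVE, ring.JOINING}, nil)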

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@fionaliao fionaliao force-pushed the fl/subring-get-rules branch 2 times, most recently from b12cbb4 to c3e85f9, on June 2, 2025 15:27
@fionaliao fionaliao changed the title from "ruler: ignore rulers in non-operation states in forEachRulerInTheRing" to "ruler: ignore rulers in non-operation states when getting and syncing rules" on Jun 2, 2025
@fionaliao fionaliao marked this pull request as ready for review June 2, 2025 17:18
@fionaliao fionaliao requested review from a team as code owners June 2, 2025 17:18
Comment on lines +577 to +602
"Shuffle Shard Size 2 with 1 joining ruler": {
shuffleShardSize: 2,
tokensByRuler: map[string][]uint32{
"ruler1": append(
[]uint32{userToken("user1", 0) + 1},
append(
generateTokenForGroups(1, rules[0], rules[1]),
generateTokenForGroups(3, rules[2])...)...,
),
"ruler2": append( // ruler2 is in JOINING state so will be ignored when distributing rules.
[]uint32{userToken("user1", 1) + 1, userToken("user2", 0) + 1},
generateTokenForGroups(1, rules[2])...,
),
"ruler3": append(
[]uint32{userToken("user2", 1) + 1, userToken("user3", 0) + 1},
generateTokenForGroups(1, rules[3], rules[4], rules[5], rules[6], rules[7], rules[8])...,
),
// While ruler4 registers some tokens for all users, it doesn't actually evaluate any rules.
// ruler1 and ruler2 are selected for user1's and user2's subrings, even though ruler2 is in the JOINING
// state (this means all of their rules are evaluated on ruler1).
// ruler4 does form part of user3's subring, but user3's single rule is evaluated on ruler3.
"ruler4": append(
[]uint32{userToken("user1", 2) + 1, userToken("user2", 2) + 1, userToken("user3", 1) + 1},
generateTokenForGroups(2, rules[2])...,
),
},
@fionaliao fionaliao (Contributor, Author) commented Jun 2, 2025

Wanted to check if the result of this test case is expected - when getting the shuffle-sharded subring for a user, non-ACTIVE rulers are not filtered out. Rules do end up being evaluated on the ACTIVE rulers in the subring though.

The issue I see is that it's possible during a rollout or scaling for all the rulers in a shuffle-sharded subring to be in a non-ACTIVE state, in which case none of that user's rules will be evaluated.

This was happening before my PR as well. It could be fixed (if it is an issue) by moving GetSubringForOperationStates(op) before calling ShuffleShard() on the ring, so the shuffle-sharded subring is built only from instances in the requested states. We'd need to be careful to update the logic in all the places that get the shuffle-sharded subring.
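
A rough sketch of that hypothetical ordering (not part of this PR; userID and shardSize stand in for however the tenant's shard is derived):

// Hypothetical: filter by the operation's states first, then shuffle shard, so the
// tenant's subring is built only from rulers already in the requested states.
func userSubring(r ring.ReadRing, op ring.Operation, userID string, shardSize int) ring.ReadRing {
	return r.GetSubringForOperationStates(op).ShuffleShard(userID, shardSize)
}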

Collaborator

I was looking exactly at this test and then I noticed your comment. I'm glad we both put our 👀 on this.

I agree this behaviour is unrelated to your PR. Is it desired? It depends. Generally speaking, this is the expected behaviour of shuffle sharding in ingesters (where shuffle sharding was first introduced): the goal is to not reshard series among different ingesters during a rolling update. I also think it's a desired property for store-gateways shuffle sharding.

Is it desired for rulers? I don't know right now (I need to think about it more). Syncing rules takes time. Given rulers unregister from the ring during a restart, changing this logic would realistically just cover the time it takes for the initial sync (when the ruler is in the JOINING state). Moving rule groups to the remaining ACTIVE ruler in the shard (if any), or to another ruler outside the original tenant's shard, may not make much difference. The only real difference may be when the shard size = 1 (if the shard is composed of just the JOINING ruler, then no rules would be evaluated while it's JOINING). However, syncing the rules on other rulers will take time too, so I'm not sure changing this logic would be a significant improvement.

@pracucci pracucci self-requested a review June 3, 2025 04:41
@pracucci pracucci (Collaborator) left a comment

I love 1 line logic changes. Thanks!

@fionaliao fionaliao force-pushed the fl/subring-get-rules branch from a889411 to 3739b62 on June 3, 2025 09:06
@fionaliao fionaliao merged commit 1f1a18a into main Jun 3, 2025
30 checks passed
@fionaliao fionaliao deleted the fl/subring-get-rules branch June 3, 2025 09:32