
Ruler HA - Proposal #5862


Merged: 3 commits, Sep 4, 2024
53 changes: 53 additions & 0 deletions docs/proposals/ruler-ha-new.md
@@ -0,0 +1,53 @@
---
title: "Ruler High Availability"
linkTitle: "Ruler High Availability"
weight: 1
slug: ruler-high-availability
---

- Author: [Anand Rajagopal](https://github.com/rajagopalanand)
- Date: Aug 2024
- Status: Proposed
---

## Problem

Rulers in Cortex currently run with a replication factor of 1, wherein each RuleGroup is assigned to exactly 1 ruler. This lack of redundancy creates the following risks:

- Rule group evaluation
  - Missed evaluations due to a ruler outage, possibly caused by a deployment, noisy neighbour, hardware failure, etc.
  - Missed evaluations due to a ruler brownout caused by other tenants' rule groups sharing the same ruler (noisy neighbour)
- API
  - Inconsistent API results during resharding (e.g. due to a deployment) when rulers are in a transition state loading rule groups

This proposal attempts to mitigate the above risks by enabling a ruler replication factor greater than 1, allowing multiple rulers to evaluate the same rule group.

## Proposal

### Make ReplicationFactor configurable

ReplicationFactor in Ruler is currently hardcoded to 1. Making it a configurable parameter is the first step to enabling HA in the ruler. The parameter value will be 1 by default. To enable Ruler HA for rule group evaluation, a new flag will be created.

A replication factor greater than 1 will result in the following:

- Ring will pick R rulers for a rule group where R=RF
- The primary ruler (R1), when active, will take ownership of the rule group
- Non-primary ruler R2 will check if R1 is active. If R1 is not active, R2 will take ownership of the rule group
- Non-primary ruler R3 (if RF=3) will check if R1 and R2 are active. If they are both inactive/unhealthy, then R3 will take ownership of the rule group
- Non-primary rulers will drop their ownership when R1 becomes active after an outage
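
The ownership rules above can be sketched as "the first active replica in ring order evaluates the rule group". The following is a minimal illustration under that reading, not the Cortex implementation: the `replica` type and the `ownerIndex` and `shouldEvaluate` functions are hypothetical names, and how a ruler learns whether its peers are active is left out.

```go
package main

import "fmt"

// replica is a hypothetical view of one ruler that the ring selected for a
// rule group. Index 0 is the primary (R1), index 1 is R2, and so on.
type replica struct {
	addr   string
	active bool
}

// ownerIndex returns the position of the replica that should evaluate the
// rule group: the first active replica in ring order, or -1 if none is active.
func ownerIndex(replicas []replica) int {
	for i, r := range replicas {
		if r.active {
			return i
		}
	}
	return -1
}

// shouldEvaluate reports whether the replica at position self currently owns
// the rule group.
func shouldEvaluate(replicas []replica, self int) bool {
	return ownerIndex(replicas) == self
}

func main() {
	rs := []replica{{"ruler-1", false}, {"ruler-2", true}, {"ruler-3", true}}
	fmt.Println(shouldEvaluate(rs, 1)) // true: R1 is down, so R2 takes ownership

	rs[0].active = true
	fmt.Println(shouldEvaluate(rs, 1)) // false: R1 is back, so R2 drops ownership
}
```

Because each ruler re-runs this check only when it syncs its rule groups, a failed primary is noticed at the next sync rather than immediately.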

With this redundancy, the maximum duration of missed evaluations will be limited to the sync interval of the rule groups, reducing the impact of primary Ruler unavailability.

### Prometheus change

No Prometheus change is required for this proposal.

### API HA

An interim solution is addressed in PR [#5773](https://github.com/cortexproject/cortex/issues/5773). It will be modified so that the replicas return both active and passive rule groups, and the API handler will continue to de-duplicate the results.
The difference is that, with Ruler HA, a replica can return the proper rule group state if that replica evaluated the rule group.

PRs:

* For Rule evaluation [#6129](https://github.com/cortexproject/cortex/pull/6129)
* For API HA [#5773](https://github.com/cortexproject/cortex/issues/5773)
4 changes: 3 additions & 1 deletion docs/proposals/ruler-ha.md
@@ -7,11 +7,13 @@ slug: ruler-ha

- Author: [Soon-Ping Phang](https://github.com/soonping-amzn)
- Date: June 2022
- Status: Proposed
- Status: Deprecated
---

## Introduction

_This proposal is deprecated in favor of the new [proposal](./ruler-ha-new.md)_

This proposal consolidates multiple existing PRs from the AWS team working on this feature, as well as future work needed to complete support. The hope is that a more holistic view will make for more productive discussion and review of the individual changes, as well as provide better tracking of overall progress.

The original issue is [#4435](https://github.com/cortexproject/cortex/issues/4435).