Rulers HA

**Is your feature request related to a problem? Please describe.**
Currently the Ruler's `ReplicationFactor` is [hardcoded](https://github.com/cortexproject/cortex/blob/master/pkg/ruler/ruler_ring.go#L99) to 1.

Running each rule group in a single ruler causes several problems, described below:

* Rules Group Evaluation
  * Noisy neighbor: As rule groups are evaluated by only one ruler, they can be affected by a noisy neighbor placed on the same ruler. The impact can range from delayed rule evaluation (visible in the `rule_group_iterations_missed_total` metric) to a complete outage (for instance, when the ruler is OOMing and ends up in CrashLoopBackOff).
  * Total/partial hardware Failure: In this case, all rule groups on the impaired hardware will be impacted.

* Rules API:
  * 5XX in case of a [single ruler outage](https://github.com/cortexproject/cortex/blob/master/pkg/ruler/ruler.go#L755)
    * This is a problem not only during a ruler outage but during deployments as well (especially with memberlist), as ring changes can take some time to propagate.
  * Inconsistent Results:
     * During deployments the rules are resharded, and when an existing ruler gets assigned a new set of rule groups, it has to load them: https://github.com/cortexproject/cortex/blob/master/pkg/ruler/ruler.go#L460
     * Requests made between the time the reshard is triggered and the time the ruler finishes loading the new rules will return inconsistent results.

**Describe the solution you'd like**

To achieve HA we could make the `ReplicationFactor` configurable and run each rule group in multiple rulers. Running the same rule group in multiple rulers should not be a problem, as all replicas must use the same [slotted intervals](https://github.com/prometheus/prometheus/blob/b878527151e6503d24ac5b667b86e8794eb79ff7/rules/manager.go#L509) to evaluate the rules. In other words, all replicas should use the same timestamps to evaluate the rules and create the metrics.

Below is an example from a POC where 3 rulers evaluate the same rule group; we can see that the evaluation interval is respected:

```
groups:
- name: test
  interval:  3m

  rules:
  - record: alantestTime
    expr: time()
```
<img width="1667" alt="Screen Shot 2021-08-20 at 2 23 30 PM" src="https://user-images.githubusercontent.com/4027760/130294759-4cf39e8f-7ce8-4412-959f-cac43ef0f680.png">
<img width="1644" alt="Screen Shot 2021-08-20 at 2 23 40 PM" src="https://user-images.githubusercontent.com/4027760/130294766-b9136ab9-8390-45f0-9c1b-2f784a4463f7.png">


The problem now is that we have multiple rulers generating the same metrics and therefore getting "Duplicated Samples" errors. One possible solution would be to ignore the duplicated-samples error in the ruler [Pusher](https://github.com/cortexproject/cortex/blob/master/pkg/ruler/compat.go#L75), but those samples would still be counted in the ingester [discarded samples metric](https://github.com/cortexproject/cortex/blob/master/pkg/ingester/ingester_v2.go#L831), and detecting on the ruler's Pusher side whether the error returned by the ingesters was a "Duplicated Samples" error could be challenging (probably a string comparison). I think a better solution would be to make the ingester not return the "Duplicated Samples" error at all **WHEN** the samples received were sent by a ruler. Fortunately, we have this information on the ingesters:

https://github.com/cortexproject/cortex/blob/b4daa22055ffec14311d8b5d2d9429f1bd575dad/pkg/ingester/ingester_v2.go#L937-L944

**Describe alternatives you've considered**

One option would be to use the HA tracker and let the distributor dedup the duplicated samples. In this case we could use the pod name as `__replica__`, but we cannot have a single value for the cluster label (it would cause problems if the shard size > replication factor). A possible solution would be to calculate the cluster value from all the rulers in the rule group's [replicaSet](https://github.com/cortexproject/cortex/blob/master/pkg/ruler/ruler.go#L409) (ex: for ruleGroupA we have 3 rulers; sort the 3 rulers and use the result as the cluster value).
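A minimal sketch of deriving such a cluster value, assuming the replica set is available as a list of ruler addresses. The function name `clusterValue` and the comma-joined format are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// clusterValue builds a deterministic cluster label value from the rulers in
// a rule group's replica set. Sorting a copy makes the result independent of
// the order in which the ring returns the instances.
func clusterValue(replicaSet []string) string {
	sorted := append([]string(nil), replicaSet...)
	sort.Strings(sorted)
	return strings.Join(sorted, ",")
}

func main() {
	// Hypothetical ruler addresses for ruleGroupA.
	fmt.Println(clusterValue([]string{"ruler-2", "ruler-0", "ruler-1"}))
}
```

Every ruler in the replica set computes the same value, while rule groups with different replica sets get different cluster values.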


The problem with this approach is that in the [Pusher](https://github.com/cortexproject/cortex/blob/master/pkg/ruler/compat.go#L70) implementation we don't know which rule group generated the metric, although Cortex could be changed to add this info to the ctx. The other drawback of this solution is that it adds the cluster label to the metrics generated by the rules.

Another option would be to use the distributor's haTracker component (which we would refactor to make it usable by other components) to track, in the ruler itself, which ruler is the leader for a given rule group. This solution has the same problem as the previous one: we don't know which rule group is generating the metric in the Pusher interface. However, it would not add the cluster label to the metrics generated by the rules.

**Additional context**
Any other solution I could come up with required further changes to Prometheus and would not bring significant advantages.
Ex: add the rule group info to the [context](https://github.com/prometheus/prometheus/blob/main/rules/manager.go#L333)

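Ingester snippet from the `ingester_v2.go#L937-L944` link above, where the request source is already known:

	switch req.Source {
	case cortexpb.RULE:
		db.ingestedRuleSamples.Add(int64(succeededSamplesCount))
	case cortexpb.API:
		fallthrough
	default:
		db.ingestedAPISamples.Add(int64(succeededSamplesCount))
	}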

Rulers HA #4435