Ruler API HA #5773

Closed
@emanlodovice

Description

Is your feature request related to a problem? Please describe.
Currently the ReplicationFactor for rulers is hard coded to 1.

Loading each rule group onto just 1 ruler creates an availability problem for the Rules API, as described in #4435.

Right now, the ruler returns 5XX from the API if there is an outage in at least one ruler instance. I am assuming this is because the remaining rulers cannot return a complete list of rule groups to the caller.

Describe the solution you'd like
If a ruler restarts, it loses the state of the rule groups it is running. By state I mean information like Alerts, Health, EvaluationDuration, LastError, etc. The state only gets set when the rule group evaluates, which can be minutes after the ruler starts because it depends on the rule group interval.

While rulers don't have evaluation HA (rule group state is lost after a ruler restart or reshard), we can still have Rules API HA by allowing a higher ReplicationFactor for rulers. Only the first ruler in the replica set evaluates the rule group; the remaining replicas just load the rule group so that they have the rule group information needed to respond to API calls. This means that with ReplicationFactor set to 3, 3 rulers load the rule group but only 1 evaluates it.
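A minimal Go sketch of that role split, for illustration only. The `Role` type, the `roleFor` helper, and the addresses are hypothetical names, not existing Cortex code; the sketch assumes the ring returns the replica set for a rule group as an ordered list of instance addresses.

```go
package main

import "fmt"

// Role describes what a ruler does with a rule group (hypothetical type).
type Role int

const (
	RoleNone     Role = iota // not in the replica set: ignore the group
	RoleEvaluate             // first replica: load and evaluate the group
	RoleBackup               // other replicas: load only, to answer API calls
)

// roleFor returns the role of the instance at address self for a rule group
// whose replica set, in ring order, is replicas.
func roleFor(replicas []string, self string) Role {
	for i, addr := range replicas {
		if addr != self {
			continue
		}
		if i == 0 {
			return RoleEvaluate
		}
		return RoleBackup
	}
	return RoleNone
}

func main() {
	// Replica set for one rule group with ReplicationFactor = 3.
	replicas := []string{"ruler-a:9095", "ruler-b:9095", "ruler-c:9095"}
	for _, self := range []string{"ruler-a:9095", "ruler-c:9095", "ruler-d:9095"} {
		fmt.Printf("%s -> role %d\n", self, roleFor(replicas, self))
	}
}
```

Checking membership by position rather than by fixed indexes keeps the helper independent of the configured replication factor.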

In the API handler, each ruler returns both the rule groups it is evaluating and the rule groups it has loaded but is NOT evaluating. The resulting list is de-duplicated by keeping, for each rule group, the copy with the latest LastEvaluation value. This way, the rule group information coming from the ruler that evaluates the rule group is always selected. If that ruler has an outage, we can still return the rule groups it was evaluating, because they are loaded by other rulers, but with a blank state.

Sample pseudocode of the idea, assuming a replication factor of 3 with AZ awareness enabled:

rule_groups_to_evaluation = []
rule_groups_to_backup = []
for rule_group in rule_groups_from_s3:
    hash = tokenForGroup(rule_group)
    rulers = ring.Get(hash, RingOp)
    if rulers[0].Addr == curInstanceAddr:
        rule_groups_to_evaluation.add(rule_group)
    else if rulers[1].Addr == curInstanceAddr || rulers[2].Addr == curInstanceAddr:
        rule_groups_to_backup.add(rule_group)

function GetRules() {
    // getLocalRules currently exists in cortex and it returns the rule
    // groups that this ruler is evaluating
    rule_groups = getLocalRules()
    for rule_group in rule_groups_to_backup:
        rule_groups.add(rule_group)
    return rule_groups
}


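function ListRules() {
    rulers = ring.GetReplicationSet()
    rule_groups = []
    failure_az = set()
    last_err = nil
    for ruler in rulers:
        client = clientPool.GetClientFor(ruler.Addr)
        states, err = client.GetRules()
        if err != nil:
            failure_az.add(ruler.AZ)
            // remember the error; err is overwritten on the next iteration
            last_err = err
        else:
            rule_groups.join(states)
    // rulers failing in more than one AZ means some rule groups may be missing
    if len(failure_az) > 1:
        return last_err
    remove_duplicates(rule_groups)
    return rule_groups
}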
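As a rough Go sketch of the same fan-out, the version below tolerates failures confined to a limited number of zones. Everything here is hypothetical: `Ruler`, `fetchFunc`, `listRules`, and `errDown` are illustrative names, rule groups are reduced to plain strings, and the calls are sequential for brevity, whereas a real implementation would likely query rulers concurrently.

```go
package main

import (
	"errors"
	"fmt"
)

// Ruler is a simplified, hypothetical ring instance.
type Ruler struct{ Addr, Zone string }

// fetchFunc stands in for a per-ruler GetRules call (hypothetical).
type fetchFunc func(addr string) ([]string, error)

var errDown = errors.New("ruler unavailable")

// listRules queries every ruler and fails only when rulers in more than
// maxFailedZones distinct zones returned an error.
func listRules(rulers []Ruler, fetch fetchFunc, maxFailedZones int) ([]string, error) {
	failedZones := make(map[string]struct{})
	seen := make(map[string]struct{})
	var groups []string
	var lastErr error
	for _, r := range rulers {
		got, err := fetch(r.Addr)
		if err != nil {
			failedZones[r.Zone] = struct{}{}
			lastErr = err
			continue
		}
		for _, g := range got {
			if _, dup := seen[g]; !dup { // drop copies reported by several replicas
				seen[g] = struct{}{}
				groups = append(groups, g)
			}
		}
	}
	if len(failedZones) > maxFailedZones {
		return nil, fmt.Errorf("rulers failed in %d zones: %w", len(failedZones), lastErr)
	}
	return groups, nil
}

func main() {
	rulers := []Ruler{{"ruler-a", "az-1"}, {"ruler-b", "az-2"}, {"ruler-c", "az-3"}}
	fetch := func(addr string) ([]string, error) {
		if addr == "ruler-a" {
			return nil, errDown // simulate an outage in az-1
		}
		return []string{"group-1", "group-2"}, nil
	}
	groups, err := listRules(rulers, fetch, 1)
	fmt.Println(groups, err)
}
```

With a replication factor of 3 spread across 3 zones, passing `maxFailedZones = 1` matches the `len(failure_az) > 1` check in the pseudocode.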

Describe alternatives you've considered
An alternative solution that was considered was to store the state of the rule groups in persistent storage such as an SQL database. The rulers would write the state to this database on every rule evaluation. The Rules API could then just read from this database instead of doing a fan-out request to all rulers in the ring.

But the unpredictable nature of alerts in alerting rules could result in a huge amount of data being written to the database, which could negatively affect performance. Also, adding a database is a huge commitment that could become problematic in the future when we have to adjust our data formats.


Labels: component/rules (Bits & bobs todo with rules and alerts: the ruler, config service etc.)
