Low Level RestClient LLClient, rapidly clear out blacklist when everything was blacklisted but one node is successfully revived #118226

Open
@jonenst

Description

After an event where the whole cluster becomes unreachable, with enough failed requests all nodes are in the blacklist for up to 30 minutes. When the cluster becomes reachable again, one node is revived and all the other nodes stay in the blacklist. This effectively disables client side load balancing for 30 minutes.
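For illustration, here is a minimal sketch of how a per-node blacklist timeout could grow to the 30-minute cap. It is not the actual client code: the 1-minute initial timeout, the sqrt(2) growth factor per failure, and the names `BackoffSketch`, `timeoutNanos` and `failuresToReachCap` are all assumptions; only the 30-minute limit comes from this issue.

```java
import java.util.concurrent.TimeUnit;

class BackoffSketch {
    // Assumed initial timeout (not confirmed by this issue).
    static final long MIN_TIMEOUT_NANOS = TimeUnit.MINUTES.toNanos(1);
    // The 30-minute cap mentioned in this issue.
    static final long MAX_TIMEOUT_NANOS = TimeUnit.MINUTES.toNanos(30);

    // Blacklist timeout applied after a failure, given how many failures
    // the node already had. Assumed growth: factor sqrt(2) per failure.
    public static long timeoutNanos(int previousFailedAttempts) {
        double grown = MIN_TIMEOUT_NANOS * Math.pow(2, previousFailedAttempts * 0.5);
        return (long) Math.min(grown, MAX_TIMEOUT_NANOS);
    }

    // Smallest number of previous failures for which the timeout hits the cap.
    public static int failuresToReachCap() {
        int k = 0;
        while (timeoutNanos(k) < MAX_TIMEOUT_NANOS) {
            k++;
        }
        return k;
    }
}
```

Under these assumptions, `BackoffSketch.failuresToReachCap()` returns 10, so a node that keeps failing is blacklisted for the full 30 minutes after about ten failures.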

I believe attempting to revive only one node is a good strategy, because request failures are strongly correlated when all nodes are in the blacklist (it greatly reduces the number of failures during the downtime). But similarly, successful revives should also be strongly correlated when all nodes were in the blacklist, and there is a lot of potential gain in trying (restoring client side load balancing), so maybe the blacklist should be cleared out rapidly in this case, so that other nodes may be used again. Immediately clearing the whole blacklist could lead to the next request having many failed retries (if only a few nodes of the cluster were actually reachable again), so it's probably too dangerous. But maybe a more progressive approach could work: only one new node is retested at each request, and if it is not available, the node that was initially revived is used immediately (and the tested node is not a candidate again for being rapidly removed from the deadlist, at least not until another "node revived when all were in the blacklist" event)?
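A rough sketch of this progressive approach, to make the idea concrete. Everything here is hypothetical (the class `ProgressiveRevivalSketch` and its methods are not part of RestClient), and it ignores the regular blacklist timeouts entirely: each request probes at most one still-blacklisted node first, falls back to the known-good nodes, and every candidate gets only one rapid retest.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class ProgressiveRevivalSketch {
    // Nodes known to work since the last "revived when all were dead" event.
    private final Set<String> alive = new LinkedHashSet<>();
    // Blacklisted nodes that have not had their single rapid retest yet.
    private final Deque<String> probeCandidates = new ArrayDeque<>();

    // Called when one node is successfully revived while every node was blacklisted.
    public void onRevivedWhileAllDead(String revived, List<String> stillBlacklisted) {
        alive.clear();
        alive.add(revived);
        probeCandidates.clear();
        probeCandidates.addAll(stillBlacklisted);
    }

    // Hosts to try, in order, for the next request:
    // at most one probe candidate, then the known-good nodes as fallback.
    public List<String> hostsForNextRequest() {
        List<String> order = new ArrayList<>();
        String probe = probeCandidates.peekFirst();
        if (probe != null) {
            order.add(probe);
        }
        order.addAll(alive);
        return order;
    }

    // Report the outcome of a probe. The candidate is dropped either way,
    // so a failing node is left to the normal blacklist timeout afterwards.
    public void onProbeResult(String host, boolean success) {
        if (probeCandidates.remove(host) && success) {
            alive.add(host);
        }
    }
}
```

With this sketch, a failed probe costs one extra attempt on a single request, whereas clearing the whole blacklist at once could cost one failed attempt per still-unreachable node on the very next request.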
Or maybe a system where the remaining blacklist duration of the revived node is subtracted from the durations of all other nodes. If they all became unreachable at around the same time, they would all soon be cleared from the blacklist. But if the cluster spent a long time with only one available node, and that node becomes unavailable and then available again soon after, it should be the only one cleared quickly from the blacklist. (With exponential backoff, around 10 requests are enough to reach the 30 minutes limit, so this would be very limited. Maybe compare the failed attempts per node instead: even though that count grows very rapidly when all nodes are unreachable, because it is no longer limited by the blacklist timeout since every request tries to revive one node, it will still keep the original difference, as the exponential backoff leads the system to rapidly load-balance across all nodes. So it keeps the difference of ~50 failed attempts per day of unavailability.)
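The subtraction idea could look roughly like the sketch below. It assumes the blacklist can be viewed as a map from node name to a "dead until" timestamp in nanoseconds; the class `BlacklistShiftSketch` and the method `shiftDeadlines` are hypothetical names, not existing client API.

```java
import java.util.HashMap;
import java.util.Map;

class BlacklistShiftSketch {
    // Returns new deadlines for all nodes except the revived one, each moved
    // earlier by the blacklist time the revived node still had left.
    public static Map<String, Long> shiftDeadlines(
            Map<String, Long> deadUntilNanos, String revived, long nowNanos) {
        long remaining = Math.max(0L, deadUntilNanos.get(revived) - nowNanos);
        Map<String, Long> shifted = new HashMap<>();
        for (Map.Entry<String, Long> e : deadUntilNanos.entrySet()) {
            if (!e.getKey().equals(revived)) {
                shifted.put(e.getKey(), e.getValue() - remaining);
            }
        }
        return shifted;
    }
}
```

For example, with `now = 100`, a revived node dead until 400, and two other nodes dead until 410 and 1900, the other deadlines become 110 and 1600: the node that failed at about the same time as the revived one is retried almost immediately, while the node that had been failing for much longer stays blacklisted.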

Note: I reasoned from reading the source code, not from actual events, so I'm not 100% sure this is accurate. See https://github.com/elastic/elasticsearch/blob/main/client/rest/src/main/java/org/elasticsearch/client/RestClient.java#L498-L506
