Low Level RestClient (LLClient): rapidly clear out the blacklist when every node was blacklisted but one node is successfully revived #118226

### Description

After an event in which the whole cluster becomes unreachable, enough failed requests put every node in the blacklist for up to 30 minutes. When the cluster becomes reachable again, one node is revived and all the other nodes stay in the blacklist until their timeouts expire. This effectively disables client-side load balancing for up to 30 minutes.
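To make the scenario concrete, here is a minimal sketch of the behavior described above. It is a simplified model written for this issue, not the actual `RestClient` code: the class `BlacklistModel`, its method names, and the rule "revive the node whose timeout expires first" are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified model of a client-side blacklist (not the RestClient implementation). */
final class BlacklistModel {
    // node name -> timestamp (ms) until which the node is considered dead
    private final Map<String, Long> deadUntil = new HashMap<>();
    private final List<String> nodes;

    BlacklistModel(List<String> nodes) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("at least one node is required");
        }
        this.nodes = new ArrayList<>(nodes);
    }

    void markDead(String node, long untilMillis) {
        deadUntil.put(node, untilMillis);
    }

    void markAlive(String node) {
        deadUntil.remove(node);
    }

    /** Returns the nodes a request may be sent to at the given time. */
    List<String> selectNodes(long nowMillis) {
        List<String> live = new ArrayList<>();
        for (String node : nodes) {
            Long until = deadUntil.get(node);
            if (until == null || until <= nowMillis) {
                live.add(node);
            }
        }
        if (!live.isEmpty()) {
            return live;
        }
        // Every node is blacklisted: revive only one (assumed rule: earliest deadline first).
        String best = nodes.get(0);
        for (String node : nodes) {
            if (deadUntil.get(node) < deadUntil.get(best)) {
                best = node;
            }
        }
        return List.of(best);
    }
}
```

In this model, once the revived node answers and is marked alive, `selectNodes` keeps returning only that node until the other deadlines pass, which is the loss of load balancing this issue is about.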

I believe attempting to revive only one node is a good strategy, because request failures are strongly correlated when all nodes are in the blacklist, so it greatly reduces the number of failed requests during the downtime. But successful revivals should be just as strongly correlated when all nodes were in the blacklist, and there is a lot to gain from trying (restoring client-side load balancing). So maybe the blacklist should be cleared out rapidly in this case, so that the other nodes can be used again.

Immediately clearing the whole blacklist could cause the next request to go through many failed retries (if only a few nodes of the cluster are actually reachable again), so it is probably too dangerous. A more progressive approach might work: retest only one additional node per request and, if it is not available, immediately fall back to the node that was initially revived. The tested node would then not be a candidate for rapid removal from the dead list again, at least not until another "node revived when all were in the blacklist" event.
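The progressive approach could look roughly like the sketch below. Everything in it is hypothetical: `ProgressiveRevival` and its methods do not exist in the client, and the regular blacklist timeouts are deliberately left out so that only the "one probe per request, then fall back" idea is shown.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: probe one blacklisted node per request after a full-blacklist revival. */
final class ProgressiveRevival {
    private final Set<String> alive = new LinkedHashSet<>();
    private final Deque<String> probeCandidates = new ArrayDeque<>();

    /** Call when {@code revived} answered successfully while every node was blacklisted. */
    void onRevivedFromFullBlacklist(String revived, List<String> stillBlacklisted) {
        alive.clear();
        alive.add(revived);
        probeCandidates.clear();
        probeCandidates.addAll(stillBlacklisted);
    }

    /** Hosts to try for the next request, in order: at most one probe, then the known-alive nodes. */
    List<String> nextAttemptOrder() {
        List<String> order = new ArrayList<>();
        // poll() removes the candidate, so each node gets only one fast retest per event.
        String probe = probeCandidates.poll();
        if (probe != null) {
            order.add(probe);
        }
        order.addAll(alive);
        return order;
    }

    /** A node answered: it can be used by later requests. */
    void onSuccess(String node) {
        alive.add(node);
    }

    /** A node failed: it is not used again here; it would stay on the regular blacklist (not modeled). */
    void onFailure(String node) {
        alive.remove(node);
    }
}
```

With this ordering, a failed probe costs at most one extra attempt per request, because the initially revived node is always next in line.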
Or maybe a system where the remaining blacklist duration of the revived node is subtracted from the durations of all the other nodes. If they all became unreachable at around the same time, they would all soon be cleared from the blacklist. But if the cluster spent a long time with only one available node, and that node becomes unavailable and then available again soon after, it should be the only one cleared quickly from the blacklist.

With exponential backoff, around 10 failed requests are enough to reach the 30-minute limit, so this would be of limited use. Maybe compare the failed attempts per node instead. That count grows very rapidly when all nodes are unreachable, because it is no longer limited by the blacklist timeout (every request tries to revive one node). It should still preserve the original difference, though: the exponential backoff leads the system to rapidly load-balance the revival attempts across all nodes, so a node that was already down keeps its lead of ~50 failed attempts per day of unavailability.
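The subtraction idea and the "around 10 requests" figure can be sketched as follows. The names `BlacklistShift`, `backoffMillis`, and `shiftDeadlines` are hypothetical, and the backoff shape (1 minute times 2^(attempts/2), capped at 30 minutes) is an assumption chosen to be consistent with the numbers quoted above, not a statement about the client's exact formula.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the "subtract the revived node's remaining timeout" idea. */
final class BlacklistShift {
    static final long BASE_MS = 60_000L;      // assumed minimum timeout: 1 minute
    static final long MAX_MS = 30 * 60_000L;  // 30-minute cap mentioned in this issue

    /** Assumed backoff: BASE_MS * 2^(failedAttempts / 2), capped at MAX_MS. */
    static long backoffMillis(int failedAttempts) {
        double raw = BASE_MS * Math.pow(2, failedAttempts * 0.5);
        return (long) Math.min(raw, MAX_MS);
    }

    /** Smallest number of failed attempts whose timeout hits the cap. */
    static int attemptsToReachMax() {
        int attempts = 1;
        while (backoffMillis(attempts) < MAX_MS) {
            attempts++;
        }
        return attempts;
    }

    /**
     * Returns new deadlines for the nodes other than {@code revived}, each moved earlier by
     * the time the revived node still had left on the blacklist at {@code nowMillis}.
     */
    static Map<String, Long> shiftDeadlines(Map<String, Long> deadUntil, String revived, long nowMillis) {
        Long revivedDeadline = deadUntil.get(revived);
        if (revivedDeadline == null) {
            throw new IllegalArgumentException("revived node is not blacklisted: " + revived);
        }
        long remaining = Math.max(0L, revivedDeadline - nowMillis);
        Map<String, Long> shifted = new HashMap<>();
        for (Map.Entry<String, Long> entry : deadUntil.entrySet()) {
            if (!entry.getKey().equals(revived)) {
                shifted.put(entry.getKey(), entry.getValue() - remaining);
            }
        }
        return shifted;
    }
}
```

Under these assumed constants, `attemptsToReachMax()` returns 10. That also shows the limitation: once every node sits at the cap, the deadlines carry little information about which node failed first, which is why comparing failed-attempt counts may be the better signal.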

Note: I reasoned from reading the source code, not from actual events, so I'm not 100% sure this is accurate. See https://github.com/elastic/elasticsearch/blob/main/client/rest/src/main/java/org/elasticsearch/client/RestClient.java#L498-L506
