Exponential backoff for nodes in a panic loop #61886
Labels
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
S-1-stability
Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting
S-1
High impact: many users impacted, serious risk of high unavailability or data loss
T-kv
KV Team
Is your feature request related to a problem? Please describe.
When a node is sick and in a panic loop, that single node can destabilize the whole cluster.
Describe the solution you'd like
It would be nice to have an exponential backoff on sick nodes in the cluster. If the recycle time and frequency were recorded and referenced when a node joins a cluster, some heuristics could be added to pause before accepting ranges, leases, and connections. Basically, wait some amount of time before a node becomes a full member, something like a "PID controller" for cluster admission.
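To make the idea concrete, here is a rough sketch of the backoff heuristic in Python. It is not CockroachDB code: the function name `admission_delay`, its parameters, and all default values are made up for illustration, and it implements plain capped exponential backoff rather than a real PID controller. It assumes restart timestamps are recorded somewhere that survives the crash.

```python
from typing import Sequence


def admission_delay(
    restart_times: Sequence[float],
    now: float,
    *,
    window: float = 600.0,    # hypothetical: only restarts in the last 10 minutes count
    free_restarts: int = 1,   # hypothetical: one recent restart is treated as normal
    base: float = 5.0,        # hypothetical: first penalty, in seconds
    cap: float = 300.0,       # hypothetical: never wait longer than this
) -> float:
    """Return how many seconds a restarting node should wait before it
    accepts ranges, leases, and connections."""
    # Count restarts that fall inside the sliding window ending at `now`.
    recent = sum(1 for t in restart_times if 0 <= now - t <= window)
    excess = recent - free_restarts
    if excess <= 0:
        return 0.0
    # Double the delay for each extra restart; clamp the exponent so the
    # power stays small even for a very long crash history.
    exponent = min(excess - 1, 32)
    return min(cap, base * 2 ** exponent)
```

With these defaults, a node that restarted once in the window is admitted immediately, a node on its third restart waits 10 seconds, and a node stuck in a tight panic loop quickly hits the 300 second cap. Because old restarts fall out of the window, the delay drops back to zero once the node stays up.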
Additional context
This was seen with #61818. When the new binary was deployed to just the sickest node, the cluster became stable pretty quickly.
Jira issue: CRDB-2718