
Exponential backoff for nodes in a panic loop #61886

Open
glennfawcett opened this issue Mar 12, 2021 · 4 comments
Labels
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
S-1-stability: Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting
S-1: High impact: many users impacted, serious risk of high unavailability or data loss
T-kv: KV Team

Comments


glennfawcett commented Mar 12, 2021

Is your feature request related to a problem? Please describe.

When a node is sick and stuck in a panic loop, that single node can destabilize the whole cluster.

Describe the solution you'd like

It would be nice to have an exponential backoff on sick nodes in the cluster. If the recycle time and frequency were recorded and consulted when a node rejoins the cluster, some heuristics could be added to pause before accepting ranges, leases, and connections. Basically, wait some amount of time before a node becomes a full member again, something like a "PID controller" for cluster admission.
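
As a rough sketch of what that could look like (the function, its inputs, and the policy here are all hypothetical, not existing CockroachDB code), a rejoining node could consult how many times it has restarted within a recent window and delay full membership accordingly:

```go
package admission

import (
	"math"
	"time"
)

// joinBackoff computes how long a rejoining node should wait before it
// starts accepting ranges, leases, and SQL connections, based on how many
// times it has restarted within a recent observation window. This is plain
// exponential backoff with a cap; a real implementation could evolve it
// into the PID-style controller suggested above. Hypothetical sketch only.
func joinBackoff(recentRestarts int, base, max time.Duration) time.Duration {
	if recentRestarts <= 1 {
		// First (re)start in the window: join immediately.
		return 0
	}
	// base * 2^(restarts-1), capped at max, so a node panicking in a
	// tight loop quickly ends up waiting the full max before rejoining.
	d := time.Duration(float64(base) * math.Pow(2, float64(recentRestarts-1)))
	if d > max || d < 0 {
		return max
	}
	return d
}
```

With base = 30s and max = 10m, a node on its fourth restart in the window would wait 30s * 2^3 = 4 minutes before becoming a full member, while a node restarting once (e.g. for an upgrade) would not wait at all.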

Additional context

This was seen with #61818. When the new binary was deployed to just the sickest node, the cluster stabilized fairly quickly.

[Screenshots: sick_node_228, sick_node_stabilization]

Jira issue: CRDB-2718

@glennfawcett glennfawcett added the C-enhancement, S-1-stability, and S-1 labels Mar 12, 2021
glennfawcett (Author) commented

To expand upon the concept of node health, we could include performance metrics as part of the acceptance criteria. For instance, if the P99 read latency for a given node rises far above that of the other nodes, it should begin shedding leases and replicas. This would give clusters a 'least latency lease policy' that would keep unhealthy nodes from degrading the overall health of the cluster from an application's point of view. As an acceptance test, a node's resources (CPU, network, IO) should be saturated so as to impact its P99 latency, and the resulting lease transfers observed.
Basically, the goal is to refuse entry of bad actors to the cluster and to reduce workload pressure when latency is high on specific nodes. This is already done with load-based range splitting, but it should be expanded to include performance criteria.
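
A minimal sketch of the shedding predicate (again with hypothetical names, not an existing allocator API), comparing a node's P99 read latency against the cluster-wide median:

```go
package allocator

import "time"

// shouldShedLeases reports whether a node's P99 read latency is so far
// above the cluster-wide median that it should begin transferring leases
// and replicas away. factor is the tolerated multiple, e.g. 3.0 means
// "start shedding once this node is 3x slower than the median node".
// Hypothetical sketch only.
func shouldShedLeases(nodeP99, clusterMedianP99 time.Duration, factor float64) bool {
	if clusterMedianP99 <= 0 {
		// No baseline yet (e.g. the cluster just started): don't shed.
		return false
	}
	return float64(nodeP99) > factor*float64(clusterMedianP99)
}
```

The acceptance test described above would then saturate one node's CPU, network, or IO until this predicate flips, and verify that lease counts on that node drop.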

@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@blathers-crl blathers-crl bot added the T-server-and-security DB Server & Security label Sep 2, 2021
knz (Contributor) commented Sep 13, 2021

@lunevalex I disagree with your triage action. It looks to me like the request is for the replica allocator to avoid flapping nodes. That's a replication project, not server.

lunevalex (Collaborator) commented

Discussed with @knz: there are multiple things that could be done here, in both the KV and Server components.

github-actions (bot) commented

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!
