
Exponential backoff for nodes in a panic loop #61886

Open
glennfawcett opened this issue Mar 12, 2021 · 4 comments
Labels
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
S-1-stability: Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting
S-1: High impact: many users impacted, serious risk of high unavailability or data loss
T-kv: KV Team

Comments


glennfawcett commented Mar 12, 2021

Is your feature request related to a problem? Please describe.

When a node is sick and stuck in a panic loop, that single node can destabilize the whole cluster.

Describe the solution you'd like

It would be nice to have an exponential backoff on sick nodes in the cluster. If the recycle time and frequency were recorded and consulted when a node rejoins the cluster, some heuristics could be added to pause before accepting ranges, leases, and connections. Basically, wait some amount of time before a node becomes a full member again, something like a "PID controller" for cluster admission.
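
As a rough sketch of what that could look like (the function, its inputs, and the policy here are all hypothetical, not existing CockroachDB code), a rejoining node could consult how many times it has restarted within a recent window and delay full membership accordingly:

```go
package admission

import (
	"math"
	"time"
)

// joinBackoff computes how long a rejoining node should wait before it
// starts accepting ranges, leases, and SQL connections, based on how many
// times it has restarted within a recent observation window. This is plain
// exponential backoff with a cap; a real implementation could evolve it
// into the PID-style controller suggested above. Hypothetical sketch only.
func joinBackoff(recentRestarts int, base, max time.Duration) time.Duration {
	if recentRestarts <= 1 {
		// First (re)start in the window: join immediately.
		return 0
	}
	// base * 2^(restarts-1), capped at max, so a node panicking in a
	// tight loop quickly ends up waiting the full max before rejoining.
	d := time.Duration(float64(base) * math.Pow(2, float64(recentRestarts-1)))
	if d > max || d < 0 {
		return max
	}
	return d
}
```

With base = 30s and max = 10m, a node on its fourth restart in the window would wait 30s * 2^3 = 4 minutes before becoming a full member, while a node restarting once (e.g. for an upgrade) would not wait at all.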

Additional context

This was seen with #61818. When the new binary was deployed to just the sickest node, the cluster stabilized fairly quickly.

[Screenshots: sick_node_228, sick_node_stabilization]

Jira issue: CRDB-2718

@glennfawcett glennfawcett added the C-enhancement, S-1-stability, and S-1 labels Mar 12, 2021
glennfawcett (Author) commented

To expand upon the concept of node health, we could include performance metrics as part of the acceptance criteria. For instance, if the P99 read latency for a given node rises far above that of the other nodes, it should begin shedding leases and replicas. This would give clusters a 'least latency lease policy' that would keep unhealthy nodes from degrading the overall health of the cluster from an application's point of view. As an acceptance test, a node's resources (CPU, network, IO) should be saturated so as to impact its P99 latency, and the resulting lease transfers observed.
Basically, the goal is to refuse entry of bad actors to the cluster and to reduce workload pressure when latency is high on specific nodes. This is already done with load-based range splitting, but it should be expanded to include performance criteria.
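
A minimal sketch of the shedding predicate (again with hypothetical names, not an existing allocator API), comparing a node's P99 read latency against the cluster-wide median:

```go
package allocator

import "time"

// shouldShedLeases reports whether a node's P99 read latency is so far
// above the cluster-wide median that it should begin transferring leases
// and replicas away. factor is the tolerated multiple, e.g. 3.0 means
// "start shedding once this node is 3x slower than the median node".
// Hypothetical sketch only.
func shouldShedLeases(nodeP99, clusterMedianP99 time.Duration, factor float64) bool {
	if clusterMedianP99 <= 0 {
		// No baseline yet (e.g. the cluster just started): don't shed.
		return false
	}
	return float64(nodeP99) > factor*float64(clusterMedianP99)
}
```

The acceptance test described above would then saturate one node's CPU, network, or IO until this predicate flips, and verify that lease counts on that node drop.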

@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@blathers-crl blathers-crl bot added the T-server-and-security DB Server & Security label Sep 2, 2021
knz (Contributor) commented Sep 13, 2021

@lunevalex I disagree with your triage action. It looks to me like the request is for the replica allocator to avoid flapping nodes. That's a replication project, not server.

lunevalex (Collaborator) commented

Discussed with @knz: there are multiple things that could be done here, in both the KV and Server components.

github-actions (bot) commented

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!
