
kv: add throttling for background GC operations based on store health #57248

Closed
sumeerbhola opened this issue Nov 30, 2020 · 3 comments
Labels
A-admission-control
A-kv-replication (Relating to Raft, consensus, and coordination)
C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are an exception)

Comments

@sumeerbhola
Collaborator

sumeerbhola commented Nov 30, 2020

There is no throttling of GC based on store health, and we've seen situations where removal of the protected timestamp due to cancellation of stuck backups has caused a GC spike and overloaded the LSM stores of a few nodes. This required significant manual intervention to restore the cluster to a healthy state, and caused customer unhappiness.

We should throttle the proposals generated by the gcQueue based on the health of all the replica stores of a range. There is a concern that too much throttling could itself tip the stores into a different form of unhealthiness, with too many versions of a key. I think it is ok to set the default throttling to allow for moderate overload, as is done for ingestDelayL0Threshold (which is used when adding sstables).
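To make the idea concrete, here is a minimal Go sketch of this kind of throttling, under stated assumptions: `StoreHealth`, `gcThrottleDelay`, and every constant below are hypothetical illustrations rather than actual CockroachDB code, and the threshold only mirrors the spirit of ingestDelayL0Threshold by tolerating moderate overload before delaying GC proposals.

```go
package main

import (
	"fmt"
	"time"
)

// StoreHealth is a hypothetical per-replica health summary, e.g. derived
// from the shape of that store's LSM.
type StoreHealth struct {
	L0FileCount int
}

// gcThrottleDelay returns how long the gcQueue should wait before issuing
// its next GC proposal, based on the least healthy store among the range's
// replicas. Below the threshold there is no delay (moderate overload is
// tolerated); past it, the delay grows linearly up to a cap so GC is
// slowed but never starved.
func gcThrottleDelay(replicas []StoreHealth) time.Duration {
	const (
		l0Threshold = 20                     // illustrative: allow moderate overload
		perFileStep = 100 * time.Millisecond // illustrative: backoff per excess L0 file
		maxDelay    = 5 * time.Second        // cap so garbage still gets collected
	)
	worst := 0
	for _, h := range replicas {
		if h.L0FileCount > worst {
			worst = h.L0FileCount
		}
	}
	if worst <= l0Threshold {
		return 0
	}
	delay := time.Duration(worst-l0Threshold) * perFileStep
	if delay > maxDelay {
		delay = maxDelay
	}
	return delay
}

func main() {
	replicas := []StoreHealth{{L0FileCount: 5}, {L0FileCount: 35}, {L0FileCount: 12}}
	fmt.Println("delay before next GC proposal:", gcThrottleDelay(replicas))
}
```

Capping the delay is one way to keep throttling from tipping stores into the other form of unhealthiness mentioned above (too many versions accumulating).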

This issue relates to #57247 , which also needs a store health signal for all replicas.

Jira issue: CRDB-2846

sumeerbhola added the C-enhancement and A-kv-replication labels on Nov 30, 2020
@nvanbenschoten
Member

How much of this would be needed if we get to a point where GC is performed through compaction as opposed to point operations? I ask because making decisions "based on the health of all the replica stores of a range" seems hard to get right, especially as our primary defense mechanism. We not only need to formalize what "health" means in this context, but we also need to deal with the staleness of this information. And then we need to worry about potential fallout due to starved GC, as you mention.

I ask all this because I'm interested in how much of this we can push off if we instead make another pass at dramatically reducing the cost of MVCC GC through a change like #42514. Do you have thoughts about that?

@sumeerbhola
Collaborator Author

How much of this would be needed if we get to a point where GC is performed through compaction as opposed to point operations?

Maybe not needed then, though we will still need a fallback GC path that works as it does now (though it probably won't find much garbage) and will need to run at each replica (so it can throttle itself based on local information). But I don't know when that will be worked on. It seems complicated, and there are a number of technical details still to be worked out, as outlined in #57260.

I ask because making decisions "based on the health of all the replica stores of a range" seems hard to get right, especially as our primary defense mechanism. We not only need to formalize what "health" means in this context, but we also need to deal with the staleness of this information.

I was expecting something very simple wrt health -- say bytes in L0 and Lbase relative to the rest of the LSM (we would also converge this signal with whatever we currently use for import).
Regarding staleness, what is the staleness for our current gossip? My understanding is that these LSMs get unhealthy over the timescale of many minutes (or longer).
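As a rough illustration of that kind of signal (not the signal CockroachDB actually gossips), the score below is simply the fraction of total LSM bytes sitting in L0 and Lbase; `lsmLevelBytes` and `storeHealthScore` are hypothetical names introduced only for this sketch.

```go
package main

import "fmt"

// lsmLevelBytes holds hypothetical per-level byte counts for a store's LSM;
// index 0 is L0 and the last index is the bottommost level.
type lsmLevelBytes []int64

// storeHealthScore returns the fraction of total LSM bytes in L0 and Lbase,
// where Lbase is approximated as the first non-empty level below L0. A high
// fraction suggests the store is falling behind on compactions; the same
// signal could be converged with whatever is used to pace imports.
func storeHealthScore(levels lsmLevelBytes) float64 {
	var total int64
	for _, b := range levels {
		total += b
	}
	if total == 0 {
		return 0
	}
	upper := levels[0]
	for i := 1; i < len(levels); i++ {
		if levels[i] > 0 {
			upper += levels[i] // Lbase: first non-empty level below L0
			break
		}
	}
	return float64(upper) / float64(total)
}

func main() {
	// Healthy store: almost all bytes live in the lower levels.
	healthy := lsmLevelBytes{10 << 20, 50 << 20, 500 << 20, 5 << 30}
	// Overloaded store: a large backlog stuck in L0 and Lbase.
	overloaded := lsmLevelBytes{2 << 30, 1 << 30, 500 << 20, 5 << 30}
	fmt.Printf("healthy score:    %.2f\n", storeHealthScore(healthy))
	fmt.Printf("overloaded score: %.2f\n", storeHealthScore(overloaded))
}
```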

@sumeerbhola
Collaborator Author

We now have replication admission control, which applies to elastic traffic (which this is), and better compaction heuristics for *DELs etc., so I'm closing this as done.
