kv: add throttling for background GC operations based on store health #57248
Labels
A-admission-control
A-kv-replication
Relating to Raft, consensus, and coordination.
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
There is currently no throttling of GC based on store health. We've seen situations where removal of the protected timestamp, due to cancellation of stuck backups, caused a GC spike that overloaded the LSM store of a few nodes. Restoring the cluster to a healthy state required significant manual intervention and left the customer unhappy.
We should be throttling the proposals generated by the gcQueue based on the health of all the replica stores of a range. There is a concern that too much throttling could itself tip the stores into a different form of unhealthiness, with too many versions of a key. I think it is ok to set the default throttling to allow for moderate overload, like it is for ingestDelayL0Threshold (which is used when adding sstables).
This issue relates to #57247, which also needs a store health signal for all replicas.
Jira issue: CRDB-2846