Description
GC is not throttled based on store health. We've seen situations where cancelling stuck backups removed their protected timestamps, which caused a GC spike that overloaded the LSM stores on a few nodes. Restoring the cluster to a healthy state required significant manual intervention and left the customer unhappy.
We should be throttling the proposals generated by the `gcQueue` based on the health of all the replica stores of a range. There is a concern that too much throttling could itself tip the stores into a different form of unhealthiness, with too many versions of a key. I think it is ok to set the default throttling to allow for moderate overload, like it is for `ingestDelayL0Threshold` (which is used when adding sstables).
This issue relates to #57247, which also needs a store health signal for all replicas.
Jira issue: CRDB-2846