kv: add throttling for background GC operations based on store health #57248

@sumeerbhola

There is no throttling of GC based on store health. We've seen situations where removing the protected timestamp, after stuck backups were cancelled, caused a GC spike that overloaded the LSM stores of a few nodes. Restoring the cluster to a healthy state required significant manual intervention, and the incident caused customer unhappiness.

We should be throttling the proposals generated by the gcQueue based on the health of all the replica stores of a range. There is a concern that too much throttling could itself tip the stores into a different form of unhealthiness, with too many versions of a key. I think it is ok to set the default throttling to allow for moderate overload, like it is for ingestDelayL0Threshold (which is used when adding sstables).

This issue relates to #57247, which also needs a store health signal for all replicas.

Jira issue: CRDB-2846

Labels: A-admission-control; A-kv-replication (Relating to Raft, consensus, and coordination.); C-enhancement (Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception))