Description
GC of old versions requires reading each version and issuing point deletes to the storage layer to delete them individually (we may start using range tombstones in #51230). Since this is costly, we run GC periodically, when the "score", which incorporates the fraction of bytes in a replica that are dead, is high enough. The lack of incremental GC means that individual keys can accumulate a lot of garbage, which can hurt read performance.
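As a rough illustration of the periodic trigger (not the actual CockroachDB scoring logic; `replicaStats` and `gcScore` are hypothetical names), a score driven by the dead-byte fraction might look like:

```go
// Hypothetical sketch of a score-based periodic GC trigger. The real
// scoring logic considers more signals; this only shows the shape of
// a trigger driven by the fraction of dead bytes in a replica.
package main

import (
	"fmt"
	"time"
)

type replicaStats struct {
	LiveBytes  int64
	TotalBytes int64
	LastGC     time.Time
}

// gcScore grows with the dead-byte fraction and with the time since the
// last GC. A score >= 1 would trigger a periodic GC pass that scans the
// replica and issues point deletes for non-live versions.
func gcScore(s replicaStats, now time.Time) float64 {
	if s.TotalBytes == 0 {
		return 0
	}
	deadFraction := float64(s.TotalBytes-s.LiveBytes) / float64(s.TotalBytes)
	hoursSinceGC := now.Sub(s.LastGC).Hours()
	// Entirely illustrative weighting.
	return deadFraction * (1 + hoursSinceGC/24)
}

func main() {
	s := replicaStats{
		LiveBytes:  40 << 20,
		TotalBytes: 100 << 20,
		LastGC:     time.Now().Add(-48 * time.Hour),
	}
	fmt.Printf("score = %.2f\n", gcScore(s, time.Now()))
}
```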
It would be better if most of the GC work happened as a side effect of storage-level compactions, which take as input a subset of the sstables in the LSM. Periodic GC would still be used to clean up versions that did not get cleaned up incrementally in compactions.
In #42514, we attempted to prototype GC in compactions.
There are some challenges to achieving this in the higher-level context of CockroachDB:
- MVCCStats maintenance: MVCCStats include information about non-live bytes, and GC in compactions will make these inaccurate. We could do one of the following:
- completely overhaul the MVCCStats machinery
- keep the periodic scan that computes new MVCCStats, add a compaction-GC-timestamp-threshold, and replicate both through Raft (suggested by @ajwerner). The timestamp threshold allows compactions to GC all non-live versions with a timestamp below the threshold (a sketch of such a compaction-time filter appears after this list). Since this is a non-incremental scheme, we are still subject to too much garbage accumulation for individual keys, so this may be ok as a first step, but not necessarily as the final solution.
- We expect all replicas to have identical replicated state in the store: replica consistency checking is important for discovering bugs, and sometimes for mitigating the impact of bugs that may accidentally creep into production deployments. GC in compactions will make the replicas diverge. The compaction-GC-timestamp-threshold approach above fixes this, in that the read for consistency checking can hide all non-live data below the timestamp threshold. Alternatively, consistency checking could hide all non-live versions altogether.
- Correctly identifying which values (SET) in the compaction have not been deleted (DEL) in sstables not involved in the compaction. See the discussion thread at #42514 (review), "[WIP] storage/engine: attempt to do best-effort GC in compactions (fa…", for details. There are two (not fleshed out) approaches:
- make different MVCC versions be overlapping in the file key space. This ensures that if one sees, for example, `a@30#100.SET, a@20#90.SET, a@10#80.SET` in the compaction input, there isn't an `a@20#95.DEL` lurking in another sstable that is not part of this compaction. The cons of this approach are (a) the possibility of higher write amplification, and (b) very big sstables (if there are too many versions of a key -- though we should be able to GC most of them, even if there is an old protected timestamp, so this may not be an issue).
- pass down a promise (structured as a range) that no such `a` key with timestamp <= 20 exists in a higher sstable (suggested by @nvanbenschoten); a rough sketch of such a promise appears after this list. This idea needs more refinement, since propagation of the promise would require the key range of the promise to be included in the sstable key range (like range tombstones), which may itself cause unnecessarily wide compactions and higher write amplification. And if the file key range caused by the promise is considered in the read path, we may unnecessarily read an sstable where there is no relevant data, which would increase read amplification.
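For the compaction-GC-timestamp-threshold idea above, a minimal sketch of what a compaction-time filter could do, assuming the threshold has already been replicated through Raft and that the compaction sees all versions of a user key in newest-to-oldest order (the names `version` and `keepVersions` are invented for illustration, not an existing Pebble or CockroachDB API):

```go
// Hypothetical compaction-time GC filter built around a replicated
// compaction-GC-timestamp-threshold.
package main

import "fmt"

type version struct {
	UserKey string
	Ts      int64 // simplified MVCC timestamp
	Deleted bool  // an MVCC deletion tombstone
}

// keepVersions returns the versions that survive GC at the given
// threshold: everything newer than the threshold is kept, and at or
// below the threshold only the most recent version is kept (and only
// if it is not itself a deletion).
func keepVersions(versions []version, threshold int64) []version {
	var out []version
	seenAtOrBelowThreshold := false
	for _, v := range versions { // newest to oldest
		switch {
		case v.Ts > threshold:
			out = append(out, v)
		case !seenAtOrBelowThreshold:
			seenAtOrBelowThreshold = true
			if !v.Deleted {
				// Newest version at/below the threshold is still live.
				out = append(out, v)
			}
		default:
			// Older, shadowed version below the threshold: garbage.
		}
	}
	return out
}

func main() {
	vs := []version{{"a", 30, false}, {"a", 20, false}, {"a", 10, false}}
	fmt.Println(keepVersions(vs, 25)) // keeps a@30 and a@20, drops a@10
}
```

The same threshold could serve the consistency-checking item above: the checksum read would skip every non-live version at or below the threshold, so replicas that have and have not yet compacted that garbage still produce identical checksums.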
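For the second approach, a rough sketch of what a range-scoped promise might look like; this is purely hypothetical since the idea is not fleshed out, and the `gcPromise` type and `covers` method are invented for illustration:

```go
// Hypothetical representation of the range-scoped "promise": a guarantee,
// passed down to a compaction, that no version of any key in [Start, End)
// with timestamp <= Ts exists in any sstable above the compaction's
// inputs. A compaction filter would only treat a version as shadowed (and
// hence GC-able) if a promise covers it; otherwise a newer DEL or SET
// could be lurking in an sstable that is not part of the compaction.
package main

import "fmt"

type gcPromise struct {
	Start, End string // user key range [Start, End)
	Ts         int64  // no versions with timestamp <= Ts exist above
}

// covers reports whether the promise applies to a version of userKey at ts.
func (p gcPromise) covers(userKey string, ts int64) bool {
	return userKey >= p.Start && userKey < p.End && ts <= p.Ts
}

func main() {
	p := gcPromise{Start: "a", End: "b", Ts: 20}
	// Versions of "a" at timestamp <= 20 can be reasoned about safely
	// within the compaction; a@25 cannot.
	fmt.Println(p.covers("a", 20), p.covers("a", 25)) // true false
}
```

As noted in the item above, propagating such promises would widen sstable key ranges much like range tombstones do, which is where the write and read amplification concerns come from.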
Jira issue: CRDB-2843
Epic CRDB-2566