Description
The SINGLEDEL optimization for separated intents is used for removing the intent when possible. It is important since:
- A combination of `SET=>SINGLEDEL` (SET followed by/happens before SINGLEDEL) will result in both disappearing from Pebble when they meet in a compaction. In contrast, `SET=>DEL` will cause the SET to disappear, but the DEL will typically fall all the way to L6 before being elided.
- Unlike interleaved intents, where we reused the same key for intents (`<key>@0`), separated intents use a different key for each txn. This is done to allow for lock contention reduction in the future. With interleaved intents we had `<key>@0.SET=><key>@0.DEL=><key>@0.SET=><key>@0.DEL`, so even if the latest DEL had to fall all the way to L6, the older DELs would vanish because of the newer SET (this is potentially important for keys with high write rates). With separated intents and high write rates, where each set and delete pair will be a different key, the SINGLEDEL optimization is important for not resulting in more garbage relative to interleaved intents.
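The garbage argument above can be illustrated with a toy model. This is only a sketch, not Pebble code: `compactKey` and `garbage` are hypothetical helpers, the compaction is assumed to be above the bottom level and to see every listed operation for a key, and snapshots are ignored.

```go
package main

import "fmt"

// Kind is the type of a toy LSM operation on a single key.
type Kind int

const (
	Set Kind = iota
	Del
	SingleDel
)

// compactKey models a non-bottom-level compaction that sees all the given
// operations for one key, ordered oldest to newest, and returns what survives.
func compactKey(ops []Kind) []Kind {
	if len(ops) == 0 {
		return nil
	}
	last := len(ops) - 1
	if ops[last] == SingleDel {
		if last >= 1 && ops[last-1] == Set {
			// SET=>SINGLEDEL: both entries vanish; older entries are
			// then compacted on their own.
			return compactKey(ops[:last-1])
		}
		// Other cases are not modeled here; keep the tombstone.
		return []Kind{SingleDel}
	}
	// The newest SET or DEL shadows everything older. A DEL must itself be
	// kept because older data may still exist in lower levels.
	return []Kind{ops[last]}
}

// garbage counts the entries that survive compaction across all keys.
func garbage(keys [][]Kind) int {
	n := 0
	for _, ops := range keys {
		n += len(compactKey(ops))
	}
	return n
}

func main() {
	const txns = 3
	// Interleaved intents: every txn reuses the same <key>@0.
	interleaved := [][]Kind{{Set, Del, Set, Del, Set, Del}}
	// Separated intents: one key per txn.
	var sepDel, sepSingleDel [][]Kind
	for i := 0; i < txns; i++ {
		sepDel = append(sepDel, []Kind{Set, Del})
		sepSingleDel = append(sepSingleDel, []Kind{Set, SingleDel})
	}
	fmt.Println(garbage(interleaved), garbage(sepDel), garbage(sepSingleDel))
	// Prints: 1 3 0
}
```

In this model, separated intents resolved with DEL leave one tombstone per txn, interleaved intents leave a single tombstone, and separated intents resolved with SINGLEDEL leave nothing.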
The optimization is performed by tracking a `TxnDidNotUpdateMeta` bool in the intent's `MVCCMetadata` proto, which defaults to false for legacy code, and starts as true for non-legacy code. If the `MVCCMetadata` is updated using another SET, this value transitions to false. When resolving the intent we use SINGLEDEL if this value is true, else DEL. There are two correctness assumptions made here:
- [A1] The `MVCCMetadata` for a txn is never deleted and recreated during the lifetime of a transaction. This is important since the history of `TxnDidNotUpdateMeta` would be lost if it were deleted and recreated.
- [A2] CockroachDB range drops (due to the range being moved to another node) are done using a RANGEDEL across the whole range and not using individual DELs for the data to be removed. So we can get a sequence of `SET=>RANGEDEL=>SET=>SINGLEDEL` if the range is removed before the intent is resolved, then added back, and then the intent is resolved. If the RANGEDEL were replaced by a DEL, we would have the bug described below.
Assumption [A1] is violated by intent resolution for a non-finalized txn when (a) rolled-back savepoints cause the intent to be removed, or (b) the txn epoch has been incremented and the intent is from an older epoch.
Due to this violation we can see sequences like `SET=>SET=>DEL=>SET=>SINGLEDEL`.
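The flag tracking, and how deleting and recreating the metadata loses its history, can be sketched as follows. This is a minimal model, not the CockroachDB implementation: `intentMeta`, `engine`, `putIntent` and `removeIntent` are hypothetical names, and only the `TxnDidNotUpdateMeta` field is modeled.

```go
package main

import (
	"fmt"
	"strings"
)

// intentMeta is a hypothetical stand-in for the intent's MVCCMetadata proto.
type intentMeta struct {
	TxnDidNotUpdateMeta bool
}

// engine is a toy store for one intent key that records the operations
// it issues against the LSM.
type engine struct {
	meta *intentMeta // nil when no intent exists
	ops  []string    // operations issued, oldest first
}

// putIntent writes or rewrites the intent with a SET.
func (e *engine) putIntent() {
	if e.meta == nil {
		// First write by non-legacy code: the flag starts as true.
		e.meta = &intentMeta{TxnDidNotUpdateMeta: true}
	} else {
		// The metadata is updated using another SET.
		e.meta.TxnDidNotUpdateMeta = false
	}
	e.ops = append(e.ops, "SET")
}

// removeIntent deletes the intent, using SINGLEDEL only when the flag says
// the metadata was never updated.
func (e *engine) removeIntent() {
	if e.meta == nil {
		return
	}
	op := "DEL"
	if e.meta.TxnDidNotUpdateMeta {
		op = "SINGLEDEL"
	}
	e.ops = append(e.ops, op)
	e.meta = nil // the flag's history is lost here
}

// history renders the issued operations in the notation used in this issue.
func (e *engine) history() string {
	return strings.Join(e.ops, "=>")
}

func main() {
	// The intended case: one SET, resolved with SINGLEDEL.
	simple := &engine{}
	simple.putIntent()
	simple.removeIntent()
	fmt.Println(simple.history()) // SET=>SINGLEDEL

	// [A1] violated: the intent is removed while the txn is still live,
	// then the same txn writes it again.
	e := &engine{}
	e.putIntent()
	e.putIntent()
	e.removeIntent()
	e.putIntent()
	e.removeIntent()
	fmt.Println(e.history()) // SET=>SET=>DEL=>SET=>SINGLEDEL
}
```

The second run chooses SINGLEDEL because the recreated metadata starts with the flag set to true again, even though older SETs for the same key are still in the LSM.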
To understand why this causes a problem, note that compactions only see a contiguous subsequence of the sequence of operations on a key; they can be missing both newer and older operations. A compaction can see the `DEL=>SET`, and the SET will consume the DEL. The reasoning is that the SET is either (a) the current latest SET in the LSM, in which case the DEL is no longer relevant, or (b) something later has deleted the whole key, in which case the DEL is also no longer needed. The SINGLEDEL violates (b), since it deletes only the most recent SET and not the whole key. The result of this compaction will cause the LSM to have `SET=>SET=>SET=>SINGLEDEL`. And if the latest SET and the SINGLEDEL later meet in a compaction, both will vanish. Now all the deleted SETs from before will reappear incorrectly.
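The two compactions described above can be replayed on a toy single-key LSM. This is a sketch under simplifying assumptions, not Pebble's compaction iterator: `compact`, `compactRange` and `read` are hypothetical helpers, the values `v1`..`v3` are made up for illustration, and snapshots and levels are ignored.

```go
package main

import "fmt"

type kind int

const (
	set kind = iota
	del
	singleDel
)

// op is one operation on the key; value is only meaningful for set.
type op struct {
	kind  kind
	value string
}

// compact merges a contiguous run of operations (oldest first) that a single
// compaction happens to see.
func compact(in []op) []op {
	if len(in) == 0 {
		return nil
	}
	last := len(in) - 1
	switch in[last].kind {
	case set, del:
		// The newest SET or DEL consumes everything older in the input.
		return []op{in[last]}
	default:
		if last >= 1 && in[last-1].kind == set {
			// SINGLEDEL meets a SET: both vanish.
			return compact(in[:last-1])
		}
		// Other cases are not modeled; leave the input untouched.
		return in
	}
}

// compactRange replaces lsm[i:j] with its compacted form.
func compactRange(lsm []op, i, j int) []op {
	out := append([]op{}, lsm[:i]...)
	out = append(out, compact(lsm[i:j])...)
	return append(out, lsm[j:]...)
}

// read returns the visible value: the newest operation decides.
func read(lsm []op) (string, bool) {
	if len(lsm) == 0 {
		return "", false
	}
	newest := lsm[len(lsm)-1]
	if newest.kind == set {
		return newest.value, true
	}
	return "", false
}

func main() {
	// SET=>SET=>DEL=>SET=>SINGLEDEL
	lsm := []op{{set, "v1"}, {set, "v2"}, {del, ""}, {set, "v3"}, {singleDel, ""}}
	_, ok := read(lsm)
	fmt.Println(ok) // false: the key is deleted

	lsm = compactRange(lsm, 2, 4) // sees DEL=>SET; now SET=>SET=>SET=>SINGLEDEL
	_, ok = read(lsm)
	fmt.Println(len(lsm), ok) // 4 false

	lsm = compactRange(lsm, 2, 4) // sees SET=>SINGLEDEL; both vanish
	v, ok := read(lsm)
	fmt.Println(v, ok) // v2 true: a deleted SET has reappeared
}
```

Both compactions are locally reasonable; the resurrection comes from the first one dropping the DEL on the assumption that any later deletion covers the whole key.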
List of things we need to do to fix this:
- Pebble unit test to demonstrate above expected behavior @sumeerbhola (db: add test demonstrating current SINGLEDEL behavior, pebble#1252)
- CockroachDB unit test that reproduces bug (storage: add randomized test to trigger intent single deletion bug, #69902)
  - CockroachDB randomized unit test that reproduces bug under all possible intent resolution scenarios @sumeerbhola
- Confirmation that roachtest failures seen in #69414 (roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [replica inconsistency]) are due to SINGLEDEL, and narrow down what is causing inappropriate use of SINGLEDEL: confirmed based on no failure when reducing the scope of the SINGLEDEL optimization. Workload and code inspection suggested the bug is because of txn epoch being bumped up.
- CockroachDB workaround to narrow use of SINGLEDEL optimization (storage: narrow down use of SingleDel to avoid anomalies, #69923)
- CockroachDB correctness when migrating from 21.1 to 21.2 (storage: override MVCCMetadata.TxnDidNotUpdateMeta in mixed version c…, #70267)
- Range movement or any other range related operations may also trigger this bug -- see storage: bug in single delete optimization for separated intents #69891 (comment)
  - Add KV test that stresses range movement while having unresolved intents, and see if intentInterleavingIter operating over the whole global key space finds any inconsistency.
  - Fix the code to use range delete instead of delete for 21.2 beta if the Pebble change is not a release blocker.
- Pebble change to make SINGLEDEL semantics robust to sequences like `SET=>DEL=>SET=>SINGLEDEL` and `SET=>SINGLEDEL=>SET=>SINGLEDEL`.
  - Create issue with solution sketch and correctness proof (db: more deterministic SingleDelete semantics for Set, pebble#1255)
  - Code changes @nicktrav