storage: bug in single delete optimization for separated intents #69891

Closed
@sumeerbhola

Description

The SINGLEDEL optimization for separated intents removes the intent with a SINGLEDEL instead of a DEL when possible. It is important for two reasons:

  • A SET=>SINGLEDEL pair (SET followed by, i.e. happening before, SINGLEDEL) results in both disappearing from Pebble when they meet in a compaction. In contrast, SET=>DEL causes the SET to disappear, but the DEL typically falls all the way to L6 before being elided.
  • Unlike interleaved intents, where we reused the same key for intents (<key>@0), separated intents use a different key for each txn. This is done to allow for lock contention reduction in the future. With interleaved intents we had <key>@0.SET=><key>@0.DEL=><key>@0.SET=><key>@0.DEL, so even if the latest DEL had to fall all the way to L6, the older DELs would vanish because of the newer SET (this is potentially important for keys with high write rates). With separated intents and high write rates, where each set and delete pair will be a different key, the SINGLEDEL optimization is important for not resulting in more garbage relative to interleaved intents.
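
The garbage argument in the two points above can be illustrated with a toy model. This is only a sketch, not Pebble's compaction code: `compactKey` and `garbage` are hypothetical helpers, and the model assumes a non-bottom-level compaction that sees the full history of one key with no snapshots open.

```go
package main

import "fmt"

type kind int

const (
	set kind = iota
	del
	singleDel
)

// compactKey models a compaction above L6 that sees the whole history of a
// single key (oldest first) and returns how many entries survive.
// Toy rules: a SINGLEDEL directly above a SET removes both; a SET or DEL
// shadows everything older but itself survives.
func compactKey(h []kind) int {
	survivors := 0
	for i := len(h) - 1; i >= 0; i-- {
		if h[i] == singleDel && i > 0 && h[i-1] == set {
			i-- // skip the SET consumed by the SINGLEDEL
			continue
		}
		survivors++
		if h[i] != singleDel {
			break // SET or DEL shadows all older entries
		}
	}
	return survivors
}

// garbage returns the entries left after txns transactions each wrote and
// resolved one intent, for the three schemes discussed above.
func garbage(txns int) (interleaved, separatedDel, separatedSingleDel int) {
	var sameKey []kind // interleaved: every txn reuses <key>@0
	for i := 0; i < txns; i++ {
		sameKey = append(sameKey, set, del)
	}
	interleaved = compactKey(sameKey)
	// Separated: each txn has its own key, so each key is compacted alone.
	separatedDel = txns * compactKey([]kind{set, del})
	separatedSingleDel = txns * compactKey([]kind{set, singleDel})
	return
}

func main() {
	fmt.Println(garbage(1000)) // prints "1 1000 0"
}
```

In this model, separated intents resolved with DEL leave one tombstone per txn above the bottom level, while SINGLEDEL leaves none.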

The optimization is performed by tracking a TxnDidNotUpdateMeta bool in the intent's MVCCMetadata proto, which defaults to false for legacy code, and starts as true for non-legacy code. If the MVCCMetadata is updated using another SET, this value transitions to false. When resolving the intent we use SINGLEDEL if this value is true, else DEL. There are two correctness assumptions made here:

  • [A1] The MVCCMetadata for a txn is never deleted and recreated during the lifetime of a transaction. This is important since the history of TxnDidNotUpdateMeta would be lost if it were deleted and recreated.
  • [A2] CockroachDB range drops (due to the range being moved to another node) are done using a RANGEDEL across the whole range and not using individual DELs for the data to be removed. So we can get a sequence of: SET=>RANGEDEL=>SET=>SINGLEDEL if the range is removed before the intent is resolved, then added back, and then the intent is resolved. If the RANGEDEL were replaced by a DEL, we would have the bug described below.
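
The flag handling described above, and why [A1] matters, can be sketched as follows. This is a minimal illustration rather than the CockroachDB code: `intentMeta`, `putIntent`, and `resolveOp` are hypothetical names, and only the `TxnDidNotUpdateMeta` field comes from the description.

```go
package main

import "fmt"

// intentMeta is a hypothetical stand-in for the intent's MVCCMetadata proto.
// The zero value (false) matches the default seen for legacy code.
type intentMeta struct {
	TxnDidNotUpdateMeta bool
}

// putIntent models writing the intent with a SET. prev is nil when no intent
// for this txn currently exists; any rewrite of an existing intent clears
// the flag.
func putIntent(prev *intentMeta) *intentMeta {
	return &intentMeta{TxnDidNotUpdateMeta: prev == nil}
}

// resolveOp picks the operation used to remove the intent on resolution.
func resolveOp(m *intentMeta) string {
	if m.TxnDidNotUpdateMeta {
		return "SINGLEDEL"
	}
	return "DEL"
}

func main() {
	once := putIntent(nil)
	fmt.Println(resolveOp(once)) // SINGLEDEL: exactly one SET was written

	twice := putIntent(once)
	fmt.Println(resolveOp(twice)) // DEL: the metadata was updated by a second SET

	// If the intent is deleted and then recreated, the writer sees no
	// previous metadata, so the flag starts as true again even though
	// older SETs may still exist in the LSM. This is what [A1] rules out.
	recreated := putIntent(nil)
	fmt.Println(resolveOp(recreated)) // SINGLEDEL
}
```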

Assumption [A1] is violated by intent resolution for a non-finalized txn when either (a) the txn has rolled back savepoints, which causes the intent to be removed, or (b) the txn epoch has been incremented and the intent is from an older epoch.
Due to this violation, we can see sequences like SET=>SET=>DEL=>SET=>SINGLEDEL.
To understand why this causes a problem, note that compactions only see a contiguous subsequence of the sequence of operations on a key: they can be missing both newer and older operations. A compaction can see the DEL=>SET, and the SET will consume the DEL. The reasoning is that either (a) it is the current latest SET in the LSM, in which case the DEL is no longer relevant, or (b) something later has deleted the whole key, in which case the DEL is also no longer needed. The SINGLEDEL violates (b), because it removes only the most recent SET rather than the whole key. After this compaction the LSM has SET=>SET=>SET=>SINGLEDEL. If the latest SET and the SINGLEDEL later meet in a compaction, both will vanish, and the older SETs that had been deleted will incorrectly reappear.
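
The failure sequence above can be replayed with a toy model of a single key's history. This is a sketch under simplifying assumptions (no snapshots, one key, hypothetical `op`, `compact`, and `get` helpers), not Pebble's actual compaction logic.

```go
package main

import "fmt"

type kind int

const (
	set kind = iota
	del
	singleDel
)

type op struct {
	kind  kind
	value string
}

// compact rewrites the contiguous window h[lo:hi] of a key's history
// (oldest first) using toy rules: a SINGLEDEL directly above a SET removes
// both; a SET or DEL survives and shadows everything older in the window.
func compact(h []op, lo, hi int) []op {
	var kept []op // survivors of the window, newest first
	for i := hi - 1; i >= lo; i-- {
		if h[i].kind == singleDel && i-1 >= lo && h[i-1].kind == set {
			i-- // the SET is consumed together with the SINGLEDEL
			continue
		}
		kept = append(kept, h[i])
		if h[i].kind != singleDel {
			break // SET or DEL shadows older entries in the window
		}
	}
	out := append([]op{}, h[:lo]...)
	for i := len(kept) - 1; i >= 0; i-- {
		out = append(out, kept[i])
	}
	return append(out, h[hi:]...)
}

// get returns the value visible to a reader: the newest op decides.
func get(h []op) (string, bool) {
	if len(h) == 0 || h[len(h)-1].kind != set {
		return "", false
	}
	return h[len(h)-1].value, true
}

func main() {
	// SET=>SET=>DEL=>SET=>SINGLEDEL
	h := []op{{set, "v1"}, {set, "v2"}, {del, ""}, {set, "v3"}, {singleDel, ""}}
	_, ok := get(h)
	fmt.Println(ok) // false: the intent is deleted

	h = compact(h, 2, 4) // sees only DEL=>SET; the SET consumes the DEL
	h = compact(h, 2, 4) // sees only SET=>SINGLEDEL; both vanish

	v, ok := get(h)
	fmt.Println(v, ok) // v2 true: a previously deleted SET has reappeared
}
```

Replacing the final SINGLEDEL with a DEL in this model keeps the tombstone in place after the second compaction, so the older SETs stay hidden.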

List of things we need to do to fix this:

Labels

  • A-kv-transactions: Relating to MVCC and the transactional model.
  • A-storage: Relating to our storage engine (Pebble) on-disk storage.
  • C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
