perf: separate latest and old MVCC versions for faster reads #1170
There is some historical discussion in cockroachdb/cockroach#17229 (comment) which explicitly changes the key for older versions. That approach has a more explicit performance tradeoff since one needs to rewrite the existing key-value pair, which makes it less tenable.
@sumeerbhola — I had a shower thought regarding differentiating between provisional and finalized keys. It's complicated, and hopefully it could be simplified. Food for thought:

At the beginning of a compaction, Pebble needs to decide which keys are final and which might be provisional. It takes the smallest-largest key range of the compaction and queries a CockroachDB-provided callback ClosedTimestamp([start, end]) for the current closed timestamp ct for that key range. This value ct guarantees that every key within the range [start, end] with timestamp <ct is either committed or shadowed by a tombstone somewhere in the LSM. But those tombstones may be somewhere above the compaction, meaning the compaction may still contain provisional keys at timestamps <ct. To differentiate those keys, we introduce coordination from all batches that remove or revise provisional values.

history-rewrite ranges

When a batch is going to alter history by deleting or modifying provisional values, the caller flips a bit on the batch indicating so. The min and max MVCC timestamps written by the batch form the batch's 'history-rewrite range'. Every sstable and memtable has a 'history-rewrite range' that encodes a start and end MVCC timestamp. When the batch commits, its 'history-rewrite range' is merged with the memtable's, widening the range to encompass the existing range and the new batch's range. When a memtable is flushed, the output sstables inherit the memtable's history-rewrite range. Some of the sstables may not contain any of the history-rewriting keys, but they still inherit the whole range. [Maybe this could be relaxed to be the intersection of the sstable's MVCC key range and the memtable's history-rewrite range?] When sstables are merged in a compaction, the compaction decides an output's history-rewrite range by merging the history-rewrite ranges of the compaction inputs that overlap with it.
This isn't necessarily all the inputs to the compaction, just the inputs that might have contributed keys to the individual output file. The ranges are merged by widening the range to encompass all the inputted history-rewrite ranges. When a compaction outputs a table to L6, it elides the history-rewrite range property.

compaction closed ranges

Now, when a compaction starts it needs to decide which keys are final and which are provisional. It takes the input sstable boundaries of all its inputs and fragments them. For each fragment's bounds, it calls ClosedTimestamp([start, end]). It associates with the fragment the range [0, ClosedTimestamp], representing the assumption that all its input keys that fall within the fragment and have timestamps in the range [0, ClosedTimestamp] are final:
Now, it ascends the LSM level-by-level looking at files and memtables that overlap the compaction's keyspace. It uses these overlapping files' ranges to reduce which timestamp ranges are considered closed. For every overlapping higher-level file or memtable, it subtracts out the file or memtable's history-rewrite range from the overlapping input intervals. This yields a range, or sequence of ranges, in which keys are known to be finalized within each of these intervals of the compaction.
The compaction can consider final any keys that fall within interval i0 and have timestamps within [0, t6], and so on.

minimum closed timestamps

A long-running transaction in one part of the keyspace doesn't prevent keys from being considered closed in another part... until the long-running transaction gets aborted. When the history-rewriting batch is written to the memtable with the range [t_long_ago, t_recent], it prevents any keys overlapping the memtable and within that wide time range from being considered closed, even if they were considered closed in earlier compactions. In the above example, a history-rewriting batch in a disjoint part of the keyspace could commit a rewrite batch with a rewrite range of [t1, t9]. If i0, i1, or i2 overlap the memtable, they're forced to only consider keys within [0, t1) to be closed. To avoid this, every sstable also has a 'minimum closed timestamp' property. This represents a floor on the file's closed timestamp: it guarantees that any of the file's keys with timestamps less than the minimum closed timestamp are closed. The value is derived from the compaction input intervals. When an sstable is output, its key range overlaps some set of the fragmented compaction input intervals. The minimum closed timestamp holds the minimum upper bound t such that [0, t] is contained within all the overlapping input intervals' closed timestamp ranges. In the above example, an output that spans i0 and i1 would have a minimum closed timestamp of min(t6, t8) = t6. When computing closed ranges for a compaction, each compaction input interval is assigned a minimum closed timestamp equal to the minimum of all the overlapping input sstables' minimum closed timestamps.
When traversing the LSM and subtracting out overlapping files' history-rewrite ranges, the interval's minimum closed timestamp prevents the removal of the range below the minimum closed timestamp:
In the above example, interval i2's minimum closed timestamp of t7 prevents the L2 table m's rewrite range of [t5, t5] from reducing the size of its closed timestamp range. The minimum closed timestamp serves as a compact way of carrying forward knowledge of keys' finality from compaction to compaction.
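The interval arithmetic described above can be sketched in Go. This is a minimal, hypothetical sketch (the types and function names are illustrative, not Pebble's): it subtracts each overlapping file's history-rewrite range from an interval's closed range [0, ct), while treating the minimum closed timestamp as a floor that rewrite ranges cannot subtract below.

```go
package main

import "fmt"

// tsRange is a half-open timestamp range [Start, End).
type tsRange struct{ Start, End uint64 }

// closedRanges subtracts each overlapping file's history-rewrite range
// from [0, closedTS), but never removes anything below minClosedTS, the
// floor carried forward from earlier compactions.
func closedRanges(closedTS, minClosedTS uint64, rewrites []tsRange) []tsRange {
	out := []tsRange{{0, closedTS}}
	for _, rw := range rewrites {
		// Clamp the rewrite range so it cannot dip below the floor.
		if rw.End <= minClosedTS {
			continue
		}
		if rw.Start < minClosedTS {
			rw.Start = minClosedTS
		}
		var next []tsRange
		for _, r := range out {
			if rw.End <= r.Start || rw.Start >= r.End {
				next = append(next, r) // no overlap; keep as-is
				continue
			}
			if rw.Start > r.Start {
				next = append(next, tsRange{r.Start, rw.Start})
			}
			if rw.End < r.End {
				next = append(next, tsRange{rw.End, r.End})
			}
		}
		out = next
	}
	return out
}

func main() {
	// Closed up to t9 with floor t7: a rewrite of [t5, t6) is entirely
	// below the floor, so the closed range is untouched.
	fmt.Println(closedRanges(9, 7, []tsRange{{5, 6}})) // prints [{0 9}]
	// Without a floor, the same rewrite splits the closed range in two.
	fmt.Println(closedRanges(9, 0, []tsRange{{5, 6}})) // prints [{0 5} {6 9}]
}
```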
Thanks for the writeup @jbowens. Another idea that came up in the discussion in cockroachdb/cockroach#72121 (comment): consider sstables with separate sections for keys and values. The simplest way to do this is to write
which ensures that the normal sstable sorting takes care of the ordering. It does require buffering the values in-memory (as compressed ssblocks) since they can't be written out immediately. Alternatively, one could reverse the order and put values before keys, in case the buffering required for keys is less. The downside of this scheme is that reading a particular k_i => v_i requires loading 2 ssblocks. So instead of doing this for all keys, say we did this for only the older versions of keys in the sstable. Something like:
All the keys are visible as usual and the values can be lazily loaded if there is indirection. If we get unlucky and k0@10 has been deleted at a higher level, say because the transaction was aborted, there is no correctness issue since we can load the value for k0@9. Assuming such aborts (or pushes causing rewrites) are rare, we will mostly not need to do this. Scans will typically want to step from k0@10 to k1@50, and won't need to pay the expense of reading the older values of k0. And even if the key lengths are not significantly smaller than value lengths, we will get prefix compression on the different versions of the same key.
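The indirection described above can be sketched as follows. This is an illustrative Go sketch, not Pebble's actual types: the latest version keeps its value inline, while older versions carry a handle that can answer length queries without loading the value at all, and only pay for a value-block read when the value is actually needed.

```go
package main

import "fmt"

// valueHandle locates an out-of-line value: enough to find it in a value
// section and to answer length queries without loading it. Hypothetical.
type valueHandle struct {
	blockNum, offset, length uint32
}

// entry is a simplified sstable entry: latest versions store the value
// inline; older versions store only a handle.
type entry struct {
	key    string
	inline []byte       // set for the latest version
	handle *valueHandle // set for older versions
}

// valueLen answers "how big is the value" without touching value blocks,
// which is all some scans (e.g. GC) need for older versions.
func (e *entry) valueLen() int {
	if e.handle != nil {
		return int(e.handle.length)
	}
	return len(e.inline)
}

// value loads lazily: inline values are free; out-of-line values cost a
// value-block read, simulated here by a lookup table.
func (e *entry) value(valueBlocks map[uint32][]byte) []byte {
	if e.handle == nil {
		return e.inline
	}
	b := valueBlocks[e.handle.blockNum]
	return b[e.handle.offset : e.handle.offset+e.handle.length]
}

func main() {
	blocks := map[uint32][]byte{0: []byte("old-value")}
	latest := entry{key: "k0@10", inline: []byte("new")}
	older := entry{key: "k0@9", handle: &valueHandle{blockNum: 0, offset: 0, length: 9}}
	fmt.Println(latest.valueLen(), older.valueLen(), string(older.value(blocks)))
	// prints: 3 9 old-value
}
```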
@sumeerbhola Nice, that's cool. I'd briefly entertained storing values out of band when thinking about range keys because of the duplication incurred from fragmentation. I didn't entertain it for long since range keys are rare. It makes a lot of sense in this case.
Now I'm wondering whether you could make use of the prefix compression to avoid the extra key comparisons too. Something like the caller calls
@jbowens nice idea for avoiding the key comparison -- it seems very doable. Adding an offline comment by @nvanbenschoten on how queue-like workloads could benefit from this:
@jbowens pointed out that this organization may also help during GC of older versions. We do two scans for GC
A different issue: not all keys with a non-empty suffix are true CockroachDB MVCC keys. Lock table keys are the exception, where the ordering on the suffix is random (contains the txn UUID). One easy way to accommodate this would be for the Pebble DB
For the case where there's a lot of garbage like the naive queue implementation, could we use a block property collector, my favorite new Pebble feature? We could write a block property collector that records the prefix cardinality of a block/index-block/sstable. If a block/index-block/sstable has cardinality 1, a
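The collector idea above can be sketched with a simplified, hypothetical interface (not Pebble's actual BlockPropertyCollector API): it tracks how many distinct key prefixes a block contains, so a reader seeking to the next prefix could skip any block whose cardinality is 1.

```go
package main

import "fmt"

// prefixCardinalityCollector counts distinct MVCC key prefixes per block.
// This mimics the shape of a block property collector, but the interface
// here is simplified and hypothetical.
type prefixCardinalityCollector struct {
	lastPrefix  string
	cardinality int
}

// Add is called once per key added to the current block; keys arrive in
// sorted order, so a new prefix differs from the previous one.
func (c *prefixCardinalityCollector) Add(prefix string) {
	if c.cardinality == 0 || prefix != c.lastPrefix {
		c.cardinality++
		c.lastPrefix = prefix
	}
}

// FinishBlock emits the block's property and resets for the next block.
func (c *prefixCardinalityCollector) FinishBlock() int {
	n := c.cardinality
	c.cardinality = 0
	c.lastPrefix = ""
	return n
}

func main() {
	var c prefixCardinalityCollector
	// Ten versions of one queue item in a block -> cardinality 1, so a
	// seek for the next prefix could skip the whole block instead of
	// stepping through every garbage version.
	for i := 0; i < 10; i++ {
		c.Add("queue-item-5")
	}
	fmt.Println(c.FinishBlock()) // prints: 1
}
```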
If we did compaction-time GC, does the encoding of a key still need to be stable once it's committed? I'm imagining modifying committed MVCC keys to set a bit indicating that they are committed. This could be done through a new special range key that associates a key span (always of whole prefixes) with both a closed timestamp and a GC threshold timestamp. These range keys descend the LSM and respect the LSM invariant with respect to keys older than their closed timestamp. In a compaction with one of these range keys, a user-defined merger is invoked for every point key within the range key's span.
MVCCStats would be updated when these range keys are written. The range keys would be elided out of L6. With this scheme:
One obstacle that immediately comes to mind is that the above scheme changes a key's encoding once it's merged with one of these range keys. That would be a problem with today's GC scheme, but performing compaction-time GC sidesteps that. I'm probably unaware of other problems with unstable MVCC keys.
When WriterOptions.ValueBlocksAreEnabled is set to true, older versions of a key are written to a sequence of value blocks, and the key contains a valueHandle which is a tuple (valueLen, blockNum, offsetInBlock). The assumption here is that most reads only need the value of the latest version, and many reads that care about an older version only need the value length.

Value blocks are a simple sequence of (varint-encoded length, value bytes) tuples such that, given the uncompressed value block, the valueHandle can cheaply read the value. The value blocks index connects the blockNum in the valueHandle to the location of the value block. It uses fixed-width encoding to avoid the expense of a general-purpose key-value block. See the comment at the top of value_block.go for details.

The following are preliminary results from a read benchmark, after some performance tuning. The old numbers are master. The needValue=false cases are the ones where value blocks are expected to help.

- The versions=1 cases have no values in value blocks, and the slowdown is the extra call to valueBlockReader that needs to subslice to remove the single-byte prefix.
- The hasCache=false cases correspond to a cold cache, where there will be additional wasted decompression of values that we don't need (when needValue=false). As expected, when there is an improvement, it is larger with hasCache=false. For example, the -97.83% below (almost 50x faster) compared with -79.89%.
- The needValue=true cases are where the code can be up to 2x slower. The higher slowdowns occur when the value size is smaller. In such cases more inline values can be packed into an ssblock, and the code overhead of decoding the valueHandle and the value length in the value block (all of these are varints) becomes a significant component.

This is a prototype in that there are no changes to the InternalIterator interface, and the read path only works for singleLevelIterator.
```
name                                                                     old time/op  new time/op  delta
ValueBlocks/valueSize=100/versions=1/needValue=false/hasCache=false-16     25.5ns ± 3%  25.9ns ± 2%    +1.50%  (p=0.028 n=10+10)
ValueBlocks/valueSize=100/versions=1/needValue=false/hasCache=true-16      15.6ns ± 1%  15.5ns ± 2%      ~     (p=0.268 n=9+10)
ValueBlocks/valueSize=100/versions=1/needValue=true/hasCache=false-16      27.3ns ± 3%  29.5ns ± 3%    +8.11%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=1/needValue=true/hasCache=true-16       17.1ns ± 2%  19.2ns ± 2%   +12.74%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=10/needValue=false/hasCache=false-16    26.7ns ± 2%  29.4ns ± 2%   +10.46%  (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=false/hasCache=true-16     15.9ns ± 2%  15.2ns ± 3%    -4.63%  (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=true/hasCache=false-16     26.7ns ± 2%  53.0ns ± 4%   +98.79%  (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=true/hasCache=true-16      16.6ns ± 1%  26.7ns ± 2%   +61.05%  (p=0.000 n=9+9)
ValueBlocks/valueSize=100/versions=100/needValue=false/hasCache=false-16   28.3ns ± 4%  25.3ns ± 5%   -10.74%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=100/needValue=false/hasCache=true-16    15.8ns ± 2%  14.9ns ± 1%    -5.66%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=100/needValue=true/hasCache=false-16    29.4ns ± 4%  47.8ns ± 3%   +62.46%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=100/needValue=true/hasCache=true-16     16.7ns ± 4%  26.1ns ± 3%   +56.04%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=1/needValue=false/hasCache=false-16     123ns ± 4%   125ns ± 7%      ~     (p=0.735 n=9+10)
ValueBlocks/valueSize=1000/versions=1/needValue=false/hasCache=true-16     23.0ns ± 5%  22.9ns ± 5%      ~     (p=0.684 n=10+10)
ValueBlocks/valueSize=1000/versions=1/needValue=true/hasCache=false-16      124ns ± 6%   131ns ± 7%    +5.76%  (p=0.008 n=9+10)
ValueBlocks/valueSize=1000/versions=1/needValue=true/hasCache=true-16      24.3ns ± 4%  26.4ns ± 3%    +8.26%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=false/hasCache=false-16    130ns ± 8%    27ns ± 4%   -79.10%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=false/hasCache=true-16    23.8ns ± 4%  16.6ns ± 2%   -30.00%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=true/hasCache=false-16     128ns ± 9%   164ns ±12%   +27.94%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=true/hasCache=true-16     25.0ns ± 4%  33.0ns ± 2%   +32.22%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=100/needValue=false/hasCache=false-16   123ns ± 9%    28ns ± 3%   -76.89%  (p=0.000 n=9+10)
ValueBlocks/valueSize=1000/versions=100/needValue=false/hasCache=true-16   23.0ns ± 2%  15.3ns ± 5%   -33.36%  (p=0.000 n=10+9)
ValueBlocks/valueSize=1000/versions=100/needValue=true/hasCache=false-16    132ns ± 2%   171ns ± 5%   +29.24%  (p=0.000 n=8+10)
ValueBlocks/valueSize=1000/versions=100/needValue=true/hasCache=true-16    24.3ns ± 3%  32.6ns ± 3%   +33.98%  (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=false/hasCache=false-16   1.45µs ± 8%  1.35µs ±10%    -6.41%  (p=0.015 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=false/hasCache=true-16    75.5ns ± 2%  76.7ns ± 5%      ~     (p=0.218 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=true/hasCache=false-16    1.34µs ± 3%  1.46µs ±16%    +9.03%  (p=0.022 n=9+10)
ValueBlocks/valueSize=10000/versions=1/needValue=true/hasCache=true-16     77.0ns ± 3%  79.9ns ± 3%    +3.80%  (p=0.000 n=9+10)
ValueBlocks/valueSize=10000/versions=10/needValue=false/hasCache=false-16  1.46µs ± 6%  0.13µs ± 3%   -91.15%  (p=0.000 n=9+9)
ValueBlocks/valueSize=10000/versions=10/needValue=false/hasCache=true-16   76.4ns ± 3%  21.4ns ± 2%   -72.06%  (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=10/needValue=true/hasCache=false-16   1.47µs ± 8%  1.56µs ± 7%    +5.72%  (p=0.013 n=9+10)
ValueBlocks/valueSize=10000/versions=10/needValue=true/hasCache=true-16    78.1ns ± 4%  76.1ns ± 2%    -2.52%  (p=0.009 n=10+10)
ValueBlocks/valueSize=10000/versions=100/needValue=false/hasCache=false-16 1.34µs ± 5%  0.03µs ± 2%   -97.83%  (p=0.000 n=9+10)
ValueBlocks/valueSize=10000/versions=100/needValue=false/hasCache=true-16  77.0ns ± 2%  15.5ns ± 2%   -79.89%  (p=0.000 n=8+10)
ValueBlocks/valueSize=10000/versions=100/needValue=true/hasCache=false-16  1.42µs ± 9%  1.49µs ± 2%    +5.28%  (p=0.007 n=10+9)
ValueBlocks/valueSize=10000/versions=100/needValue=true/hasCache=true-16   78.5ns ± 4%  73.0ns ± 4%    -7.01%  (p=0.000 n=10+9)
```

To put these numbers in context of what fraction of bytes in an sstable are in regular data blocks (the rest are in value blocks):

```
value-size:   100, versions:   1, data-block-bytes:   1185521, total-bytes:   1186984, fraction-in-data-blocks: 1.00
value-size:   100, versions:  10, data-block-bytes:    174822, total-bytes:   1085218, fraction-in-data-blocks: 0.16
value-size:   100, versions: 100, data-block-bytes:     82118, total-bytes:   1083393, fraction-in-data-blocks: 0.08
value-size:  1000, versions:   1, data-block-bytes:  10197788, total-bytes:  10203513, fraction-in-data-blocks: 1.00
value-size:  1000, versions:  10, data-block-bytes:   1199435, total-bytes:  10222244, fraction-in-data-blocks: 0.12
value-size:  1000, versions: 100, data-block-bytes:    169931, total-bytes:  10094442, fraction-in-data-blocks: 0.02
value-size: 10000, versions:   1, data-block-bytes: 100246534, total-bytes: 100292713, fraction-in-data-blocks: 1.00
value-size: 10000, versions:  10, data-block-bytes:  10204053, total-bytes: 100256921, fraction-in-data-blocks: 0.10
value-size: 10000, versions: 100, data-block-bytes:   1198846, total-bytes: 100252245, fraction-in-data-blocks: 0.01
```

Informs cockroachdb#1170
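The valueHandle tuple described above, (valueLen, blockNum, offsetInBlock), lends itself to a compact varint encoding. This Go sketch illustrates the shape of such an encoding; it is not Pebble's exact on-disk format.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// valueHandle mirrors the tuple described in the prose; the encoding
// details below are illustrative rather than Pebble's actual format.
type valueHandle struct {
	valueLen, blockNum, offsetInBlock uint64
}

// encodeValueHandle appends the three fields as unsigned varints, so
// small values (the common case) take one or two bytes each.
func encodeValueHandle(h valueHandle) []byte {
	buf := make([]byte, 0, 3*binary.MaxVarintLen64)
	buf = binary.AppendUvarint(buf, h.valueLen)
	buf = binary.AppendUvarint(buf, h.blockNum)
	buf = binary.AppendUvarint(buf, h.offsetInBlock)
	return buf
}

// decodeValueHandle reverses the encoding. Note that valueLen comes
// first, so a reader that only needs the length can stop after one varint.
func decodeValueHandle(b []byte) valueHandle {
	var h valueHandle
	var n int
	h.valueLen, n = binary.Uvarint(b)
	b = b[n:]
	h.blockNum, n = binary.Uvarint(b)
	b = b[n:]
	h.offsetInBlock, _ = binary.Uvarint(b)
	return h
}

func main() {
	h := valueHandle{valueLen: 1000, blockNum: 3, offsetInBlock: 4096}
	enc := encodeValueHandle(h)
	fmt.Println(len(enc), decodeValueHandle(enc) == h) // prints: 5 true
}
```

Putting valueLen first is what makes the "only need the value length" reads cheap: they decode a single varint and never touch the value block.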
I've been thinking about this issue in the context of MVCCDeleteRange. A span that includes a MVCCDeleteRange will suffer a performance cliff, because an iterator seeking into the deleted span needs to construct a range-key iterator stack and perform extra key comparisons during interleaving. If a MVCCDeleteRange falls into L6, we could output it and the point keys it covers into a garbage-only L7. Iterators reading at recent timestamps could skip the entire file. This particular case is easier than the general garbage case, since there are no provisional MVCCDeleteRanges.
It's possible that these will exist in the future (along with ranged intents), if SQL wants to start making use of them. But there are no short-term plans for this. Also, Pebble can't know whether a range key is an MVCCDeleteRange or some other ranged key type, so CRDB will have to inform Pebble about these semantics. Although I suppose any kind of MVCC range key would imply that points below it are garbage. We discussed something related over in cockroachdb/cockroach#57260 (comment), where CRDB can lay down a new kind of range key during GC that it can use in compaction filters to do MVCC GC during Pebble compactions.
Yeah, I think we should do this regardless when we introduce any other ranged key type, so that we can explicitly avoid applying range-key masking using non-MVCCDeleteRange range keys.
The GC scan to identify garbage does not need the value of MVCC data, and only needs the length and whether the value is a tombstone. This change switches it to use ReplicaMVCCDataIterator.MVCCValueLenAndIsTombstone which passes through to the underlying MVCCIterator. Informs cockroachdb/pebble#1170 Release note: None
BenchmarkIteratorScanManyVersions, which iterates over all the versions, now also runs with readValue=true. Similarly, for BenchmarkIteratorScanNextPrefix, which iterates over the most recent versions. Informs cockroachdb#1170
BenchmarkIteratorScanNextPrefix now exercises value blocks. Also, the benchmark now does an iterator step per benchmark iteration instead of a full scan -- this makes the benchmark faster and allows us to compare performance metrics across benchmark parameters (since the metrics do not vary based on the total number of different key prefixes in the LSM). There is a tiny tweak to the field ordering in LazyFetcher and additional code comments about performance and future optimization. Informs cockroachdb#1170
execinfrapb.ScanStats and execstats.ScanStats have these new fields. The intention here is for traces to show these stats and not to complete the plumbing to expose these in the fingerprint stats table. Informs cockroachdb/pebble#1170 Epic: CRDB-20378 Release note: None
The remaining work is to produce a tech note describing the design.
Something that didn't click for me until I was talking with @itsbilal just now: The
Are we restoring to a point in time, so it could be an arbitrary version, and not necessarily the most recent one? The optimization does seem worthwhile.
Yeah, I think it's possible to restore to a particular point-in-time that might be behind the most recent version of a key. I suppose at transform time, we could use the block properties (do we write these to backups today?) to determine whether the restore timestamp is higher than all the keys within the block. I think it'll be relatively uncommon to need to restore to a non-most-recent version, because only the full backup will be restored "lazily" and require transformation.
This issue is different from #847 since this one is specifically about how Pebble is used in CockroachDB to store MVCC data. When there are many MVCC versions for a key, reads slow down because (a) the block cache is less effective since most of each block is data that is not needed, (b) when doing scans one needs to seek from one versioned key to another to skip iterating over all the versions.
A prerequisite to doing something here is Pebble gaining some understanding of MVCC timestamps. That prerequisite is also the case for compaction-time GC cockroachdb/cockroach#57260. One can also argue that compaction-time GC may make separating the latest and old MVCC versions less important.
If we ignore provisional values, a rough approach would be to put “live” and “dead” versions of data into separate files. This should not be hard when the multiple versions are in the same sstable -- when generating that sstable instead of writing one sstable we would write a pair, the “live” and “dead” sstable. These will have overlapping keys and will participate in compactions as a pair. A read at the “current time” can ignore the "dead" sstable. But what is “current time” given that we don’t have TrueTime like spanner? One possibility is to track the newest timestamp in the “live” sstable -- if the txn timestamp is newer than this newest timestamp there is no possibility that it will need anything from the “dead” sstable. This should work for sstables in lower levels which have most of the data.
The above approach will need adjustment to handle provisional values and intents.
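The "current time" check described above reduces to a single timestamp comparison per read. A minimal Go sketch, with illustrative names (the newest-timestamp property on the "live" sstable is the tracked value the text proposes):

```go
package main

import "fmt"

// canSkipDead reports whether a read at txnTS can ignore the "dead"
// sstable of a live/dead pair, given the newest timestamp present in
// the paired "live" sstable.
func canSkipDead(txnTS, newestLiveTS uint64) bool {
	// If the transaction reads at a timestamp newer than anything in the
	// live sstable, every version it could need is in the live sstable,
	// so nothing in the dead sstable can be visible to it.
	return txnTS > newestLiveTS
}

func main() {
	fmt.Println(canSkipDead(100, 90), canSkipDead(80, 90)) // prints: true false
}
```

A read at an older timestamp (the second case) must still consult the dead sstable, since the version visible to it may have since been superseded and moved there.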