sstable: add value blocks for storing older versions of a key
When WriterOptions.ValueBlocksAreEnabled is set to true, older
versions of a key are written to a sequence of value blocks, and
the key contains a valueHandle which is a tuple
(valueLen, blockNum, offsetInBlock). The assumption here is
that most reads only need the value of the latest version, and
many reads that care about an older version only need the value
length.
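
As a rough illustration of the tuple described above, the following
Go sketch encodes a (valueLen, blockNum, offsetInBlock) handle as
three uvarints and shows that the value length can be recovered from
the handle alone, without touching any value block. The names
handle, encodeHandle, decodeValueLen and decodeHandle are
assumptions made for this example; the actual serialization in
value_block.go (which also includes a prefix byte) may differ.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// handle is a hypothetical stand-in for the commit's valueHandle.
type handle struct {
	valueLen      uint32
	blockNum      uint32
	offsetInBlock uint32
}

// encodeHandle appends the three fields to dst as uvarints.
func encodeHandle(dst []byte, h handle) []byte {
	var tmp [binary.MaxVarintLen32]byte
	for _, v := range []uint32{h.valueLen, h.blockNum, h.offsetInBlock} {
		n := binary.PutUvarint(tmp[:], uint64(v))
		dst = append(dst, tmp[:n]...)
	}
	return dst
}

// decodeValueLen reads only the first field, which is all that a read
// interested in just the value length needs.
func decodeValueLen(src []byte) (uint32, bool) {
	v, n := binary.Uvarint(src)
	if n <= 0 {
		return 0, false
	}
	return uint32(v), true
}

// decodeHandle reads all three fields; ok is false on truncated input.
func decodeHandle(src []byte) (handle, bool) {
	var out [3]uint32
	for i := range out {
		v, n := binary.Uvarint(src)
		if n <= 0 {
			return handle{}, false
		}
		out[i] = uint32(v)
		src = src[n:]
	}
	return handle{valueLen: out[0], blockNum: out[1], offsetInBlock: out[2]}, true
}

func main() {
	h := handle{valueLen: 1000, blockNum: 3, offsetInBlock: 4096}
	enc := encodeHandle(nil, h)
	n, _ := decodeValueLen(enc)
	dec, _ := decodeHandle(enc)
	fmt.Println(len(enc), n, dec == h)
}
```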

Value blocks are a simple sequence of (varint encoded length,
value bytes) tuples such that given the uncompressed value
block, the valueHandle can cheaply read the value. The value
blocks index connects the blockNum in the valueHandle to the
location of the value block. It uses fixed width encoding to
avoid the expense of a general purpose key-value block.
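
The layout described above can be sketched in Go as follows. This is
a minimal illustration, not the code in value_block.go: the function
names are hypothetical, and the index entry format (two
little-endian uint64 fields, 16 bytes per entry) is an assumption
chosen only to show why fixed-width entries make the blockNum lookup
a simple multiplication.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// appendValue appends one (uvarint length, value bytes) tuple to block and
// returns the grown block plus the offset at which the tuple starts.
func appendValue(block, value []byte) ([]byte, uint32) {
	off := uint32(len(block))
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], uint64(len(value)))
	block = append(block, tmp[:n]...)
	block = append(block, value...)
	return block, off
}

// readValue returns the value whose tuple starts at off in an
// uncompressed value block.
func readValue(block []byte, off uint32) ([]byte, bool) {
	if int(off) > len(block) {
		return nil, false
	}
	l, n := binary.Uvarint(block[off:])
	if n <= 0 {
		return nil, false
	}
	start := uint64(off) + uint64(n)
	if l > uint64(len(block))-start {
		return nil, false
	}
	return block[start : start+l], true
}

// indexEntrySize is an assumed fixed width: file offset and block length,
// 8 bytes each.
const indexEntrySize = 16

// appendIndexEntry appends the location of the next value block.
func appendIndexEntry(index []byte, fileOffset, length uint64) []byte {
	var e [indexEntrySize]byte
	binary.LittleEndian.PutUint64(e[0:8], fileOffset)
	binary.LittleEndian.PutUint64(e[8:16], length)
	return append(index, e[:]...)
}

// lookupBlock maps blockNum to a block location by direct indexing.
func lookupBlock(index []byte, blockNum uint32) (fileOffset, length uint64, ok bool) {
	start := int(blockNum) * indexEntrySize
	if start+indexEntrySize > len(index) {
		return 0, 0, false
	}
	e := index[start : start+indexEntrySize]
	return binary.LittleEndian.Uint64(e[0:8]), binary.LittleEndian.Uint64(e[8:16]), true
}

func main() {
	var block []byte
	block, off0 := appendValue(block, []byte("old-v2"))
	block, off1 := appendValue(block, []byte("old-v1"))
	v, _ := readValue(block, off1)
	fmt.Println(off0, off1, string(v))

	index := appendIndexEntry(nil, 4096, uint64(len(block)))
	fileOffset, length, _ := lookupBlock(index, 0)
	fmt.Println(fileOffset, length)
}
```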

See the comment at the top of value_block.go for details.

The following are preliminary results from a read benchmark,
after some performance tuning. The old numbers are from master.
The needValue=false cases are the ones where value blocks are
expected to help.
- The versions=1 cases have no values in value blocks. The
  slowdown comes from the extra call to valueBlockReader, which
  must subslice the value to remove the single-byte prefix.
- The hasCache=false cases correspond to a cold cache, where,
  without value blocks, additional time is wasted decompressing
  values that are not needed (when needValue=false). As expected,
  when there is an improvement, it is larger with hasCache=false.
  For example, with valueSize=10000 and versions=100, the delta is
  -97.83% (about 46x faster) with hasCache=false, compared with
  -79.89% with hasCache=true.
- The needValue=true cases are where the code can be up to 2x
  slower. The larger slowdowns occur when the value size is
  smaller. In such cases more inline values can be packed into an
  ssblock, so the overhead of decoding the valueHandle and the
  value length in the value block (all of these are varints)
  becomes a significant component.
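
To relate the delta column to a speedup factor: a delta of d percent
in time/op means the new code takes (1 + d/100) times as long, so
the speedup is the reciprocal of that. The helper below is a small
sketch written for this note (speedup is not part of the commit or
of benchstat); it gives roughly 46x for -97.83%, roughly 5x for
-79.89%, and about half the original speed for +98.79%.

```go
package main

import "fmt"

// speedup converts a benchstat time/op delta, in percent, into a
// speedup factor. Values above 1 mean the new code is faster.
func speedup(deltaPercent float64) float64 {
	return 1 / (1 + deltaPercent/100)
}

func main() {
	for _, d := range []float64{-97.83, -79.89, 98.79} {
		fmt.Printf("delta %+.2f%% -> %.2fx\n", d, speedup(d))
	}
}
```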

This is a prototype in that there are no changes to the
InternalIterator interface, and the read path only works for
singleLevelIterator.

name                                                                        old time/op    new time/op    delta
ValueBlocks/valueSize=100/versions=1/needValue=false/hasCache=false-16        25.5ns ± 3%    25.9ns ± 2%   +1.50%  (p=0.028 n=10+10)
ValueBlocks/valueSize=100/versions=1/needValue=false/hasCache=true-16         15.6ns ± 1%    15.5ns ± 2%     ~     (p=0.268 n=9+10)
ValueBlocks/valueSize=100/versions=1/needValue=true/hasCache=false-16         27.3ns ± 3%    29.5ns ± 3%   +8.11%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=1/needValue=true/hasCache=true-16          17.1ns ± 2%    19.2ns ± 2%  +12.74%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=10/needValue=false/hasCache=false-16       26.7ns ± 2%    29.4ns ± 2%  +10.46%  (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=false/hasCache=true-16        15.9ns ± 2%    15.2ns ± 3%   -4.63%  (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=true/hasCache=false-16        26.7ns ± 2%    53.0ns ± 4%  +98.79%  (p=0.000 n=9+10)
ValueBlocks/valueSize=100/versions=10/needValue=true/hasCache=true-16         16.6ns ± 1%    26.7ns ± 2%  +61.05%  (p=0.000 n=9+9)
ValueBlocks/valueSize=100/versions=100/needValue=false/hasCache=false-16      28.3ns ± 4%    25.3ns ± 5%  -10.74%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=100/needValue=false/hasCache=true-16       15.8ns ± 2%    14.9ns ± 1%   -5.66%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=100/needValue=true/hasCache=false-16       29.4ns ± 4%    47.8ns ± 3%  +62.46%  (p=0.000 n=10+10)
ValueBlocks/valueSize=100/versions=100/needValue=true/hasCache=true-16        16.7ns ± 4%    26.1ns ± 3%  +56.04%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=1/needValue=false/hasCache=false-16        123ns ± 4%     125ns ± 7%     ~     (p=0.735 n=9+10)
ValueBlocks/valueSize=1000/versions=1/needValue=false/hasCache=true-16        23.0ns ± 5%    22.9ns ± 5%     ~     (p=0.684 n=10+10)
ValueBlocks/valueSize=1000/versions=1/needValue=true/hasCache=false-16         124ns ± 6%     131ns ± 7%   +5.76%  (p=0.008 n=9+10)
ValueBlocks/valueSize=1000/versions=1/needValue=true/hasCache=true-16         24.3ns ± 4%    26.4ns ± 3%   +8.26%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=false/hasCache=false-16       130ns ± 8%      27ns ± 4%  -79.10%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=false/hasCache=true-16       23.8ns ± 4%    16.6ns ± 2%  -30.00%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=true/hasCache=false-16        128ns ± 9%     164ns ±12%  +27.94%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=10/needValue=true/hasCache=true-16        25.0ns ± 4%    33.0ns ± 2%  +32.22%  (p=0.000 n=10+10)
ValueBlocks/valueSize=1000/versions=100/needValue=false/hasCache=false-16      123ns ± 9%      28ns ± 3%  -76.89%  (p=0.000 n=9+10)
ValueBlocks/valueSize=1000/versions=100/needValue=false/hasCache=true-16      23.0ns ± 2%    15.3ns ± 5%  -33.36%  (p=0.000 n=10+9)
ValueBlocks/valueSize=1000/versions=100/needValue=true/hasCache=false-16       132ns ± 2%     171ns ± 5%  +29.24%  (p=0.000 n=8+10)
ValueBlocks/valueSize=1000/versions=100/needValue=true/hasCache=true-16       24.3ns ± 3%    32.6ns ± 3%  +33.98%  (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=false/hasCache=false-16      1.45µs ± 8%    1.35µs ±10%   -6.41%  (p=0.015 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=false/hasCache=true-16       75.5ns ± 2%    76.7ns ± 5%     ~     (p=0.218 n=10+10)
ValueBlocks/valueSize=10000/versions=1/needValue=true/hasCache=false-16       1.34µs ± 3%    1.46µs ±16%   +9.03%  (p=0.022 n=9+10)
ValueBlocks/valueSize=10000/versions=1/needValue=true/hasCache=true-16        77.0ns ± 3%    79.9ns ± 3%   +3.80%  (p=0.000 n=9+10)
ValueBlocks/valueSize=10000/versions=10/needValue=false/hasCache=false-16     1.46µs ± 6%    0.13µs ± 3%  -91.15%  (p=0.000 n=9+9)
ValueBlocks/valueSize=10000/versions=10/needValue=false/hasCache=true-16      76.4ns ± 3%    21.4ns ± 2%  -72.06%  (p=0.000 n=10+10)
ValueBlocks/valueSize=10000/versions=10/needValue=true/hasCache=false-16      1.47µs ± 8%    1.56µs ± 7%   +5.72%  (p=0.013 n=9+10)
ValueBlocks/valueSize=10000/versions=10/needValue=true/hasCache=true-16       78.1ns ± 4%    76.1ns ± 2%   -2.52%  (p=0.009 n=10+10)
ValueBlocks/valueSize=10000/versions=100/needValue=false/hasCache=false-16    1.34µs ± 5%    0.03µs ± 2%  -97.83%  (p=0.000 n=9+10)
ValueBlocks/valueSize=10000/versions=100/needValue=false/hasCache=true-16     77.0ns ± 2%    15.5ns ± 2%  -79.89%  (p=0.000 n=8+10)
ValueBlocks/valueSize=10000/versions=100/needValue=true/hasCache=false-16     1.42µs ± 9%    1.49µs ± 2%   +5.28%  (p=0.007 n=10+9)
ValueBlocks/valueSize=10000/versions=100/needValue=true/hasCache=true-16      78.5ns ± 4%    73.0ns ± 4%   -7.01%  (p=0.000 n=10+9)

Informs cockroachdb#1170
sumeerbhola committed Jan 6, 2022
1 parent 90d9c97 commit cab59da
Showing 13 changed files with 1,017 additions and 43 deletions.
4 changes: 4 additions & 0 deletions internal/base/iterator.go
@@ -6,6 +6,10 @@ package base

 import "fmt"

+// TODO(sumeer): change the InternalIterator interface to not eagerly return
+// the value. Implementations should continue to cache the value once asked to
+// return it, so that repeated calls to get the value are cheap.
+
 // InternalIterator iterates over a DB's key/value pairs in key order. Unlike
 // the Iterator interface, the returned keys are InternalKeys composed of the
 // user-key, a sequence number and a key kind. In forward iteration, key/value
37 changes: 30 additions & 7 deletions sstable/block.go
@@ -12,6 +12,15 @@ import (
 	"github.com/cockroachdb/pebble/internal/cache"
 )

+// NB: blockWriter supports addInlineValuePrefix for efficiency reasons, in
+// that we don't want the caller to have to copy the value to a new slice in
+// order to add the prefix. It does not know about valueHandlePrefix since the
+// serialization of valueHandle includes that prefix. Similarly, blockReader
+// knows nothing about these prefixes since the value read from a block that
+// has such prefixes is passed to valueBlockReader for interpretation. A
+// cleaner abstraction would remove all knowledge of the prefix from this
+// file.
+
 func uvarintLen(v uint32) int {
 	i := 0
 	for v >= 0x80 {
@@ -28,12 +37,13 @@ type blockWriter struct {
 	buf             []byte
 	restarts        []uint32
 	curKey          []byte
-	curValue        []byte
-	prevKey         []byte
-	tmp             [4]byte
+	// curValue excludes the optional inlineValuePrefix.
+	curValue []byte
+	prevKey  []byte
+	tmp      [4]byte
 }

-func (w *blockWriter) store(keySize int, value []byte) {
+func (w *blockWriter) store(keySize int, value []byte, addInlineValuePrefix bool) {
 	shared := 0
 	if w.nEntries == w.nextRestart {
 		w.nextRestart = w.nEntries + w.restartInterval
@@ -58,7 +68,11 @@ func (w *blockWriter) store(keySize int, value []byte) {
 		}
 	}

-	needed := 3*binary.MaxVarintLen32 + len(w.curKey[shared:]) + len(value)
+	lenValuePlusOptionalPrefix := len(value)
+	if addInlineValuePrefix {
+		lenValuePlusOptionalPrefix++
+	}
+	needed := 3*binary.MaxVarintLen32 + len(w.curKey[shared:]) + lenValuePlusOptionalPrefix
 	n := len(w.buf)
 	if cap(w.buf) < n+needed {
 		newCap := 2 * cap(w.buf)
@@ -100,7 +114,7 @@ func (w *blockWriter) store(keySize int, value []byte) {
 	}

 	{
-		x := uint32(len(value))
+		x := uint32(lenValuePlusOptionalPrefix)
 		for x >= 0x80 {
 			w.buf[n] = byte(x) | 0x80
 			x >>= 7
@@ -111,6 +125,10 @@ func (w *blockWriter) store(keySize int, value []byte) {
 	}

 	n += copy(w.buf[n:], w.curKey[shared:])
+	if addInlineValuePrefix {
+		w.buf[n : n+1][0] = byte(inlineValuePrefix)
+		n++
+	}
 	n += copy(w.buf[n:], value)
 	w.buf = w.buf[:n]

@@ -120,6 +138,11 @@ func (w *blockWriter) store(keySize int, value []byte) {
 }

 func (w *blockWriter) add(key InternalKey, value []byte) {
+	w.addWithOptionalInlineValuePrefix(key, value, false)
+}
+
+func (w *blockWriter) addWithOptionalInlineValuePrefix(
+	key InternalKey, value []byte, addInlineValuePrefix bool) {
 	w.curKey, w.prevKey = w.prevKey, w.curKey

 	size := key.Size()
@@ -129,7 +152,7 @@ func (w *blockWriter) add(key InternalKey, value []byte) {
 	w.curKey = w.curKey[:size]
 	key.Encode(w.curKey)

-	w.store(size, value)
+	w.store(size, value, addInlineValuePrefix)
 }

 func (w *blockWriter) finish() []byte {
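
The store change above amounts to counting one extra byte in the
encoded value length and writing the prefix byte ahead of the value,
so the caller never has to copy the value into a new slice. A
minimal sketch of that idea outside of blockWriter follows. The
names appendEntryValue and readEntryValue are hypothetical, and
examplePrefix is a placeholder: the actual value of
inlineValuePrefix is not shown in this diff.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// examplePrefix is a placeholder, not the commit's inlineValuePrefix value.
const examplePrefix byte = 0x01

// appendEntryValue writes uvarint(length) followed by the optional prefix
// byte and the value. The encoded length includes the prefix byte.
func appendEntryValue(buf, value []byte, addPrefix bool) []byte {
	n := len(value)
	if addPrefix {
		n++
	}
	var tmp [binary.MaxVarintLen32]byte
	m := binary.PutUvarint(tmp[:], uint64(n))
	buf = append(buf, tmp[:m]...)
	if addPrefix {
		buf = append(buf, examplePrefix)
	}
	return append(buf, value...)
}

// readEntryValue returns the stored bytes exactly as written, prefix
// included. Interpreting the prefix is left to the caller.
func readEntryValue(buf []byte) ([]byte, bool) {
	l, n := binary.Uvarint(buf)
	if n <= 0 || l > uint64(len(buf)-n) {
		return nil, false
	}
	return buf[n : n+int(l)], true
}

func main() {
	buf := appendEntryValue(nil, []byte("abc"), true)
	stored, _ := readEntryValue(buf)
	// The reader subslices to drop the single-byte prefix.
	fmt.Println(len(stored), string(stored[1:]))
}
```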
4 changes: 4 additions & 0 deletions sstable/options.go
@@ -206,6 +206,10 @@ type WriterOptions struct {

 	// Checksum specifies which checksum to use.
 	Checksum ChecksumType
+
+	// ValueBlocksAreEnabled indicates whether the writer should place older
+	// versions in value blocks.
+	ValueBlocksAreEnabled bool
 }

 func (o WriterOptions) ensureDefaults() WriterOptions {
16 changes: 16 additions & 0 deletions sstable/properties.go
@@ -120,6 +120,10 @@ type Properties struct {
 	NumRangeKeySets uint64 `prop:"pebble.num.range-key-sets"`
 	// The number of RANGEKEYUNSETs in this table.
 	NumRangeKeyUnsets uint64 `prop:"pebble.num.range-key-unsets"`
+	// The number of value blocks in this table. Only serialized if > 0.
+	NumValueBlocks uint64 `prop:"pebble.num.value-blocks"`
+	// The number of values stored in value blocks. Only serialized if > 0.
+	NumValuesInValueBlocks uint64 `prop:"pebble.num.values.in.value-blocks"`
 	// Timestamp of the earliest key. 0 if unknown.
 	OldestKeyTime uint64 `prop:"rocksdb.oldest.key.time"`
 	// The name of the prefix extractor used in this table. Empty if no prefix
@@ -142,6 +146,9 @@ type Properties struct {
 	TopLevelIndexSize uint64 `prop:"rocksdb.top-level.index.size"`
 	// User collected properties.
 	UserProperties map[string]string
+	// True iff the use of value blocks is enabled. Only serialized if true.
+	ValueBlocksAreEnabled bool `prop:"pebble.value-blocks.enabled"`
+
 	// If filtering is enabled, was the filter created on the whole key.
 	WholeKeyFiltering bool `prop:"rocksdb.block.based.table.whole.key.filtering"`

@@ -340,6 +347,12 @@ func (p *Properties) save(w *rawBlockWriter) {
 		p.saveUvarint(m, unsafe.Offsetof(p.RawRangeKeyKeySize), p.RawRangeKeyKeySize)
 		p.saveUvarint(m, unsafe.Offsetof(p.RawRangeKeyValueSize), p.RawRangeKeyValueSize)
 	}
+	if p.NumValueBlocks > 0 {
+		p.saveUvarint(m, unsafe.Offsetof(p.NumValueBlocks), p.NumValueBlocks)
+	}
+	if p.NumValuesInValueBlocks > 0 {
+		p.saveUvarint(m, unsafe.Offsetof(p.NumValuesInValueBlocks), p.NumValuesInValueBlocks)
+	}
 	p.saveUvarint(m, unsafe.Offsetof(p.OldestKeyTime), p.OldestKeyTime)
 	if p.PrefixExtractorName != "" {
 		p.saveString(m, unsafe.Offsetof(p.PrefixExtractorName), p.PrefixExtractorName)
@@ -350,6 +363,9 @@ func (p *Properties) save(w *rawBlockWriter) {
 	}
 	p.saveUvarint(m, unsafe.Offsetof(p.RawKeySize), p.RawKeySize)
 	p.saveUvarint(m, unsafe.Offsetof(p.RawValueSize), p.RawValueSize)
+	if p.ValueBlocksAreEnabled {
+		p.saveBool(m, unsafe.Offsetof(p.ValueBlocksAreEnabled), p.ValueBlocksAreEnabled)
+	}
 	p.saveBool(m, unsafe.Offsetof(p.WholeKeyFiltering), p.WholeKeyFiltering)

 	keys := make([]string, 0, len(m))
2 changes: 1 addition & 1 deletion sstable/raw_block.go
@@ -26,7 +26,7 @@ func (w *rawBlockWriter) add(key InternalKey, value []byte) {
 	w.curKey = w.curKey[:size]
 	copy(w.curKey, key.UserKey)

-	w.store(size, value)
+	w.store(size, value, false)
 }

 // rawBlockIter is an iterator over a single block of data. Unlike blockIter,