-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage,kvserver: support lazy value fetching in SimpleMVCCIterator #89119
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exciting! Have we done any early prototyping to see the performance impact of value separation? Do we have a sense of value size distributions in real workloads (e.g. CC?)?
I think we should consider storing MVCCValueHeader
separately from the value as well. That would allow us to inspect e.g. the LocalTimestamp
without loading the entire value, and we may extend its use to other metadata as well (as you recall, we very nearly added an import ID field here with a scan predicate). In that case, we may consider adding in MVCCValueHeader()
here too.
Reviewed 16 of 16 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/storage/engine.go
line 182 at r1 (raw file):
// underlying storage layer to avoid retrieving the value. // REQUIRES: HasPointAndRange() has returned (true, *). MVCCValueLenAndIsTombstone() (int, bool, error)
nit: I found the MVCC prefix a bit confusing here, and initially thought it might be len(MVCCValue.Value.RawBytes)
. I think I'd prefer removing the prefix to clarify that it returns the same value as ValueLen()
, and return the tombstone if it successfully decodes an MVCC tombstone (silently discarding the decoding error otherwise). Alternatively, consider moving MVCC to ValueLenAndIsMVCCTombstone
.
pkg/storage/engine.go
line 1498 at r1 (raw file):
// returning errors.AssertionFailedf for any violations. The iterator // must be valid. func assertSimpleMVCCIteratorInvariants(iter SimpleMVCCIterator) error {
Let's add some assertions here for the new methods, e.g. ValueLen() == len(UnsafeValue())
and IsTombstone
verification.
pkg/storage/intent_interleaving_iter.go
line 900 at r1 (raw file):
func (i *intentInterleavingIter) MVCCValueLenAndIsTombstone() (int, bool, error) { if i.isCurAtIntentIter() { // Defensive code: Returning the actual length in the error case is not
I don't think I'd bother with this -- it's idiomatic to return zero values for errors, and this could in fact encourage callers to ignore the error.
pkg/storage/point_synthesizing_iter.go
line 656 at r1 (raw file):
val := i.rangeKeys[i.rangeKeysIdx].Value if len(val) != 0 { return 0, false, errors.Errorf("unexpected non-empty value for range key")
This can legally happen if the range tombstone has an MVCCValueHeader.LocalTimestamp
set. We'll have to decode the value to check, but we currently assume they're always range tombstones throughout the code base.
pkg/storage/sst_writer.go
line 62 at r1 (raw file):
// that will subsequently be ingested (e.g. with AddSSTable). // TODO: disable keeping older versions in value blocks, though I think it is // unlikely to have multiple versions in this case anyway.
I don't think we can. This is used e.g. for Raft snapshots and C2C replication, which carry version history.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
Have we done any early prototyping to see the performance impact of value separation? Do we have a sense of value size distributions in real workloads (e.g. CC?)?
This is a prerequisite to storing older versions in value blocks in the same sstable which was experimented with in cockroachdb/pebble#1443. Note that this is not the full WiscKey-like approach, and was not predicated on the size of the value. But this same interface change is also needed for a WiscKey-like approach. I don't know the value size distribution, but will look into CC later. There is anecdotal evidence that there are enough workloads with a high value-size/key-size ratio.
I think we should consider storing MVCCValueHeader separately from the value as well. That would allow us to inspect e.g. the LocalTimestamp without loading the entire value.
The kv iteration path in a ssblock is heavily optimized, and the prototype needed to be very careful to minimize the overhead of extra metadata. My current plan is to use a single byte split into 3 bits for internal Pebble use, 3 bits for user-defined attributes (excluding the length) like whether it is a tombstone, 2 bits for internal optimizations (like skipping runs of SETs that are different versions of the same prefix. So I am keen to avoid prematurely generalizing.
Do you have some codepaths in mind that only need the LocalTimestamp? I am assuming that for uncertainty checking, the more recent version will typically be the latest version for that key, so will continue to be cheap with value-blocks. When full separation of values (into separate files) even the latest version may be expensive to retrieve, so yes in that case we should keep the MVCCValueHeader close-by. The interface changes to access only the MVCCValueHeader can be done when we get closer to that step.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/storage/engine.go
line 182 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: I found the MVCC prefix a bit confusing here, and initially thought it might be
len(MVCCValue.Value.RawBytes)
. I think I'd prefer removing the prefix to clarify that it returns the same value asValueLen()
, and return the tombstone if it successfully decodes an MVCC tombstone (silently discarding the decoding error otherwise). Alternatively, consider moving MVCC toValueLenAndIsMVCCTombstone
.
Our encoding/decoding story is already differentiated based on whether timestamp is empty or not -- for the former we always use MVCCMetadata
, including MVCCMetadata.RawBytes
, which only recently I realized is also how raft log entries are encoded. Which also means MVCCIterator.ValueProto
is quite useless in its generality -- it can only be used for MVCCMetadata
. So most code has to jump through the process of unmarshaling the MVCCMetadata
, then sticking the MVCCMetadata.RawBytes
into roachpb.Value.RawBytes
, and then call roachpb.Value.GetProto
. All of which happens above the storage package.
I wanted to sidestep this messiness, by saying that this interface is only for versioned/timestamped values, so the caller is asserting that it should be a MVCCValue
(which the caller anyway has to know).
We have precedent for making the caller differentiate in the engine interfaces: Writer
has {ClearMVCC,ClearUnversioned,ClearIntent}. There are also many methods with the term MVCC
in the Writer
interface, so we are simply following that same naming model.
And then we don't need to discard an error (which always make me nervous).
pkg/storage/engine.go
line 1498 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Let's add some assertions here for the new methods, e.g.
ValueLen() == len(UnsafeValue())
andIsTombstone
verification.
Done
pkg/storage/intent_interleaving_iter.go
line 900 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I don't think I'd bother with this -- it's idiomatic to return zero values for errors, and this could in fact encourage callers to ignore the error.
I was being paranoid in light of encountering recent buggy code that makes decisions based on length 0. I agree that it is better to find those bugs.
Done.
pkg/storage/mvcc.go
line 4024 at r1 (raw file):
if !s.atMVCCIter { // Defensive code: Returning the actual length in the error case is not // necessary.
Removed this defensive code
pkg/storage/pebble_iterator.go
line 564 at r1 (raw file):
if err != nil { // Defensive code: Returning the actual length in the error case is not // necessary.
Removed
pkg/storage/point_synthesizing_iter.go
line 656 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
This can legally happen if the range tombstone has an
MVCCValueHeader.LocalTimestamp
set. We'll have to decode the value to check, but we currently assume they're always range tombstones throughout the code base.
I realized yesterday that we are using MVCCValue encoding for range keys too. Fixed.
pkg/storage/sst_iterator.go
line 214 at r1 (raw file):
if err != nil { // Defensive code: Returning the actual length in the error case is not // necessary.
Removed
pkg/storage/verifying_iterator.go
line 119 at r1 (raw file):
if err != nil { // Defensive code: Returning the actual length in the error case is not // necessary.
Removed
pkg/storage/sst_writer.go
line 62 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I don't think we can. This is used e.g. for Raft snapshots and C2C replication, which carry version history.
Ack. Removed.
pkg/storage/sst_writer.go
line 78 at r1 (raw file):
// MakeBackupSSTWriter creates a new SSTWriter tailored for backup SSTs which // are typically only ever iterated in their entirety. // TODO: disable keeping older versions in value blocks.
Removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know the value size distribution, but will look into CC later. There is anecdotal evidence that there are enough workloads with a high value-size/key-size ratio.
Thanks, I was mostly curious.
Do you have some codepaths in mind that only need the LocalTimestamp?
Unnecessary value decoding in pebbleMVCCScanner
uncertainty checks would be the main one (but haven't looked at it in detail). It would also have been beneficial in the import ID case, since we could avoid decoding a large number of values that do not match the import ID when doing a predicate scan.
the more recent version will typically be the latest version for that key, so will continue to be cheap with value-blocks. When full separation of values (into separate files) even the latest version may be expensive to retrieve, so yes in that case we should keep the MVCCValueHeader close-by. The interface changes to access only the MVCCValueHeader can be done when we get closer to that step.
I see. I think most callers would care about the latest version, so I agree that this doesn't seem useful until we have full value separation.
Reviewed 11 of 11 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/storage/engine.go
line 182 at r1 (raw file):
Previously, sumeerbhola wrote…
Our encoding/decoding story is already differentiated based on whether timestamp is empty or not -- for the former we always use
MVCCMetadata
, includingMVCCMetadata.RawBytes
, which only recently I realized is also how raft log entries are encoded. Which also meansMVCCIterator.ValueProto
is quite useless in its generality -- it can only be used forMVCCMetadata
. So most code has to jump through the process of unmarshaling theMVCCMetadata
, then sticking theMVCCMetadata.RawBytes
intoroachpb.Value.RawBytes
, and then callroachpb.Value.GetProto
. All of which happens above the storage package.I wanted to sidestep this messiness, by saying that this interface is only for versioned/timestamped values, so the caller is asserting that it should be a
MVCCValue
(which the caller anyway has to know).
We have precedent for making the caller differentiate in the engine interfaces:Writer
has {ClearMVCC,ClearUnversioned,ClearIntent}. There are also many methods with the termMVCC
in theWriter
interface, so we are simply following that same naming model.And then we don't need to discard an error (which always make me nervous).
This is fine. Ignoring the error was a bad idea.
pkg/storage/engine.go
line 1584 at r2 (raw file):
} if key.IsValue() { valueLen2, _, err := iter.MVCCValueLenAndIsTombstone()
nit: for completeness, let's assert that when true
, UnsafeValue()
decodes to an MVCC range tombstone, and vice-versa.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/storage/engine.go
line 1584 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: for completeness, let's assert that when
true
,UnsafeValue()
decodes to an MVCC range tombstone, and vice-versa.
Done
SimpleMVCCIterator already separates the methods to retrieve the key and value, and callers who do not need any information about the value do not call UnsafeValue(). However, there are callers (like stats calculation) who need to know the length of the value and in some cases whether an MVCC value is a tombstone. The MVCCValueLenAndIsTombstone() and ValueLen() methods are introduced for these cases. This is preparation for the near future where retrieving the value in Pebble will have actual additional cost, since the value may be in a different part of the file or a different file. Non-test callers that can use these new interfaces have been modified. There are some todos to modify some non-trivial callers, related to deciding what to GC, and checking sstable conflicts, which will happen in future PRs. When new methods are introduced to retrieve these value attributes in pebble.Iterator, only pebbleIterator.{MVCCValueLenAndIsTombstone, ValueLen} will need to be modified to make use of them. Release note: None
bors r=erikgrinaker |
Build succeeded: |
SimpleMVCCIterator already separates the methods to retrieve the key and value, and callers who do not need any information about the value do not call UnsafeValue(). However, there are callers (like stats calculation) who need to know the length of the value and in some cases whether an MVCC value is a tombstone. The MVCCValueLenAndIsTombstone() and ValueLen() methods are introduced for these cases. This is preparation for the near future where retrieving the value in Pebble will have actual additional cost, since the value may be in a different part of the file or a different file.
Non-test callers that can use these new interfaces have been modified. There are some todos to modify some non-trivial callers, related to deciding what to GC, and checking sstable conflicts, which will happen in future PRs.
When new methods are introduced to retrieve these value attributes in pebble.Iterator, only pebbleIterator.{MVCCValueLenAndIsTombstone, ValueLen} will need to be modified to make use of them.
Informs cockroachdb/pebble#1170
Release note: None