Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage,kvserver: support lazy value fetching in SimpleMVCCIterator #89119

Merged
merged 1 commit into from
Oct 5, 2022

Conversation

sumeerbhola
Copy link
Collaborator

@sumeerbhola sumeerbhola commented Sep 30, 2022

SimpleMVCCIterator already separates the methods to retrieve the key and value, and callers who do not need any information about the value do not call UnsafeValue(). However, there are callers (like stats calculation) who need to know the length of the value and in some cases whether an MVCC value is a tombstone. The MVCCValueLenAndIsTombstone() and ValueLen() methods are introduced for these cases. This is preparation for the near future where retrieving the value in Pebble will have actual additional cost, since the value may be in a different part of the file or a different file.

Non-test callers that can use these new interfaces have been modified. There are some todos to modify some non-trivial callers, related to deciding what to GC, and checking sstable conflicts, which will happen in future PRs.

When new methods are introduced to retrieve these value attributes in pebble.Iterator, only pebbleIterator.{MVCCValueLenAndIsTombstone, ValueLen} will need to be modified to make use of them.

Informs cockroachdb/pebble#1170

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@sumeerbhola sumeerbhola marked this pull request as ready for review October 3, 2022 20:40
@sumeerbhola sumeerbhola requested review from a team as code owners October 3, 2022 20:40
Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exciting! Have we done any early prototyping to see the performance impact of value separation? Do we have a sense of value size distributions in real workloads (e.g. CC?)?

I think we should consider storing MVCCValueHeader separately from the value as well. That would allow us to inspect e.g. the LocalTimestamp without loading the entire value, and we may extend its use to other metadata as well (as you recall, we very nearly added an import ID field here with a scan predicate). In that case, we may consider adding in MVCCValueHeader() here too.

Reviewed 16 of 16 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/storage/engine.go line 182 at r1 (raw file):

	// underlying storage layer to avoid retrieving the value.
	// REQUIRES: HasPointAndRange() has returned (true, *).
	MVCCValueLenAndIsTombstone() (int, bool, error)

nit: I found the MVCC prefix a bit confusing here, and initially thought it might be len(MVCCValue.Value.RawBytes). I think I'd prefer removing the prefix to clarify that it returns the same value as ValueLen(), and return the tombstone if it successfully decodes an MVCC tombstone (silently discarding the decoding error otherwise). Alternatively, consider moving MVCC to ValueLenAndIsMVCCTombstone.


pkg/storage/engine.go line 1498 at r1 (raw file):

// returning errors.AssertionFailedf for any violations. The iterator
// must be valid.
func assertSimpleMVCCIteratorInvariants(iter SimpleMVCCIterator) error {

Let's add some assertions here for the new methods, e.g. ValueLen() == len(UnsafeValue()) and IsTombstone verification.


pkg/storage/intent_interleaving_iter.go line 900 at r1 (raw file):

func (i *intentInterleavingIter) MVCCValueLenAndIsTombstone() (int, bool, error) {
	if i.isCurAtIntentIter() {
		// Defensive code: Returning the actual length in the error case is not

I don't think I'd bother with this -- it's idiomatic to return zero values for errors, and this could in fact encourage callers to ignore the error.


pkg/storage/point_synthesizing_iter.go line 656 at r1 (raw file):

	val := i.rangeKeys[i.rangeKeysIdx].Value
	if len(val) != 0 {
		return 0, false, errors.Errorf("unexpected non-empty value for range key")

This can legally happen if the range tombstone has an MVCCValueHeader.LocalTimestamp set. We'll have to decode the value to check, but we currently assume they're always range tombstones throughout the code base.


pkg/storage/sst_writer.go line 62 at r1 (raw file):

// that will subsequently be ingested (e.g. with AddSSTable).
// TODO: disable keeping older versions in value blocks, though I think it is
// unlikely to have multiple versions in this case anyway.

I don't think we can. This is used e.g. for Raft snapshots and C2C replication, which carry version history.

Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Have we done any early prototyping to see the performance impact of value separation? Do we have a sense of value size distributions in real workloads (e.g. CC?)?

This is a prerequisite to storing older versions in value blocks in the same sstable which was experimented with in cockroachdb/pebble#1443. Note that this is not the full WiscKey-like approach, and was not predicated on the size of the value. But this same interface change is also needed for a WiscKey-like approach. I don't know the value size distribution, but will look into CC later. There is anecdotal evidence that there are enough workloads with a high value-size/key-size ratio.

I think we should consider storing MVCCValueHeader separately from the value as well. That would allow us to inspect e.g. the LocalTimestamp without loading the entire value.

The kv iteration path in a ssblock is heavily optimized, and the prototype needed to be very careful to minimize the overhead of extra metadata. My current plan is to use a single byte split into 3 bits for internal Pebble use, 3 bits for user-defined attributes (excluding the length) like whether it is a tombstone, 2 bits for internal optimizations (like skipping runs of SETs that are different versions of the same prefix. So I am keen to avoid prematurely generalizing.
Do you have some codepaths in mind that only need the LocalTimestamp? I am assuming that for uncertainty checking, the more recent version will typically be the latest version for that key, so will continue to be cheap with value-blocks. When full separation of values (into separate files) even the latest version may be expensive to retrieve, so yes in that case we should keep the MVCCValueHeader close-by. The interface changes to access only the MVCCValueHeader can be done when we get closer to that step.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)


pkg/storage/engine.go line 182 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: I found the MVCC prefix a bit confusing here, and initially thought it might be len(MVCCValue.Value.RawBytes). I think I'd prefer removing the prefix to clarify that it returns the same value as ValueLen(), and return the tombstone if it successfully decodes an MVCC tombstone (silently discarding the decoding error otherwise). Alternatively, consider moving MVCC to ValueLenAndIsMVCCTombstone.

Our encoding/decoding story is already differentiated based on whether timestamp is empty or not -- for the former we always use MVCCMetadata, including MVCCMetadata.RawBytes, which only recently I realized is also how raft log entries are encoded. Which also means MVCCIterator.ValueProto is quite useless in its generality -- it can only be used for MVCCMetadata. So most code has to jump through the process of unmarshaling the MVCCMetadata, then sticking the MVCCMetadata.RawBytes into roachpb.Value.RawBytes, and then call roachpb.Value.GetProto. All of which happens above the storage package.

I wanted to sidestep this messiness, by saying that this interface is only for versioned/timestamped values, so the caller is asserting that it should be a MVCCValue (which the caller anyway has to know).
We have precedent for making the caller differentiate in the engine interfaces: Writer has {ClearMVCC,ClearUnversioned,ClearIntent}. There are also many methods with the term MVCC in the Writer interface, so we are simply following that same naming model.

And then we don't need to discard an error (which always make me nervous).


pkg/storage/engine.go line 1498 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Let's add some assertions here for the new methods, e.g. ValueLen() == len(UnsafeValue()) and IsTombstone verification.

Done


pkg/storage/intent_interleaving_iter.go line 900 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

I don't think I'd bother with this -- it's idiomatic to return zero values for errors, and this could in fact encourage callers to ignore the error.

I was being paranoid in light of encountering recent buggy code that makes decisions based on length 0. I agree that it is better to find those bugs.
Done.


pkg/storage/mvcc.go line 4024 at r1 (raw file):

	if !s.atMVCCIter {
		// Defensive code: Returning the actual length in the error case is not
		// necessary.

Removed this defensive code


pkg/storage/pebble_iterator.go line 564 at r1 (raw file):

	if err != nil {
		// Defensive code: Returning the actual length in the error case is not
		// necessary.

Removed


pkg/storage/point_synthesizing_iter.go line 656 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

This can legally happen if the range tombstone has an MVCCValueHeader.LocalTimestamp set. We'll have to decode the value to check, but we currently assume they're always range tombstones throughout the code base.

I realized yesterday that we are using MVCCValue encoding for range keys too. Fixed.


pkg/storage/sst_iterator.go line 214 at r1 (raw file):

	if err != nil {
		// Defensive code: Returning the actual length in the error case is not
		// necessary.

Removed


pkg/storage/verifying_iterator.go line 119 at r1 (raw file):

	if err != nil {
		// Defensive code: Returning the actual length in the error case is not
		// necessary.

Removed


pkg/storage/sst_writer.go line 62 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

I don't think we can. This is used e.g. for Raft snapshots and C2C replication, which carry version history.

Ack. Removed.


pkg/storage/sst_writer.go line 78 at r1 (raw file):

// MakeBackupSSTWriter creates a new SSTWriter tailored for backup SSTs which
// are typically only ever iterated in their entirety.
// TODO: disable keeping older versions in value blocks.

Removed.

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know the value size distribution, but will look into CC later. There is anecdotal evidence that there are enough workloads with a high value-size/key-size ratio.

Thanks, I was mostly curious.

Do you have some codepaths in mind that only need the LocalTimestamp?

Unnecessary value decoding in pebbleMVCCScanner uncertainty checks would be the main one (but haven't looked at it in detail). It would also have been beneficial in the import ID case, since we could avoid decoding a large number of values that do not match the import ID when doing a predicate scan.

the more recent version will typically be the latest version for that key, so will continue to be cheap with value-blocks. When full separation of values (into separate files) even the latest version may be expensive to retrieve, so yes in that case we should keep the MVCCValueHeader close-by. The interface changes to access only the MVCCValueHeader can be done when we get closer to that step.

I see. I think most callers would care about the latest version, so I agree that this doesn't seem useful until we have full value separation.

Reviewed 11 of 11 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/storage/engine.go line 182 at r1 (raw file):

Previously, sumeerbhola wrote…

Our encoding/decoding story is already differentiated based on whether timestamp is empty or not -- for the former we always use MVCCMetadata, including MVCCMetadata.RawBytes, which only recently I realized is also how raft log entries are encoded. Which also means MVCCIterator.ValueProto is quite useless in its generality -- it can only be used for MVCCMetadata. So most code has to jump through the process of unmarshaling the MVCCMetadata, then sticking the MVCCMetadata.RawBytes into roachpb.Value.RawBytes, and then call roachpb.Value.GetProto. All of which happens above the storage package.

I wanted to sidestep this messiness, by saying that this interface is only for versioned/timestamped values, so the caller is asserting that it should be a MVCCValue (which the caller anyway has to know).
We have precedent for making the caller differentiate in the engine interfaces: Writer has {ClearMVCC,ClearUnversioned,ClearIntent}. There are also many methods with the term MVCC in the Writer interface, so we are simply following that same naming model.

And then we don't need to discard an error (which always make me nervous).

This is fine. Ignoring the error was a bad idea.


pkg/storage/engine.go line 1584 at r2 (raw file):

		}
		if key.IsValue() {
			valueLen2, _, err := iter.MVCCValueLenAndIsTombstone()

nit: for completeness, let's assert that when true, UnsafeValue() decodes to an MVCC range tombstone, and vice-versa.

Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/storage/engine.go line 1584 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: for completeness, let's assert that when true, UnsafeValue() decodes to an MVCC range tombstone, and vice-versa.

Done

SimpleMVCCIterator already separates the methods to retrieve the
key and value, and callers who do not need any information about the
value do not call UnsafeValue(). However, there are callers (like
stats calculation) who need to know the length of the value and
in some cases whether an MVCC value is a tombstone. The
MVCCValueLenAndIsTombstone() and ValueLen() methods are introduced
for these cases. This is preparation for the near future where
retrieving the value in Pebble will have actual additional cost,
since the value may be in a different part of the file or a different
file.

Non-test callers that can use these new interfaces have been
modified. There are some todos to modify some non-trivial callers,
related to deciding what to GC, and checking sstable conflicts,
which will happen in future PRs.

When new methods are introduced to retrieve these value attributes
in pebble.Iterator, only pebbleIterator.{MVCCValueLenAndIsTombstone,
ValueLen} will need to be modified to make use of them.

Release note: None
@sumeerbhola
Copy link
Collaborator Author

bors r=erikgrinaker

@craig
Copy link
Contributor

craig bot commented Oct 5, 2022

Build succeeded:

@craig craig bot merged commit 5a730da into cockroachdb:master Oct 5, 2022
@sumeerbhola sumeerbhola deleted the val_sep1 branch November 18, 2022 16:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants