
Conversation

@benesch
Contributor

@benesch benesch commented Jun 28, 2018

Fix a known correctness bug in the merge locking scheme. See the comment
updates within for details.

As an added benefit, this fix frees store.RemoveReplica of conditionals
to deal with uninitialized replicas. Acquiring the merge lock now always
results in an initialized right-hand replica, so store.MergeRange is
always removing an initialized right-hand replica.

There are no tests for this because testing this particular edge case is
quite difficult and I'm primarily interested in avoiding the obviously
wrong call to batch.ClearRange(nil, nil) that occurs when calling
store.RemoveReplica on an uninitialized replica. This nonsensical
ClearRange is, somehow, not currently an error, but it is preventing
#26872 ("engine: require iterators to specify an upper bound") from
landing.

Release note: None


P.S. The storage.RemoveOptions struct now bundles precisely one parameter. Let me know if you'd prefer I change back the signature of store.removeReplicaInternal to take the bool directly. Bool parameters are frowned upon, though, so I wasn't sure what was preferred.
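(For concreteness, a self-contained sketch of the options-struct-vs-bool tradeoff; the field and function names here are illustrative, not necessarily the actual storage API.)

```
package main

import "fmt"

// RemoveOptions is a stand-in for the real struct; the field name is
// assumed purely for illustration.
type RemoveOptions struct {
	DestroyData bool
}

// With a bare bool the call site reads removeReplica("r42", true) and
// the reader has to look up what true means; the options struct spells
// the intent out at the call site.
func removeReplica(name string, opts RemoveOptions) {
	fmt.Printf("removing %s (destroy data: %t)\n", name, opts.DestroyData)
}

func main() {
	removeReplica("r42", RemoveOptions{DestroyData: true})
}
```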

@benesch benesch requested review from a team, bdarnell and tbg June 28, 2018 18:19
@cockroach-teamcity
Member

This change is Reviewable

@benesch benesch force-pushed the merge-placeholder branch 3 times, most recently from dff0338 to c3db5fe on June 28, 2018 19:41
@tbg
Member

tbg commented Jul 2, 2018

Reviewed 2 of 2 files at r1.
Review status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 5162 at r1 (raw file):

		// rightRepl. This adds rightRepl to the replicasByKey map so that snapshots
		// for overlapping keyspaces will block on its raftMu.
		if err := rightRepl.setDesc(&merge.RightDesc); err != nil {

I'm almost certain this is not what we want. By setting a descriptor here, you make this an initialized replica, so it's not just a placeholder -- it's a full-blown replica that thinks it's initialized (see (*Replica).IsInitialized()). There's maybe not a huge problem with this in the happy case because you'll immediately merge this replica away, but you definitely don't undo what you just did in the unhappy case. Plus, this Replica is likely to explode, since it's not in its own RangeDescriptor, and even more so, it might send Raft messages that possibly disrupt the peers. I really think we only want a placeholder here (or some argument that the old code is sufficient, but it's not clear to me).
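(A minimal, simplified sketch of the distinction in play here, assuming the initialization check boils down to whether the descriptor carries an end key; the types are stand-ins, not the real roachpb/storage ones.)

```
package main

import "fmt"

// rangeDescriptor is a simplified stand-in for roachpb.RangeDescriptor.
type rangeDescriptor struct {
	StartKey, EndKey []byte
}

// isInitialized mirrors the idea behind (*Replica).IsInitialized: a
// replica counts as initialized once its descriptor has an end key.
func (d rangeDescriptor) isInitialized() bool {
	return len(d.EndKey) != 0
}

func main() {
	var placeholder rangeDescriptor                // no descriptor assigned yet
	fmt.Println(placeholder.isInitialized())       // false: still just a placeholder

	right := rangeDescriptor{StartKey: []byte("b"), EndKey: []byte("c")}
	fmt.Println(right.isInitialized())             // true: a full-blown replica
}
```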


Comments from Reviewable

@benesch
Contributor Author

benesch commented Jul 2, 2018

Review status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 5162 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I'm almost certain this is not what we want. By setting a descriptor here, you make this an initialized replica, so it's not just a placeholder -- it's a full-blown replica that thinks it's initialized (see (*Replica).IsInitialized()). There's maybe not a huge problem with this in the happy case because you'll immediately merge this replica away, but you definitely don't undo what you just did in the unhappy case. Plus, this Replica is likely to explode, since it's not in its own RangeDescriptor, and even more so, it might send Raft messages that possibly disrupt the peers. I really think we only want a placeholder here (or some argument that the old code is sufficient, but it's not clear to me).

It can't send Raft messages because we hold its raftMu, right? And unless I'm mistaken the merge cannot fail after this point. If it does the node crashes.

I agree that the old code was not safe.


Comments from Reviewable

@tbg
Member

tbg commented Jul 2, 2018

Review status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 5162 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

It can't send Raft messages because we hold its raftMu, right? And unless I'm mistaken the merge cannot fail after this point. If it does the node crashes.

I agree that the old code was not safe.

It's true that the code has no way to gracefully fail after this lock is acquired, at least not today. I think you're also right about not being able to handle/send Raft messages, and it's nice that this unifies two code paths (by making the RHS replica always "exist").

But at the same time raftMu isn't what protects the descriptor; that's r.mu. So you're still putting a fake-initialized replica into the store's map for the world to see. Even if this worked today and never got broken, I still think it's not clear and it will always be confusing and something to keep in mind elsewhere. We have replica placeholders for pretty much this sort of thing; I'd much prefer if we used them instead.
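(A toy illustration of the locking point, with made-up field names rather than the real Replica struct: holding raftMu does not guard the descriptor, which lives under a separate mutex, so other goroutines can still observe whatever descriptor was installed.)

```
package main

import "sync"

// replica is a toy stand-in; field names are illustrative.
type replica struct {
	raftMu sync.Mutex // serializes Raft processing

	mu struct {
		sync.Mutex
		desc string // stand-in for the range descriptor
	}
}

func (r *replica) setDesc(d string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.mu.desc = d
}

func (r *replica) getDesc() string {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.mu.desc
}

func main() {
	var r replica
	r.raftMu.Lock()
	r.setDesc("[b, c)") // visible to any goroutine that grabs r.mu...
	_ = r.getDesc()     // ...even while raftMu is still held
	r.raftMu.Unlock()
}
```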


Comments from Reviewable

@benesch
Contributor Author

benesch commented Jul 2, 2018

Hmm, ok. I'll be more comfortable using the combination of a placeholder and an uninitialized replica if we can clean up the cleanup code (heh). A failed split sometimes does this:

rightRng.mu.Lock()
rightRng.mu.destroyStatus.Set(errors.Errorf("%s: failed to initialize", rightRng), destroyReasonRemoved)
rightRng.mu.Unlock()
r.store.mu.Lock()
r.store.unquiescedReplicas.Lock()
delete(r.store.unquiescedReplicas.m, rightRng.RangeID)
r.store.unquiescedReplicas.Unlock()
r.store.mu.replicas.Delete(int64(rightRng.RangeID))
delete(r.store.mu.uninitReplicas, rightRng.RangeID)
r.store.replicaQueues.Delete(int64(rightRng.RangeID))
r.store.mu.Unlock()

And sometimes this:

cockroach/pkg/storage/store.go

Lines 2239 to 2244 in c094d1d

delete(s.mu.uninitReplicas, newDesc.RangeID)
s.unquiescedReplicas.Lock()
delete(s.unquiescedReplicas.m, newDesc.RangeID)
s.unquiescedReplicas.Unlock()
s.mu.replicas.Delete(int64(newDesc.RangeID))
s.replicaQueues.Delete(int64(newDesc.RangeID))

But then all that logic is also duplicated in:

cockroach/pkg/storage/store.go

Lines 2536 to 2541 in c094d1d

s.unquiescedReplicas.Lock()
delete(s.unquiescedReplicas.m, rep.RangeID)
s.unquiescedReplicas.Unlock()
s.mu.replicas.Delete(int64(rep.RangeID))
delete(s.mu.uninitReplicas, rep.RangeID)
s.replicaQueues.Delete(int64(rep.RangeID))

And for some reason store.removeReplicaImpl can remove a placeholder replica??

delete(s.mu.replicaPlaceholders, rep.RangeID)
Except it blew up if you passed it a placeholder replica until I plumbed the AllowPlaceholders option, so why did that line exist before?
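(For concreteness, a stripped-down, self-contained sketch of the unification being proposed here; the type and method names are made up for illustration, not the helper the follow-up PR actually adds.)

```
package main

import "sync"

// miniStore is a stripped-down stand-in for the store.
type miniStore struct {
	mu struct {
		sync.Mutex
		replicas       map[int64]struct{}
		uninitReplicas map[int64]struct{}
	}
	unquiescedReplicas struct {
		sync.Mutex
		m map[int64]struct{}
	}
	replicaQueues map[int64]struct{}
}

// unlinkReplicaLocked scrubs rangeID from every bookkeeping map, which
// is what all three duplicated snippets above do by hand.
// Caller must hold miniStore.mu.
func (s *miniStore) unlinkReplicaLocked(rangeID int64) {
	s.unquiescedReplicas.Lock()
	delete(s.unquiescedReplicas.m, rangeID)
	s.unquiescedReplicas.Unlock()
	delete(s.mu.replicas, rangeID)
	delete(s.mu.uninitReplicas, rangeID)
	delete(s.replicaQueues, rangeID)
}

func main() {
	s := &miniStore{}
	s.mu.replicas = map[int64]struct{}{7: {}}
	s.mu.uninitReplicas = map[int64]struct{}{7: {}}
	s.unquiescedReplicas.m = map[int64]struct{}{7: {}}
	s.replicaQueues = map[int64]struct{}{7: {}}

	s.mu.Lock()
	s.unlinkReplicaLocked(7)
	s.mu.Unlock()
}
```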

@benesch
Contributor Author

benesch commented Jul 2, 2018

tl;dr is there's another PR coming your way to unify this logic first.

@benesch
Contributor Author

benesch commented Jul 3, 2018

Review status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 5162 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

It's true that the code has no way to gracefully fail after this lock is acquired, at least not today. I think you're also right about not being able to handle/send Raft messages, and it's nice that this unifies two code paths (by making the RHS replica always "exist").

But at the same time raftMu isn't what protects the descriptor; that's r.mu. So you're still putting a fake-initialized replica into the store's map for the world to see. Even if this worked today and never got broken, I still think it's not clear and it will always be confusing and something to keep in mind elsewhere. We have replica placeholders for pretty much this sort of thing; I'd much prefer if we used them instead.

Ok, reworked this. How's it look now?


Comments from Reviewable

craig bot pushed a commit that referenced this pull request Jul 3, 2018
27112: roachpb: replace `gogoproto.onlyone` with `oneof` in BatchRequest/BatchResponse r=nvanbenschoten a=nvanbenschoten

All Requests and Responses pass through RequestUnion/ResponseUnion structs
when they are added to BatchRequests/BatchResponses. In order to ensure
that only one Request type can be assigned to one of these RequestUnion
or ResponseUnion structs, we currently use gogoproto's approach to tagged
unions: the `gogoproto.onlyone` option.

This option was introduced before proto3. Proto3
then added the `oneof` option, which for all intents and purposes addresses
the same issue: https://developers.google.com/protocol-buffers/docs/proto#oneof.
However, there is one major difference between the two options, which
is in their generated code. `gogoproto.onlyone` will generate
a single flat struct with pointers to each possible variant type.
`oneof` will generate a union interface and an interface "wrapper"
struct for each variant type. The effect of this is that `onlyone`
will generate code that looks like this:

```
type Union struct {
    Variant1 *Variant1Type
    Variant2 *Variant2Type
    ...
}
```

While `oneof` will generate code that looks like this:

```
type Union struct {
    Value isUnion_Value
}

type isUnion_Value interface {
    ...
}

type Union_Variant1 struct {
    Variant1 *Variant1Type
}

type Union_Variant2 struct {
    Variant2 *Variant2Type
}
```

There are pretty obvious tradeoffs to each. For one, `oneof` introduces an
extra layer of indirection, which forces an extra allocation. It also doesn't
generate particularly useful setters and getters. On the other hand, `onlyone`
creates a large struct that grows linearly with the number of variants.
Neither approach is great, and there has been **A LOT** of discussion on this:
- golang/protobuf#78
- golang/protobuf#283
- gogo/protobuf#103
- gogo/protobuf#168

Clearly neither approach is ideal, ergonomically or with regard to performance.
However, over time, the tradeoff has been getting worse for us and it's time we
consider switching over to `oneof` in `RequestUnion` and `ResponseUnion`. These
structs have gotten huge as more and more request variants have been added:
`RequestUnion` has grown to **328 bytes** and `ResponseUnion` has grown to **320 bytes**.
It has gotten to the point where the wasted space is non-negligible.

This change switches over to `oneof` to shrink these union structs down to more
manageable sizes (16 bytes each). The downside of this is that in reducing the struct
size we end up introducing an extra allocation. This isn't great, but we can avoid
the extra allocation in some places (like `BatchRequest.CreateReply`) by grouping
the allocation with that of the Request/Response itself. We've seen previous cases
like #4216 where adding in an extra allocation/indirection is a net-win if it
reduces a commonly used struct's size significantly.
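(An assumed illustration of the "group the allocation" idea, not the actual CreateReply code; the type names reuse the illustrative Variant names from the generated-code examples above.)

```
package main

import "fmt"

// Illustrative types, not the real roachpb ones.
type Variant1Type struct{ A int }

type Union_Variant1 struct{ Variant1 *Variant1Type }

// variant1Alloc groups the oneof wrapper and its payload so that both
// come from a single heap allocation.
type variant1Alloc struct {
	wrapper Union_Variant1
	value   Variant1Type
}

func newWrappedVariant1(a int) *Union_Variant1 {
	buf := new(variant1Alloc) // one allocation for wrapper + payload
	buf.value.A = a
	buf.wrapper.Variant1 = &buf.value
	return &buf.wrapper
}

func main() {
	fmt.Println(newWrappedVariant1(42).Variant1.A) // 42
}
```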

The other downside to this change is that the ergonomics of `oneof` aren't quite
as nice as `gogo.onlyone`. Specifically, `gogo.onlyone` generates getters and
setters called `GetValue` and `SetValue` that provide access to the wrapped
`interface{}`, which we can assert to a `Request`. `oneof` doesn't provide
such facilities. This was the cause of a lot of the discussions linked above.
While it would be nice for this to be resolved upstream, I think we've waited long
enough (~3 years) for a resolution to those discussions. For now, we'll just
generate the getters and setters ourselves.
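(A hand-rolled sketch of such accessors, reusing the illustrative Union/Variant names from the generated-code examples above rather than the real roachpb types.)

```
package main

import "fmt"

// Illustrative types matching the generated-code sketch above.
type Variant1Type struct{ A int }
type Variant2Type struct{ B string }

type isUnion_Value interface{ isUnionValue() }

type Union_Variant1 struct{ Variant1 *Variant1Type }
type Union_Variant2 struct{ Variant2 *Variant2Type }

func (*Union_Variant1) isUnionValue() {}
func (*Union_Variant2) isUnionValue() {}

type Union struct{ Value isUnion_Value }

// GetInner unwraps whichever variant is set, standing in for the
// GetValue accessor that gogoproto.onlyone used to generate.
func (u Union) GetInner() interface{} {
	switch t := u.Value.(type) {
	case *Union_Variant1:
		return t.Variant1
	case *Union_Variant2:
		return t.Variant2
	default:
		return nil
	}
}

// SetInner wraps a variant into the union, reporting whether the type
// was recognized.
func (u *Union) SetInner(v interface{}) bool {
	switch t := v.(type) {
	case *Variant1Type:
		u.Value = &Union_Variant1{Variant1: t}
	case *Variant2Type:
		u.Value = &Union_Variant2{Variant2: t}
	default:
		return false
	}
	return true
}

func main() {
	var u Union
	u.SetInner(&Variant2Type{B: "hello"})
	fmt.Printf("%+v\n", u.GetInner()) // &{B:hello}
}
```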

This change demonstrated about a **5%** improvement when running kv95 on my local
laptop. When run on a three-node GCE cluster (4 vCPUs), the improvements were
less pronounced but still present. kv95 showed a throughput improvement of **2.4%**.
Running kv100 showed a much more dramatic improvement of **18%** on the three-node
GCE cluster. I think this is because kv100 is essentially a hot loop where all reads miss
because the cluster remains empty, so it's the best-case scenario for this change. Still,
the impact was shocking.

Release note (performance improvement): Reduce the memory size of commonly used
Request and Response objects.

27114: opt/sql: fix explain analyze missing option r=asubiotto a=asubiotto

ConstructExplain previously ignored the ANALYZE option so any EXPLAIN
ANALYZE statement would result in execution as an EXPLAIN (DISTSQL)
statement. The ANALYZE option is now observed in ConstructExplain.

Additionally, the stmtType field from the explainDistSQLNode has been
removed because it was not necessary and it was unclear how to pass this
from the `execFactory`.

Release note: None

27116: Makefile: learn that roachtest depends on optimizer-generated code r=benesch a=benesch

As mentioned in cd4415c, the Makefile will one day be smart enough to
deduce this on its own, but for now it's simpler to explicitly list the
commands that require generated code. Note that the simple but coarse
solution of assuming that all commands depend on generated code is
not viable as some of these commands are used to generate the code in the
first place.

Release note: None

27119: storage: extract replica unlinking into store method r=tschottdorf a=benesch

Extract some code that was duplicated in three places into a dedicated
helper method. Prerequisite for #27061.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Alfonso Subiotto Marqués <alfonso@cockroachlabs.com>
Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
@tbg
Member

tbg commented Jul 3, 2018

:lgtm: though PR/commit message might've rotted (can't really see this in reviewable).


Reviewed 2 of 2 files at r2.
Review status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica.go, line 5215 at r2 (raw file):

	return func(storagebase.ReplicatedEvalResult) {
		rightRepl.raftMu.Unlock()

In the interest of future proofing, can you remove the placeholder if there is one (in reality, today, the merge looks like it will commit and should remove the placeholder in the process by destroying the RHS, right?).


pkg/storage/store.go, line 2373 at r1 (raw file):

// addPlaceholderLocked adds the specified placeholder. Requires that Store.mu
// and Replica.raftMu are held.

Is there some obscure requirement here about some uninitialized replica for the given RangeID?


Comments from Reviewable

@benesch benesch force-pushed the merge-placeholder branch from a15e5b3 to 46f04e6 on July 3, 2018 20:03
@benesch
Contributor Author

benesch commented Jul 3, 2018

Ugh, I had to rework this slightly. What I was doing before was wrong. It's much easier to handle the uninitialized case separately from the initialized case. PTAL.


Review status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica.go, line 5215 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

In the interest of future proofing, can you remove the placeholder if there is one (in reality, today, the merge looks like it will commit and should remove the placeholder in the process by destroying the RHS, right?).

Done.


pkg/storage/store.go, line 2373 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Is there some obscure requirement here about some uninitialized replica for the given RangeID?

Oops, yes. #11657

I updated the comment to be far more clear.


Comments from Reviewable

@benesch benesch force-pushed the merge-placeholder branch 2 times, most recently from 8ab71a0 to 779cd00 on July 3, 2018 23:01
tbg previously requested changes Jul 4, 2018
Member

@tbg tbg left a comment

:lgtm: just curious, what was the problem with the old code?

Reviewed 2 of 2 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica.go, line 5215 at r3 (raw file):

			r.store.mu.Lock()
			if _, ok := r.store.mu.replicasByKey.Get(placeholder).(*ReplicaPlaceholder); ok {
				log.Fatalf(ctx, "merge did not remove placeholder %+v from replicasByKey", placeholder)

Just for OCD, would you Unlock here and below before Fatal'ing? This won't matter in most cases, but if the stringer for something passed into the Fatal would need the store lock, it would deadlock in a nondescript way.
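(A self-contained sketch of the unlock-before-Fatal pattern being asked for, with made-up names: release the mutex before the fatal log call so nothing the logger, or a String method on its arguments, does can deadlock on the same lock.)

```
package main

import (
	"log"
	"sync"
)

// store is a toy stand-in; the names are illustrative.
type store struct {
	mu    sync.Mutex
	items map[int]string
}

// checkRemoved releases the mutex before calling the fatal logger, so
// anything the logger invokes cannot deadlock on the same lock.
func (s *store) checkRemoved(id int) {
	s.mu.Lock()
	if _, ok := s.items[id]; ok {
		s.mu.Unlock() // unlock first; log.Fatalf never returns
		log.Fatalf("item %d was not removed", id)
	}
	s.mu.Unlock()
}

func main() {
	s := &store{items: map[int]string{}}
	s.checkRemoved(1) // nothing left behind, so no fatal
}
```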

@benesch benesch force-pushed the merge-placeholder branch from 779cd00 to d7c9c46 on July 4, 2018 21:25
Contributor Author

@benesch benesch left a comment

The way I had plumbed opts.AllowPlaceholders was preventing the placeholder from getting removed immediately from replicasByKey. Whoops. (It was getting removed on the next Raft tick or something, so it wasn't causing problems, but still wonky.) Anyway, it's much easier to reason about the code if uninitialized replicas never make it to removeReplicaImpl.

Thanks for the review!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica.go, line 5215 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Just for OCD, would you Unlock here and below before Fatal'ing? This won't matter in most cases, but if the stringer for something passed into the Fatal would need the store lock, it would deadlock in a nondescript way.

Done.

@benesch
Contributor Author

benesch commented Jul 4, 2018

bors r=tschottdorf

@craig
Contributor

craig bot commented Jul 4, 2018

Merge conflict

Contributor

@bdarnell bdarnell left a comment

:lgtm:

Reviewed 1 of 2 files at r3, 1 of 1 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)

Fix a known correctness bug in the merge locking scheme. See the comment
updates within for details.

As an added benefit, this fix frees store.RemoveReplica of conditionals
to deal with uninitialized replicas.

There are no tests for this because testing this particular edge case is
quite difficult and I'm primarily interested in avoiding the obviously
wrong call to batch.ClearRange(nil, nil) that occurs when calling
store.RemoveReplica on an uninitialized replica. This nonsensical
ClearRange is, somehow, not currently an error, but it is preventing
#26872 ("engine: require iterators to specify an upper bound") from
landing.

Release note: None
@benesch benesch force-pushed the merge-placeholder branch from d7c9c46 to 0712a55 on July 5, 2018 15:01
@benesch
Copy link
Contributor Author

benesch commented Jul 5, 2018

bors r=tschottdorf,bdarnell

craig bot pushed a commit that referenced this pull request Jul 5, 2018
27061: storage: improve safety of merge lock r=tschottdorf,bdarnell a=benesch

Fix a known correctness bug in the merge locking scheme. See the comment
updates within for details.

As an added benefit, this fix frees store.RemoveReplica of conditionals
to deal with uninitialized replicas. Acquiring the merge lock now always
results in an initialized right-hand replica, so store.MergeRange is
always removing an initialized right-hand replica.

There are no tests for this because testing this particular edge case is
quite difficult and I'm primarily interested in avoiding the obviously
wrong call to batch.ClearRange(nil, nil) that occurs when calling
store.RemoveReplica on an uninitialized replica. This nonsensical
ClearRange is, somehow, not currently an error, but it is preventing
#26872 ("engine: require iterators to specify an upper bound") from
landing.

Release note: None

---

P.S. The `storage.RemoveOptions` struct now bundles precisely one parameter. Let me know if you'd prefer I change back the signature of `store.removeReplicaInternal` to take the bool directly. Bool parameters are frowned upon, though, so I wasn't sure what was preferred.

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
@craig
Contributor

craig bot commented Jul 5, 2018

Build succeeded

@craig craig bot merged commit 0712a55 into cockroachdb:master Jul 5, 2018
@benesch benesch deleted the merge-placeholder branch July 10, 2018 14:30