all: host a DRPC server and use it for node-node BatchRequests #138368
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Force-pushed from f2e9f41 to dc42fec.
This looks great, thank you for putting this up so quickly! Most of my (many) comments are minor. The one meta question I have is around what all of these drpc parts do when drpc is off (i.e. in production). We don't host the server, but in rpc.Context we seem to create lots of drpc-related things anyway. Can we think about what a clean story should look like here? We definitely don't want any ill-fated attempts to establish drpc connections unless opting in has happened, and I don't want to rely on a haphazard set of checks for this. A solution that seems appealing to me is to have a mock implementation of drpcpool.Conn that is returned from dialDRPC if drpc is off. That impl would simply error out (a sketch of such a mock is below).
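Something like this minimal sketch, assuming the mock lives in the rpc package and that drpcpool.Conn's method set matches storj.io/drpc's drpc.Conn (any extra methods would be stubbed the same way):

```go
package rpc

// A minimal sketch (not the PR's code) of the suggested mock: a connection
// stub returned from dialDRPC when drpc is off, whose RPC methods always
// fail. It is written against the method set of storj.io/drpc's drpc.Conn;
// whatever additional methods drpcpool.Conn requires would be stubbed the
// same way. All names here are illustrative assumptions.

import (
	"context"

	"github.com/cockroachdb/errors"
	"storj.io/drpc"
)

var errDRPCDisabled = errors.New("drpc is not enabled")

type disabledDRPCConn struct {
	closedCh chan struct{} // closed up front: this conn is never usable
}

func newDisabledDRPCConn() *disabledDRPCConn {
	ch := make(chan struct{})
	close(ch)
	return &disabledDRPCConn{closedCh: ch}
}

func (c *disabledDRPCConn) Close() error            { return nil }
func (c *disabledDRPCConn) Closed() <-chan struct{} { return c.closedCh }

func (c *disabledDRPCConn) Invoke(
	_ context.Context, _ string, _ drpc.Encoding, _, _ drpc.Message,
) error {
	return errDRPCDisabled
}

func (c *disabledDRPCConn) NewStream(
	_ context.Context, _ string, _ drpc.Encoding,
) (drpc.Stream, error) {
	return nil, errDRPCDisabled
}
```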
Something else that may make sense for the prototype is to send heartbeat pings over drpc if drpc is enabled (grpc otherwise; check in each iteration of the loop). This would solidify the mixed mode by actively re-establishing all connections when the cluster setting is flipped. This will briefly interrupt service, but hey, that's totally fine for a prototype. (For production we'd need a more complicated lifecycle, where the drpc connection can be established separately, later. In fact, we may want to gate drpc behind a different connection class, so that each *Connection is exclusively drpc or grpc, i.e. they never mix; this would result in a clean model.)
Together with the mock drpcpool.Conn we'd have the following behavior if the cluster setting is flipped half-way through:
- setting off, uses grpc normally, all *Connections have a mock drpc conn
- setting flipped: all connections break (due to errors from the mock drpc conn) and get reestablished right away
- from then on, drpc is used
Seems fine for a prototype.
One last (preemptive) request: could you not force-push this PR? Instead, add each change as a fixup commit (I like git commit --fixup :/<someregexpthatmatchesthecommitweultimatelywanttosquashthisinto>
) and push it. That way, reviewable stays clean and I can review all changes in the order you make them. Before merging, we can then git rebase -i --autosquash
and git diff
against the un-rebased version to convince ourselves we're committing the same combined diff we reviewed.
Reviewed 3 of 3 files at r1, 2 of 2 files at r2, 3 of 3 files at r3, 2 of 2 files at r4, 4 of 4 files at r5, 6 of 6 files at r6, 4 of 4 files at r7, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cthumuluru-crdb)
-- commits
line 49 at r5:
Is there a reason you're not introducing the test (which made an ad-hoc drpc client and sent a LeaseInfoRequest) from my prototype here or in the next commit? This would serve as validation that the drpc server setup actually works.
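For context, a rough sketch of what such an ad-hoc drpc client test could look like; the header handling, TLS setup, and client method signatures are all assumptions, not the prototype's actual code:

```go
package server_test

// A rough, hypothetical sketch of such an ad-hoc drpc client test: dial the
// node, announce the drpc header so the mux routes the connection to the
// drpc server, and send a BatchRequest wrapping a LeaseInfoRequest. The
// wiring details here are assumptions rather than the prototype's code.

import (
	"context"
	"net"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/kv/kvpb"
	"github.com/stretchr/testify/require"
	"storj.io/drpc/drpcconn"
)

func sendLeaseInfoOverDRPC(t *testing.T, addr string) {
	rawConn, err := net.Dial("tcp", addr)
	require.NoError(t, err)
	defer rawConn.Close()

	// The server-side mux only hands the connection to drpc after seeing
	// this prefix (per the PR description).
	_, err = rawConn.Write([]byte("DRPC!!!1"))
	require.NoError(t, err)

	conn := drpcconn.New(rawConn)
	defer conn.Close()

	client := kvpb.NewDRPCInternalClient(conn)
	ba := &kvpb.BatchRequest{}
	// A real test would populate the request header (key/range) properly;
	// this only shows the wiring.
	ba.Add(&kvpb.LeaseInfoRequest{})
	br, err := client.Batch(context.Background(), ba)
	require.NoError(t, err)
	require.NotNil(t, br)
}
```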
pkg/rpc/peer.go
line 257 at r6 (raw file):
    return rpcCtx.grpcDialRaw(ctx, target, class, additionalDialOpts...)
},
dialDRPC: func(ctx context.Context, target string) (drpcpool.Conn, error) {
it makes sense to me to move this dial function to <something>_drpc.go. Maybe peer_drpc.go? Or a new dial_drpc.go.
pkg/rpc/peer.go
line 441 at r6 (raw file):
}
// TODO(chandrat) Check if this works in mixed mode?
what's mixed mode? Also, I'm unclear whether you are planning to address this TODO in the PR. If it's in this PR, mind adding a markdown checkbox to the PR description instead and removing this TODO?
- [ ] Check if (*peer).runOnce works in mixed mode.
pkg/rpc/peer.go
line 1013 at r6 (raw file):
}
type closeEntirePoolConn struct {
ditto for moving to a *_drpc.go file.
pkg/rpc/connection.go
line 106 at r6 (raw file):
// conn could be unhealthy (or there may not even be a conn, i.e. Err() !=
// nil), if that's what the caller wanted (ConnectNoBreaker).
return c.connFuture.Conn(), c.connFuture.DRPCConn(), c.connFuture.Err()
what happens here if drpc is disabled?
pkg/rpc/connection.go
line 120 at r6 (raw file):
}
func (c *Connection) Connect2(ctx context.Context) (*grpc.ClientConn, drpcpool.Conn, error) {
This needs a comment and possibly a better name.
// Connect2 is like Connect, but returns both a grpc and drpc connection.
pkg/kv/kvpb/api_drpc_hacky.go
line 1 at r5 (raw file):
// Code generated by protoc-gen-go-drpc. DO NOT EDIT.
Can you remove this "generated file" marker and replace it with an explanation of how this file was created? Something like:
// This file was created by generating an ad-hoc version of echo_gogo_drpc.go in [1] and changing service names and signatures appropriately.
// This code-gen should be automated at some point as part of productionizing drpc.
//
// [1]: https://github.com/cockroachlabs/perf-efficiency-team/tree/03c57531dcc25e768ecb564427a9d9fc02e4e8f7/rpctoy/toypb
pkg/kv/kvpb/api_drpc_hacky.go
line 127 at r5 (raw file):
switch n {
case 0:
    return "/cockroach.roachpb.Internal/Batch", drpcEncoding_File_api_proto{},
There's a chance here to change the service names. We were stuck with cockroach.roachpb.Internal for gRPC (the service was originally part of roachpb but this got refactored at some point) but can use cockroach.kv.kvpb.Internal/Batch here.
pkg/server/start_listen.go
line 138 at r4 (raw file):
// Host drpc only if it's _possible_ to turn it on (this requires a test build
// or env var). If the setting _is_ on, but it should not be possible for it
it is possible! If the env var is set, the setting will be on, because the env var also informs the default (unless I've missed how this works).
pkg/server/start_listen.go
line 144 at r4 (raw file):
rpc.ExperimentalDRPCEnabled.Get(&cfg.Settings.SV)
// Make a listener that doesn't actually do anything. We will start the dRPC
nit: "If we're not hosting drpc, make a listener that never accepts anything. We will start ..."
pkg/server/start_listen.go
line 229 at r4 (raw file):
}
var drpcMatcher = func(reader io.Reader) bool {
Thoughts about extracting all of this drpc-specific stuff to start_listen_drpc.go?
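As an aside, a rough sketch of the kind of matcher quoted above; the header value comes from the DRPC!!!1 prefix mentioned in the PR description, and everything else is an assumption rather than the PR's code:

```go
package server

// A minimal, hypothetical sketch of a cmux-style matcher for drpc
// connections (cmux matchers have the signature func(io.Reader) bool).
// It reports whether the connection starts with the drpc header prefix
// ("DRPC!!!1", per the PR description). cmux replays the peeked bytes, so
// the real code still has to consume the header before handing the
// connection to the drpc server.

import (
	"bytes"
	"io"
)

var drpcHeaderPrefix = []byte("DRPC!!!1")

func matchDRPC(reader io.Reader) bool {
	buf := make([]byte, len(drpcHeaderPrefix))
	if _, err := io.ReadFull(reader, buf); err != nil {
		return false
	}
	return bytes.Equal(buf, drpcHeaderPrefix)
}
```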
pkg/server/start_listen.go
line 134 at r4 (raw file):
}
anyL := m.Match(cmux.Any())
not worth git surgery now (seriously 😄 ), but in an ideal world this rename could've been its own commit.
pkg/rpc/peer_drpc.go
line 21 at r2 (raw file):
)
var envExperimentalDRPCEnabled = envutil.EnvOrDefaultBool("COCKROACH_EXPERIMENTAL_DRPC_ENABLED", true)
this should be false, right? And then be enabled selectively in tests via a testing knob.
pkg/rpc/peer_drpc.go
line 30 at r2 (raw file):
"rpc.experimental_drpc.enabled", "if true, use drpc to execute Batch RPCs (instead of gRPC)", metamorphic.ConstantWithTestBool("rpc.experimental_drpc.enabled", envExperimentalDRPCEnabled),
We should probably stay away from this. We don't want to randomly use drpc across the codebase. For example, consider a test that verifies authorization: it will randomly break with drpc. For now, I think we should default this to the env var verbatim without randomization. It does make sense to do a one-off test run at some point to see what tests break under drpc, but this isn't something we'd want to do as part of the prototype.
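A rough sketch of what "default this to the env var verbatim" could look like; the setting class and exact registration options are assumptions, not necessarily what the PR does:

```go
package rpc

// A hypothetical sketch of registering the setting with the env var as its
// default, without metamorphic randomization. The setting class and exact
// option set are assumptions, not the PR's actual code.

import (
	"github.com/cockroachdb/cockroach/pkg/settings"
	"github.com/cockroachdb/cockroach/pkg/util/envutil"
)

var envExperimentalDRPCEnabled = envutil.EnvOrDefaultBool(
	"COCKROACH_EXPERIMENTAL_DRPC_ENABLED", false)

var ExperimentalDRPCEnabled = settings.RegisterBoolSetting(
	settings.SystemVisible,
	"rpc.experimental_drpc.enabled",
	"if true, use drpc to execute Batch RPCs (instead of gRPC)",
	envExperimentalDRPCEnabled, // default follows the env var verbatim
)
```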
pkg/rpc/nodedialer/nodedialer.go
line 161 at r7 (raw file):
RestrictedInternalClient: client, // for RangeFeed only
drpcClient: kvpb.NewDRPCInternalClient(dconn),
drpcStreamPool: drpcBatchStreamPool,
do we do this unconditionally? Note that for grpc we're checking shouldUseBatchStreamPoolClient. We shouldn't use the pool client unless the corresponding setting is set.
I would also leave a comment here that the gRPC version of the stream reuse is currently more allocation-optimized, in that the batch stream pool itself implements rpc.RestrictedInternalClient, whereas here we allocate a new throw-away unaryDRPCBatchServiceToInternalAdapter. This is fine - but it's worth leaving a note about it. I didn't pursue this more due to the need to keep the original client around for the RangeFeed, which isn't supported on drpc. Once drpc supports it, I suspect we could avoid the allocation in the same way gRPC does.
pkg/kv/kvpb/BUILD.bazel
line 21 at r5 (raw file):
":gen-batch-generated",  # keep
":gen-errordetailtype-stringer",  # keep
":gen-method-stringer",  # keep
why this movement?
pkg/rpc/context.go
line 84 at r3 (raw file):
}
type DrpcServer struct {
nit: Go style guide suggests that this should be DRPCServer.
pkg/rpc/context.go
line 209 at r3 (raw file):
}
func newDRPCServer(ctx context.Context, rpcCtx *Context) (*DrpcServer, error) {
do you think it makes sense to put this into context_drpc.go?
pkg/rpc/nodedialer/nodedialer.go
line 172 at r6 (raw file):
}
type unaryDRPCBatchServiceToInternalAdapter struct {
Would also extract this to a _drpc.go file.
DEPS.bzl
line 9229 at r1 (raw file):
build_file_proto_mode = "disable_global",
importpath = "github.com/zeebo/errs",
# TODO: mirror this repo (to fix, run `./dev generate bazel --mirror`)
this still needs to be done (in a fixup commit, please; let's not rewrite history in this PR):
./dev gen bazel --mirror
git add DEPS.bzl
git commit --fixup :/add.drpc.dep
git push
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/kv/kvpb/BUILD.bazel
line 21 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
why this movement?
Sorry, what was this comment about?
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @cthumuluru-crdb)
pkg/kv/kvpb/BUILD.bazel
line 21 at r5 (raw file):
Previously, cthumuluru-crdb (Chandra Thumuluru) wrote…
Sorry, what was this comment about?
why did these lines move up? Was this just spurious cleanup?
Reviewed 2 of 2 files at r8, 1 of 1 files at r9, 2 of 2 files at r10, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cthumuluru-crdb)
pkg/kv/kvpb/api_drpc_hacky.go
line 10 at r9 (raw file):
// package cockroach.kv.kvpb;
// option go_package = "github.com/cockroachdb/cockroach/pkg/kv/kvpb";
// service Batch {
Is the idea that when we productionize this and host drpc for all RPCs, we make different services for the rest? In other words, are you thinking we'll hop on the opportunity to decompose the grab bag that is service Internal? I'm up for that; consider leaving a comment. I don't see any downsides to doing this, but it has advantages (we get more, smaller interfaces, and so we will have fewer half-implemented versions lying around in testing).
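To make the "more, smaller interfaces" point concrete, a purely hypothetical sketch (none of these names are from the PR) of what per-area drpc services could look like:

```go
package kvpb

// Purely hypothetical sketch of the "more, smaller interfaces" idea: instead
// of one drpc service mirroring the whole Internal grab bag, each area gets
// its own narrow service interface. None of these names are from the PR.

import "context"

// DRPCBatchService carries only the Batch RPC.
type DRPCBatchService interface {
	Batch(ctx context.Context, req *BatchRequest) (*BatchResponse, error)
}

// DRPCRangeFeedService would carry only the rangefeed RPCs, so tests that
// need rangefeeds don't have to stub out unrelated methods (and vice versa).
type DRPCRangeFeedService interface {
	MuxRangeFeed(ctx context.Context) (MuxRangeFeedStream, error)
}

// MuxRangeFeedStream stands in for whatever stream type generated drpc code
// would use for the rangefeed.
type MuxRangeFeedStream interface {
	Send(*RangeFeedRequest) error
	Recv() (*MuxRangeFeedEvent, error)
}
```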
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/rpc/context.go
line 209 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
do you think it makes sense to put this into context_drpc.go?
I created a new file drpc.go in the rpc package and moved it there.
pkg/rpc/peer_drpc.go
line 21 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
this should be false, right? And then be enabled selectively in tests via a testing knob.
I took this change as-is from the original prototype PR. I was planning to have a similar discussion with you. Fixed it now.
pkg/rpc/peer_drpc.go
line 30 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
We should probably stay away from this. We don't want to randomly use drpc across the codebase. For example, consider a test that verifies authorization: it will randomly break with drpc. For now, I think we should default this to the env var verbatim without randomization. It does make sense to do a one-off test run at some point to see what tests break under drpc, but this isn't something we'd want to do as part of the prototype.
I took this change as-is from the original prototype PR. I was planning to have a similar discussion with you. Fixed it now.
pkg/rpc/nodedialer/nodedialer.go
line 161 at r7 (raw file):
do we do this unconditionally?
Since we would like the entire drpc code paths to be behind a flag, I think it is okay to do it unconditionally for now.
I would also leave a comment here that the gRPC version of the stream reuse is currently more allocation-optimized, in that the batch stream pool itself implements rpc.RestrictedInternalClient, whereas here we allocate a new throw-away unaryDRPCBatchServiceToInternalAdapter. This is fine - but it's worth leaving a note about it. I didn't pursue this more due to the need to keep the original client around for the RangeFeed, which isn't supported on drpc. Once drpc supports it, I suspect we could avoid the allocation in the same way gRPC does.
Done
pkg/server/start_listen.go
line 134 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
not worth git surgery now (seriously 😄 ), but in an ideal world this rename could've been its own commit.
Yeah, you are right. I wish I had done that in a separate commit.
pkg/server/start_listen.go
line 138 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
it is possible! If the env var is set, the setting will be on, because the env var also informs the default (unless I've missed how this works).
Updated the comment. If the environment variable is enabled or the build is a test build, we host the DRPC server.
Looking good! We're closing in.
Reviewed 2 of 2 files at r11, 2 of 2 files at r12, 1 of 1 files at r13, 2 of 2 files at r14, 1 of 1 files at r15, 6 of 6 files at r16, 1 of 1 files at r17, 2 of 2 files at r18, 5 of 5 files at r19, 1 of 1 files at r20, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cthumuluru-crdb)
pkg/server/drpc_test.go
line 34 at r20 (raw file):
// serves BatchRequest. It doesn't verify that nodes use drpc to communicate with
// each other.
func TestTestClusterDRPC(t *testing.T) {
I regretted that name the moment I typed it; how about TestDRPCBatchServer?
pkg/server/drpc_test.go
line 63 at r20 (raw file):
tlsCfg, err := cm.GetNodeClientTLSConfig()
// manager closed: tls: either ServerName or InsecureSkipVerify must be
This comment could use a few lines of explainer that this is a workaround and we should understand how this should actually work.
Or refer to the other place that does this (in rpc.Context?) and make sure there's a good comment there.
Basically this is something we'd need to look into as we productionize. It's probably something we just got away with in testing since grpc fills in a server for us if none is provided. That is then likely the right thing for us to do as well, rather than setting this flag.
Either way, just something to document for now, not to fix.
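For reference, a rough sketch of the "fill in ServerName ourselves" alternative alluded to above; the helper name and how the target address is obtained are assumptions:

```go
package server

// A minimal sketch, not the PR's code: instead of setting
// InsecureSkipVerify, clone the client TLS config and fill in ServerName
// from the target address, which is roughly what gRPC does for us when no
// ServerName is provided.

import (
	"crypto/tls"
	"net"
)

func tlsConfigForTarget(base *tls.Config, targetAddr string) (*tls.Config, error) {
	host, _, err := net.SplitHostPort(targetAddr)
	if err != nil {
		return nil, err
	}
	cfg := base.Clone()
	cfg.ServerName = host // satisfies "either ServerName or InsecureSkipVerify"
	return cfg, nil
}
```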
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @tbg)
Previously, tbg (Tobias Grieger) wrote…
Is there a reason you're not introducing the test (which made an ad-hoc drpc client and sent a LeaseInfoRequest) from my prototype here or in the next commit? This would serve as validation that the drpc server setup actually works.
Added the test from your draft PR.
Reviewed 2 of 2 files at r21, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cthumuluru-crdb)
Previously, cthumuluru-crdb (Chandra Thumuluru) wrote…
Added the test from your draft PR.
Great. We'll also want one that runs a TestCluster with drpc enabled, to test the dialing through rpc.Context. I don't think I added that in the prototype, since I had drpc on by default and so there were plenty of tests I could use for coverage.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/server/drpc_test.go
line 34 at r20 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I regretted that name the moment I typed it; how about TestDRPCBatchServer?
Done
pkg/server/drpc_test.go
line 63 at r20 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This comment could use a few lines of explainer that this is a workaround and we should understand how this should actually work.
Or refer to the other place that does this (in rpc.Context?) and make sure there's a good comment there.
Basically this is something we'd need to look into as we productionize. It's probably something we just got away with in testing since grpc fills in a server for us if none is provided. That is then likely the right thing for us to do as well, rather than setting this flag. Either way, just something to document for now, not to fix.
I fixed the test. Can you please check now?
Reviewed 1 of 1 files at r22, 2 of 2 files at r23, 1 of 1 files at r24, 1 of 1 files at r25, 4 of 4 files at r26, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cthumuluru-crdb)
pkg/rpc/drpc.go
line 28 at r26 (raw file):
type DRPCServer struct {
    Srv *drpcServerWrapper
I would have expected an interface here instead of the wrapper. Can you give this a try, unless you have already tried and found a roadblock, in which case please explain? *drpcserver.Server would be one implementation, and type drpcDisabledServer struct{} would be the other implementation.
Btw I think the wrapped noop implementation you have right now does not do what you want, since it does not return errors (despite what the comments claim).
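A rough sketch of the interface shape being suggested; the interface name, and the assumption that *drpcserver.Server's Serve method would satisfy it, are mine rather than the PR's:

```go
package rpc

// A hypothetical sketch of the interface-based approach: a small serving
// interface with *drpcserver.Server as the real implementation (assumed to
// have a compatible Serve method) and a disabled implementation that only
// returns errors.

import (
	"context"
	"net"

	"github.com/cockroachdb/errors"
)

// drpcServeable is the minimal surface the rest of the code needs.
type drpcServeable interface {
	Serve(ctx context.Context, lis net.Listener) error
}

// drpcDisabledServer is used when drpc is off; every call fails loudly
// instead of silently doing nothing.
type drpcDisabledServer struct{}

func (drpcDisabledServer) Serve(ctx context.Context, _ net.Listener) error {
	return errors.New("drpc is not enabled")
}
```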
pkg/rpc/drpc.go
line 34 at r26 (raw file):
func newDRPCServer(_ context.Context, rpcCtx *Context) (*DRPCServer, error) {
    var dmuxw *drpcMuxWrapper = &drpcMuxWrapper{}
nit: you don't need to specify *drpcMuxWrapper and *drpcServerWrapper below.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/rpc/drpc.go
line 28 at r26 (raw file):
I considered the interface approach too, but I didn't see a clear benefit over the wrapper approach. With interfaces we would have to implement one mock and an adapter to the DRPC implementation. Do you see any downside with the wrapper?
Btw I think the wrapped noop implementation you have right now does not do what you want, since it does not return errors (despite what the comments claim).
Hmm. It got lost in testing. I'll fix ServeOne() to return an error; Serve() already propagates the error it receives from the mock listener.
Reviewed 1 of 1 files at r27, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cthumuluru-crdb)
pkg/rpc/drpc.go
line 28 at r26 (raw file):
Do you see any downside with the wrapper?
It's a little more error prone, leaks into the production stack traces, and has more code. I pushed a commit.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @tbg)
Previously, tbg (Tobias Grieger) wrote…
Great. We'll also want one that runs a TestCluster with drpc enabled, to test the dialing through rpc.Context. I don't think I added that in the prototype, since I had drpc on by default and so there were plenty of tests I could use for coverage.
Done.
`drpc` ([github.com/storj/drpc](https://github.com/storj/drpc)) is a lightweight, drop-in replacement for gRPC. Based on the initial benchmark results from the Perf Efficiency team using a simple ping service ([github.com/cockroachlabs/perf-efficiency-team/tree/main/rpctoy](https://github.com/cockroachlabs/perf-efficiency-team/tree/main/rpctoy)), switching from gRPC to drpc offers significant performance gains. This commit introduces a dependency on drpc to facilitate faster experimentation and iteration on the prototype. Epic: None Release note: None
DRPC is an experimental feature and is not yet ready for production. It doesn't implement authorization checks. This PR introduces an application setting that can be activated only in test builds or by explicitly setting an environment variable to opt in. Epic: None Release note: None
Set up the dRPC server and pass it up the stack using the gRPC server wrapper. Although this approach is somewhat awkward, it is sufficient for the prototype and can be easily refined in subsequent commits. Epic: None Release note: None
Start the DRPC listener to handle server connections. Depending on whether the experiment is enabled, the listener will either be a noop dummy listener or the actual implementation. Epic: None Release note: None
Register KV batch and streaming batch methods from internal service with DRPC server. After these server side changes, when hosting DRPC is enabled, clients can connect to KV nodes and make DRPC batch and streaming batch requests. Client changes will be done in a separate commit. Epic: None Release note: None
This commit allows `*rpc.Connection` to also maintain a `drpcpool.Conn`. When drpc is enabled, clients have a choice to either use gRPC or drpc. Epic: None Release note: None
Use mock drpc server and muxer implementation when DRPC is disabled. Epic: None Release note: None
Force-pushed from 4dcb83f to 3b07ca5.
Epic: None Release note: None
It wasn't actually exercising drpc. This is because by the time the env var is overridden, the previous value has already been picked up by the cluster setting. Changed the test to override the setting instead. Also, removed the drpc-off run of the test: all other tests already run with drpc off, and the test takes a second or two, so it's not worth it. Instead, added a secure flavor. This secure flavor promptly failed (once drpc was actually enabled) because it needed the same hack as the original prototype[1], which I then proceeded to add. [1]: https://github.com/cockroachdb/cockroach/pull/136794/files#diff-fc54146a422d3d59215aae7d10ecb55771db57f0f49b0b0fa34cd64358979ca0R293
Force-pushed from 3b07ca5 to b0ea57c.
bors r+
Ran 20x to see if the result changes, but still, solid lead for master without this code.
drpc (github.com/storj/drpc) is a lightweight, drop-in replacement for gRPC. Initial benchmark results from the Perf Efficiency team, using a simple ping service (github.com/cockroachlabs/perf-efficiency-team/tree/main/rpctoy),
demonstrate that switching from gRPC to drpc delivers significant performance improvements.
This pull request introduces experimental support for using drpc to serve BatchRequests. The feature is disabled by default but can be enabled explicitly by setting the COCKROACH_EXPERIMENTAL_DRPC_ENABLED environment variable, or automatically in CRDB test builds. The prototype lacks key features, such as authorization checks and interceptors, and is not production-ready. When DRPC is disabled, server, muxer, and listener mock implementations are injected. If it is enabled, real implementations are used. DRPC is only used for BatchRequests. This PR also adds support for pooled streaming unary operations, similar to #136648.
Added unit tests to ensure:
Note: drpc protocol generation is not yet integrated into the build process. Protobuf files were generated manually and committed directly to the repository.
This PR has a few parts:
- In rpc.NewServerEx, we now also set up a drpc server and return it up the stack.
- When the experimental drpc setting is enabled, drpc is used to make BatchRequests; if not, gRPC is used.
- cmux matches on the DRPC!!!1 header and serves the resulting listener on the drpc server. The drpc example uses drpcmigrate.ListenMux instead of cmux; we keep cmux but must make sure the header is read off the connection before delegating (which the drpcmigrate mux would have done for us).
Closes #136794.
Touches #137252.
Epic: CRDB-42584
Release note: None