
all: host a DRPC server and use it for node-node BatchRequests #138368

Merged
merged 10 commits into cockroachdb:master from chandrat-drpc-proto on Jan 14, 2025

Conversation

@cthumuluru-crdb (Contributor) commented Jan 7, 2025

drpc (github.com/storj/drpc) is a lightweight, drop-in replacement for gRPC. Initial benchmark results from the Perf Efficiency team, using a simple ping service (github.com/cockroachlabs/perf-efficiency-team/tree/main/rpctoy),
demonstrate that switching from gRPC to drpc delivers significant performance improvements.

$ benchdiff --old lastmerge ./pkg/sql/tests -b -r 'Sysbench/SQL/3node/oltp_read_write' -d 1000x -c 10

name                                   old time/op    new time/op    delta
Sysbench/SQL/3node/oltp_read_write-24    14.0ms ± 2%    14.1ms ± 1%     ~     (p=0.063 n=10+10)

name                                   old alloc/op   new alloc/op   delta
Sysbench/SQL/3node/oltp_read_write-24    2.55MB ± 1%    2.26MB ± 2%  -11.39%  (p=0.000 n=9+9)

name                                   old allocs/op  new allocs/op  delta
Sysbench/SQL/3node/oltp_read_write-24     17.2k ± 1%     11.7k ± 0%  -32.30%  (p=0.000 n=10+8)

This pull request introduces experimental support for using drpc to serve BatchRequests. The feature is disabled by default but can be enabled explicitly by setting the COCKROACH_EXPERIMENTAL_DRPC_ENABLED environment variable, or automatically in CRDB test builds. The prototype lacks key features such as authorization checks and interceptors, and is not production-ready.

When DRPC is disabled, mock server, muxer, and listener implementations are injected; when it is enabled, the real implementations are used. DRPC is only used for BatchRequests. This PR also adds support for pooled streams for unary operations, similar to #136648.

Added unit tests to ensure:

  • CRDB nodes can serve BatchRequests.
  • Simple queries work when DRPC is enabled or disabled.

Note: drpc protocol generation is not yet integrated into the build process. Protobuf files were generated manually and committed directly to the repository.

This PR has a few parts:

  • it adds a copy of drpc generated code for a service that supports unary BatchRequest.
  • in rpc.NewServerEx, we now also set up a drpc server and return it up the stack.
  • it registers the BatchRequest service with both DRPC and gRPC. If DRPC is enabled, it is
    used to make BatchRequests; otherwise gRPC is used.
  • in our listener setup, we configure cmux to match on the DRPC!!!1 header and serve the resulting listener on the drpc server. The drpc example uses drpcmigrate.ListenMux instead of cmux; we keep cmux but must make sure the header is read off the connection before delegating (which the drpcmigrate mux would have done for us). A rough sketch of such a matcher follows this list.
  • if using TLS, wrap the drpc listener with the TLS config and use it to serve DRPC requests.
  • add support to reuse drpc streams across unary BatchRequests. However, the DRPC implementation is not on par with the gRPC counterpart in terms of allocation optimizations.
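
For illustration, a minimal, hypothetical sketch of such a cmux matcher, assuming drpcmigrate.DRPCHeader holds the "DRPC!!!1" prefix that drpc clients send; the PR's actual matcher and the header-stripping step may differ:

package server

import (
    "io"

    "storj.io/drpc/drpcmigrate"
)

// drpcMatcher is a cmux matcher (func(io.Reader) bool) that reports whether
// an incoming connection starts with the drpc header. cmux buffers and
// replays the bytes read here, so the header still has to be consumed off
// the connection before it is handed to the drpc server.
func drpcMatcher(reader io.Reader) bool {
    buf := make([]byte, len(drpcmigrate.DRPCHeader))
    if _, err := io.ReadFull(reader, buf); err != nil {
        return false
    }
    return string(buf) == drpcmigrate.DRPCHeader
}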

Closes #136794.
Touches #137252.

Epic: CRDB-42584
Release note: None

blathers-crl bot commented Jan 7, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@tbg tbg self-requested a review January 7, 2025 15:13
@tbg tbg added the o-perf-efficiency (Related to performance efficiency) label Jan 7, 2025
@tbg (Member) left a comment

This looks great, thank you for putting this up so quickly! Most of my (many) comments are minor. The one meta question I have is around what all of these drpc parts do when drpc is off (i.e. in production). We don't host the server, but in rpc.Context we seem to make lots of drpc-related things anyway. Can we think about what a clean story should look like here? We definitely don't want any ill-fated attempts to establish drpc connections unless opting in has happened, and I don't want to rely on a haphazard set of checks for this. A solution that seems appealing to me is to have a mock implementation of drpcpool.Conn that is returned from dialDRPC if drpc is off. That impl would simply error out.

Something else that may make sense for the prototype is to send heartbeat pings over drpc, if drpc is enabled (drpc otherwise; check in each iteration of the loop). This would solidify the mixed mode by actively re-establishing all connections when the cluster setting is fixed. This will briefly interrupt service, but hey, that's totally fine for a prototype. (For production we'd need a more complicated lifecycle, where the drpc connection can be established separately, later. In fact, we may want to gate drpc behind a different connection class, so that each *Connection is exclusively drpc or grpc, i.e. they never mix; this will result in a clean model).

Together with the mock drpcpool.Conn we'd have the following behavior if the cluster setting is flipped half-way through:

  • setting off, uses grpc normally, all *Connections have a mock drpc conn
  • setting flipped: all connections break (due to errors from the mock drpc conn) and get reestablished right away
  • using drpc

Seems fine for a prototype.

One last (preemptive) request: could you not force-push this PR? Instead, add each change as a fixup commit (I like git commit --fixup :/<someregexpthatmatchesthecommitweultimatelywanttosquashthisinto>) and push it. That way, reviewable stays clean and I can review all changes in the order you make them. Before merging, we can then git rebase -i --autosquash and git diff against the un-rebased version to convince ourselves we're committing the same combined diff we reviewed.

Reviewed 3 of 3 files at r1, 2 of 2 files at r2, 3 of 3 files at r3, 2 of 2 files at r4, 4 of 4 files at r5, 6 of 6 files at r6, 4 of 4 files at r7, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cthumuluru-crdb)


-- commits line 49 at r5:
Is there a reason you're not introducing the test (which made an ad-hoc drpc client and sent a LeaseInfoRequest) from my prototype here or in the next commit? This would serve as validation that the drpc server setup actually works.


pkg/rpc/peer.go line 257 at r6 (raw file):

			return rpcCtx.grpcDialRaw(ctx, target, class, additionalDialOpts...)
		},
		dialDRPC: func(ctx context.Context, target string) (drpcpool.Conn, error) {

it makes sense to me to move this dial function to <something>_drpc.go. Maybe peer_drpc.go? Or a new dial_drpc.go.


pkg/rpc/peer.go line 441 at r6 (raw file):

}

// TODO(chandrat) Check if this works in mixed mode?

what's mixed mode? Also, I'm unclear whether you are planning to address this TODO in the PR. If it's in this PR, mind adding a markdown checkbox to the PR description instead and removing this TODO?

- [ ] Check if (*peer).runOnce works in mixed mode.


pkg/rpc/peer.go line 1013 at r6 (raw file):

}

type closeEntirePoolConn struct {

ditto for moving to a *_drpc.go file.


pkg/rpc/connection.go line 106 at r6 (raw file):

	// conn could be unhealthy (or there may not even be a conn, i.e. Err() !=
	// nil), if that's what the caller wanted (ConnectNoBreaker).
	return c.connFuture.Conn(), c.connFuture.DRPCConn(), c.connFuture.Err()

what happens here if drpc is disabled?


pkg/rpc/connection.go line 120 at r6 (raw file):

}

func (c *Connection) Connect2(ctx context.Context) (*grpc.ClientConn, drpcpool.Conn, error) {

This needs a comment and possibly a better name.

// Connect2 is like Connect, but returns both a grpc and drpc connection.


pkg/kv/kvpb/api_drpc_hacky.go line 1 at r5 (raw file):

// Code generated by protoc-gen-go-drpc. DO NOT EDIT.

Can you remove this "generated file" marker and replace it with an explanation of how this file was created? Something like:

// This file was created by generating an ad-hoc version of echo_gogo_drpc.go in [1] and changing service names and signatures appropriately.
// This code-gen should be automated at some point as part of productionizing drpc.
//
// [1]: https://github.com/cockroachlabs/perf-efficiency-team/tree/03c57531dcc25e768ecb564427a9d9fc02e4e8f7/rpctoy/toypb


pkg/kv/kvpb/api_drpc_hacky.go line 127 at r5 (raw file):

	switch n {
	case 0:
		return "/cockroach.roachpb.Internal/Batch", drpcEncoding_File_api_proto{},

There's a chance here to change the service names. We were stuck with cockroach.roachpb.Internal for gRPC (the service was originally part of roachpb but this got refactored at some point) but can use cockroach.kv.kvpb.Internal/Batch here.


pkg/server/start_listen.go line 138 at r4 (raw file):

	// Host drpc only if it's _possible_ to turn it on (this requires a test build
	// or env var). If the setting _is_ on, but it should not be possible for it

it is possible! If the env var is set, the setting will be on, because the env var also informs the default (unless I've missed how this works).


pkg/server/start_listen.go line 144 at r4 (raw file):

		rpc.ExperimentalDRPCEnabled.Get(&cfg.Settings.SV)

	// Make a listener that doesn't actually do anything. We will start the dRPC

nit: "If we're not hosting drpc, make a listener that never accepts anything. We will start ..."


pkg/server/start_listen.go line 229 at r4 (raw file):

}

var drpcMatcher = func(reader io.Reader) bool {

Thoughts about extracting all of this drpc-specific stuff to start_listen_drpc.go?


pkg/server/start_listen.go line 134 at r4 (raw file):

	}

	anyL := m.Match(cmux.Any())

not worth git surgery now (seriously 😄 ), but in an ideal world this rename could've been its own commit.


pkg/rpc/peer_drpc.go line 21 at r2 (raw file):

)

var envExperimentalDRPCEnabled = envutil.EnvOrDefaultBool("COCKROACH_EXPERIMENTAL_DRPC_ENABLED", true)

this should be false, right? And then be enabled selectively in tests via a testing knob.


pkg/rpc/peer_drpc.go line 30 at r2 (raw file):

	"rpc.experimental_drpc.enabled",
	"if true, use drpc to execute Batch RPCs (instead of gRPC)",
	metamorphic.ConstantWithTestBool("rpc.experimental_drpc.enabled", envExperimentalDRPCEnabled),

We should probably stay away from this. We don't want to randomly use drpc across the codebase. For example, consider a test that verifies authorization: it will randomly break with drpc. For now, I think we should default this to the env var verbatim without randomization. It does make sense to do a one-off test run at some point to see what tests break under drpc, but this isn't something we'd want to do as part of the prototype.
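
For concreteness, a sketch of what the non-randomized default could look like; the setting class and exact registration call here are assumptions based on the settings package, not this PR's final code:

package rpc

import (
    "github.com/cockroachdb/cockroach/pkg/settings"
    "github.com/cockroachdb/cockroach/pkg/util/envutil"
)

// envExperimentalDRPCEnabled mirrors the env var and defaults to false, so
// production clusters never opt in implicitly.
var envExperimentalDRPCEnabled = envutil.EnvOrDefaultBool(
    "COCKROACH_EXPERIMENTAL_DRPC_ENABLED", false)

// ExperimentalDRPCEnabled takes its default verbatim from the env var,
// with no metamorphic randomization.
var ExperimentalDRPCEnabled = settings.RegisterBoolSetting(
    settings.ApplicationLevel,
    "rpc.experimental_drpc.enabled",
    "if true, use drpc to execute Batch RPCs (instead of gRPC)",
    envExperimentalDRPCEnabled,
)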


pkg/rpc/nodedialer/nodedialer.go line 161 at r7 (raw file):

			RestrictedInternalClient: client, // for RangeFeed only
			drpcClient:               kvpb.NewDRPCInternalClient(dconn),
			drpcStreamPool:           drpcBatchStreamPool,

do we do this unconditionally? Note that for grpc we're checking shouldUseBatchStreamPoolClient. We shouldn't use the pool client unless the corresponding setting is set.

I would also leave a comment here that the gRPC version of the stream reuse is currently more allocation-optimized in that the batch stream pool itself implements rpc.RestrictedInternalClient, whereas here we allocate a new throw-away unaryDRPCBatchServiceToInternalAdapter. This is fine - but it's worth leaving a note about it. I didn't pursue this more due to the need to keep the original client around for the RangeFeed, which isn't supported on drpc. Once it does support it, I suspect we could avoid the allocation in the same way gRPC does.


pkg/kv/kvpb/BUILD.bazel line 21 at r5 (raw file):

        ":gen-batch-generated",  # keep
        ":gen-errordetailtype-stringer",  # keep
        ":gen-method-stringer",  # keep

why this movement?


pkg/rpc/context.go line 84 at r3 (raw file):

}

type DrpcServer struct {

nit: Go style guide suggests that this should be DRPCServer.


pkg/rpc/context.go line 209 at r3 (raw file):

}

func newDRPCServer(ctx context.Context, rpcCtx *Context) (*DrpcServer, error) {

do you think it makes sense to put this into context_drpc.go?


pkg/rpc/nodedialer/nodedialer.go line 172 at r6 (raw file):

}

type unaryDRPCBatchServiceToInternalAdapter struct {

Would also extract this to a _drpc.go file.


DEPS.bzl line 9229 at r1 (raw file):

        build_file_proto_mode = "disable_global",
        importpath = "github.com/zeebo/errs",
        # TODO: mirror this repo (to fix, run `./dev generate bazel --mirror`)

this still needs to be done (in a fixup commit; let's not rewrite history in this PR):

./dev gen bazel --mirror
git add DEPS.bzl
git commit --fixup :/add.drpc.dep
git push

@cthumuluru-crdb (Contributor Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/kv/kvpb/BUILD.bazel line 21 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

why this movement?

Sorry, what was this comment about?

@tbg (Member) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cthumuluru-crdb)


pkg/kv/kvpb/BUILD.bazel line 21 at r5 (raw file):

Previously, cthumuluru-crdb (Chandra Thumuluru) wrote…

Sorry, what was this comment about?

why did these lines:

[screenshot of the moved BUILD.bazel entries]

move up? Was this just spurious cleanup?

@tbg (Member) left a comment

Reviewed 2 of 2 files at r8, 1 of 1 files at r9, 2 of 2 files at r10, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cthumuluru-crdb)


pkg/kv/kvpb/api_drpc_hacky.go line 10 at r9 (raw file):

// 	package cockroach.kv.kvpb;
// 	option go_package = "github.com/cockroachdb/cockroach/pkg/kv/kvpb";
// 	service Batch {

Is the idea that when we productionize this and host drpc for all RPCs, we make different services for the rest? In other words, are you thinking we'll hop on the opportunity to decompose the grab bag that is service Internal [1]? I'm up for that; consider leaving a comment. I don't see any downsides to doing this, but it has advantages (we get more, smaller interfaces, and so we will have fewer half-implemented versions lying around in testing).

[1]: https://github.com/cockroachdb/cockroach/blob/fe56ffe13339201e50c150598006e7c9b10a5e13/pkg/kv/kvpb/api.proto#L3679-L3725

@cthumuluru-crdb (Contributor Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/rpc/context.go line 209 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

do you think it makes sense to put this into context_drpc.go?

I created a new file, drpc.go, in the rpc package and moved it there.


pkg/rpc/peer_drpc.go line 21 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

this should be false, right? And then be enabled selectively in tests via a testing knob.

I took this change as-is from the original prototype PR. I was planning to have a similar discussion with you. Fixed it now.


pkg/rpc/peer_drpc.go line 30 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

We should probably stay away from this. We don't want to randomly use drpc across the codebase. For example, consider a test that verifies authorization: it will randomly break with drpc. For now, I think we should default this to the env var verbatim without randomization. It does make sense to do a one-off test run at some point to see what tests break under drpc, but this isn't something we'd want to do as part of the prototype.

I took this change as-is from the original prototype PR. I was planning to have a similar discussion with you. Fixed it now.


pkg/rpc/nodedialer/nodedialer.go line 161 at r7 (raw file):

do we do this unconditionally?

Since we would like all of the drpc code paths to be behind a flag, I think it is okay to do this unconditionally for now.

I would also leave a comment here that the gRPC version of the stream reuse is currently more allocation-optimized in that the batch stream pool itself implements rpc.RestrictedInternalClient, whereas here we allocate a new throw-away unaryDRPCBatchServiceToInternalAdapter. This is fine - but it's worth leaving a note about it. I didn't pursue this more due to the need to keep the original client around for the RangeFeed, which isn't supported on drpc. Once it does support it, I suspect we could avoid the allocation in the same way gRPC does.

Done


pkg/server/start_listen.go line 134 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

not worth git surgery now (seriously 😄 ), but in an ideal world this rename could've been its own commit.

Yeah, you are right. I wish I had done that in a separate commit.


pkg/server/start_listen.go line 138 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

it is possible! If the env var is set, the setting will be on, because the env var also informs the default (unless I've missed how this works).

Updated the comment. If the environment variable is enabled or the build is a test build, we host the DRPC server.

@tbg (Member) left a comment

Looking good! We're closing in.

Reviewed 2 of 2 files at r11, 2 of 2 files at r12, 1 of 1 files at r13, 2 of 2 files at r14, 1 of 1 files at r15, 6 of 6 files at r16, 1 of 1 files at r17, 2 of 2 files at r18, 5 of 5 files at r19, 1 of 1 files at r20, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cthumuluru-crdb)


pkg/server/drpc_test.go line 34 at r20 (raw file):

// serves BatchRequest. It doesn't verify that nodes use drpc to communicate with
// each other.
func TestTestClusterDRPC(t *testing.T) {

I regretted that name the moment I typed it, how about TestDRPCBatchServer


pkg/server/drpc_test.go line 63 at r20 (raw file):

			tlsCfg, err := cm.GetNodeClientTLSConfig()

			// manager closed: tls: either ServerName or InsecureSkipVerify must be

This comment could use a few lines of explainer that this is a workaround and we should understand how this should actually work.
Or refer to the other place that does this (in rpc.Context?) and make sure there's a good comment there.
Basically this is something we'd need to look into as we productionize. It's probably something we just got away with in testing since grpc fills in a ServerName for us if none is provided. That is then likely the right thing for us to do as well, rather than setting this flag.

Either way, just something to document for now, not to fix.
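
To make the workaround concrete, here is a hypothetical, self-contained helper of the shape being discussed (not the test's actual code): crypto/tls refuses a client config with neither ServerName nor InsecureSkipVerify set, so the test has to fill one in before dialing the drpc port directly.

package server

import "crypto/tls"

// dialDRPCForTest clones the node client TLS config and, if needed, sets a
// ServerName so tls.Dial does not fail with "either ServerName or
// InsecureSkipVerify must be specified". Setting ServerName keeps
// certificate verification enabled, unlike InsecureSkipVerify.
func dialDRPCForTest(addr, serverName string, base *tls.Config) (*tls.Conn, error) {
    cfg := base.Clone()
    if cfg.ServerName == "" && !cfg.InsecureSkipVerify {
        cfg.ServerName = serverName
    }
    return tls.Dial("tcp", addr, cfg)
}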

@cthumuluru-crdb (Contributor Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


-- commits line 49 at r5:

Previously, tbg (Tobias Grieger) wrote…

Is there a reason you're not introducing the test (which made an ad-hoc drpc client and sent a LeaseInfoRequest) from my prototype here or in the next commit? This would serve as validation that the drpc server setup actually works.

Added the test from your draft PR.

@tbg (Member) left a comment

Reviewed 2 of 2 files at r21, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cthumuluru-crdb)


-- commits line 49 at r5:

Previously, cthumuluru-crdb (Chandra Thumuluru) wrote…

Added the test from your draft PR.

Great. We'll also want one that runs a TestCluster with drpc enabled, to test the dialing through rpc.Context. I don't think I added that in the prototype, since I had drpc on by default and so there were plenty of tests I could use for coverage.

@cthumuluru-crdb (Contributor Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/server/drpc_test.go line 34 at r20 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I regretted that name the moment I typed it, how about TestDRPCBatchServer

Done


pkg/server/drpc_test.go line 63 at r20 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This comment could use a few lines of explainer that this is a workaround and we should understand how this should actually work.
Or refer to the other place that does this (in rpc.Context?) and make sure there's a good comment there.
Basically this is something we'd need to look into as we productionize. It's probably something we just got away with in testing since grpc fills in a ServerName for us if none is provided. That is then likely the right thing for us to do as well, rather than setting this flag.

Either way, just something to document for now, not to fix.

I fixed the test. Can you please check now?

@tbg (Member) left a comment

Reviewed 1 of 1 files at r22, 2 of 2 files at r23, 1 of 1 files at r24, 1 of 1 files at r25, 4 of 4 files at r26, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cthumuluru-crdb)


pkg/rpc/drpc.go line 28 at r26 (raw file):

type DRPCServer struct {
	Srv    *drpcServerWrapper

I would have expected an interface here instead of the wrapper. Can you give this a try unless you have already and found a roadblock, in which case please explain?

*drpcserver.Server would be one implementation, and type drpcDisabledServer struct{} would be the other implementation.

Btw I think the wrapped noop implementation you have right now does not do what you want, since it does not return errors (despite what the comments claim).
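
For reference, a rough sketch of the interface shape being suggested (names here are assumptions): *drpcserver.Server already provides Serve and ServeOne with these signatures, so it would satisfy the interface, while the disabled implementation simply errors out.

package rpc

import (
    "context"
    "net"

    "github.com/cockroachdb/errors"
    "storj.io/drpc"
)

// drpcServerI is the subset of *drpcserver.Server that the rpc package needs.
type drpcServerI interface {
    Serve(ctx context.Context, lis net.Listener) error
    ServeOne(ctx context.Context, tr drpc.Transport) error
}

// drpcDisabledServer is the implementation used when drpc is off; every
// call fails instead of silently doing nothing.
type drpcDisabledServer struct{}

func (drpcDisabledServer) Serve(context.Context, net.Listener) error {
    return errors.New("drpc is disabled")
}

func (drpcDisabledServer) ServeOne(context.Context, drpc.Transport) error {
    return errors.New("drpc is disabled")
}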


pkg/rpc/drpc.go line 34 at r26 (raw file):

func newDRPCServer(_ context.Context, rpcCtx *Context) (*DRPCServer, error) {
	var dmuxw *drpcMuxWrapper = &drpcMuxWrapper{}

nit: you don't need to specify *drpcMuxWrapper and *drpcServerWrapper below.

@cthumuluru-crdb (Contributor Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/rpc/drpc.go line 28 at r26 (raw file):
I considered the interface approach too, but I didn't see a clear benefit over the wrapper approach. With interfaces we would have to implement one mock and an adapter to the DRPC implementation. Do you see any downside with the wrapper?

Btw I think the wrapped noop implementation you have right now does not do what you want, since it does not return errors (despite what the comments claim).

Hmm, it was lost in testing. I'll fix ServeOne() to return an error. Serve() propagates the error it receives from the mock listener.

@tbg (Member) left a comment

Reviewed 1 of 1 files at r27, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cthumuluru-crdb)


pkg/rpc/drpc.go line 28 at r26 (raw file):

Do you see any downside with wrapper?

It's a little more error prone, leaks into the production stack traces, and has more code. I pushed a commit.

@tbg tbg marked this pull request as ready for review January 13, 2025 11:41
@tbg tbg requested review from a team as code owners January 13, 2025 11:41
@cthumuluru-crdb (Contributor Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


-- commits line 49 at r5:

Previously, tbg (Tobias Grieger) wrote…

Great. We'll also want one that runs a TestCluster with drpc enabled, to test the dialing through rpc.Context. I don't think I added that in the prototype, since I had drpc on by default and so there were plenty of tests I could use for coverage.

Done.

`drpc` ([github.com/storj/drpc](https://github.com/storj/drpc)) is a lightweight, drop-in replacement for gRPC. Based on initial benchmark results from the Perf Efficiency team using a simple ping service ([github.com/cockroachlabs/perf-efficiency-team/tree/main/rpctoy](https://github.com/cockroachlabs/perf-efficiency-team/tree/main/rpctoy)), switching from gRPC to drpc offers significant performance gains. This commit introduces a dependency on drpc to facilitate faster experimentation and iteration on the prototype.

Epic: None
Release note: None
DRPC is an experimental feature and is not yet ready for production. It
doesn't implement authorization checks. This PR introduces an application
setting that can be activated only in test builds or by explicitly setting
an environment variable to opt in.

Epic: None
Release note: None
Set up the dRPC server and pass it up the stack using the gRPC
server wrapper. Although this approach is somewhat awkward, it
is sufficient for the prototype and can be easily refined in
subsequent commits.

Epic: None
Release note: None
Start the DRPC listener to handle server connections. Depending on
whether the experiment is enabled, the listener will either be a
noop dummy listener or the actual implementation.

Epic: None
Release note: None
Register the KV batch and streaming batch methods from the internal service with
the DRPC server. After these server-side changes, when hosting DRPC is
enabled, clients can connect to KV nodes and make DRPC batch and
streaming batch requests. Client changes will be done in a separate
commit.

Epic: None
Release note: None
This commit allows `*rpc.Connection` to also maintain a `drpcpool.Conn`.
When drpc is enabled, clients have a choice to either use gRPC or drpc.

Epic: None
Release note: None
Use mock drpc server and muxer implementation when DRPC is disabled.

Epic: None
Release note: None
@cthumuluru-crdb cthumuluru-crdb changed the title all: add drpc dependency all: host DRPC server and use it for node-node BatchRequests Jan 13, 2025
@cthumuluru-crdb cthumuluru-crdb changed the title all: host DRPC server and use it for node-node BatchRequests all: host a DRPC server and use it for node-node BatchRequests Jan 13, 2025
cthumuluru-crdb and others added 2 commits January 14, 2025 09:25
It wasn't actually exercising drpc. This is because by the time the env var is
overridden, the previous value has already been picked up by the cluster
setting.

Changed the test to override the setting instead.

Also, removed the drpc-off run of the test: all other tests already run with
drpc off, and the test takes a second or two, so it's not worth it. Instead, added
a secure flavor.

This secure flavor promptly failed (once drpc was actually enabled) because it
needed the same hack as the original prototype[1], which I then proceeded to
add.

[1]: https://github.com/cockroachdb/cockroach/pull/136794/files#diff-fc54146a422d3d59215aae7d10ecb55771db57f0f49b0b0fa34cd64358979ca0R293
@tbg tbg force-pushed the chandrat-drpc-proto branch from 3b07ca5 to b0ea57c on January 14, 2025 09:01
@tbg (Member) commented Jan 14, 2025

benchdiff shows a regression when drpc is not enabled (on time/op, the allocations look slightly higher though it's inconclusive):

I suggest we merge this anyway and sort this out after the branch cut. Likely we've slipped in a few allocations. CPU and alloc profile attached.

benchdiff.tgz

old:  f9df57e Merge #137984
new:  b0ea57c server: fix drpc end-to-end test
args: benchdiff "--old" "lastmerge" "./pkg/sql/tests" "-b" "-r" "Sysbench/SQL/3node/oltp_read_write" "-d" "1000x" "-c" "10"

name                                   old time/op    new time/op    delta
Sysbench/SQL/3node/oltp_read_write-24    11.8ms ± 1%    11.9ms ± 1%  +0.93%  (p=0.001 n=9+10)

name                                   old alloc/op   new alloc/op   delta
Sysbench/SQL/3node/oltp_read_write-24    2.19MB ± 2%    2.20MB ± 3%    ~     (p=0.529 n=10+10)

name                                   old allocs/op  new allocs/op  delta
Sysbench/SQL/3node/oltp_read_write-24     10.8k ± 2%     10.9k ± 2%    ~     (p=0.184 n=10+10)

edit: I looked at both the mem and CPU profiles and I'm not seeing any unexpected allocations in the rpc package. I also vetted the BatchStream Node endpoint refactor and am not seeing unexpected overhead there either. In short, it's a bit of a mystery at this point.

One interesting view is

pprof -http :6060 -diff_base benchdiff/f9df57e/artifacts/mem_Sysbench_SQL_3node_oltp_read_write.prof -normalize benchdiff/b0ea57c/artifacts/mem_Sysbench_SQL_3node_oltp_read_write.

which compares allocations before and after the change (red are new allocs, green are old allocs). We can see that the main difference that sticks out is the refactor that allows both the drpc and grpc streaming batch server to be implemented by the same kvserver.Node function. I've checked the source output of the new profile and I don't see new allocations there though, so the visual seems to be due to the changes in stacks only.

[screenshot: pprof diff of allocations before and after the change]

These also have some amount of noise in them (it's from a single run of each), though the runs overall indicate we're allocating a little more in the new code. I'm unsure what to make of that.

@cthumuluru-crdb (Contributor Author) commented:

bors r+

craig bot commented Jan 14, 2025

@craig craig bot merged commit f4ad047 into cockroachdb:master Jan 14, 2025
21 of 22 checks passed
@tbg (Member) commented Jan 14, 2025

Ran 20x to see if the result changes, but still, a solid lead for master without this code.

old:  f9df57e Merge #137984
new:  b0ea57c server: fix drpc end-to-end test
args: benchdiff "-o" "lastmerge" "-p" "2025-01-14T12_06_34Z"

Found previous run; old=benchdiff/f9df57e/artifacts/out.2025-01-14T12_06_34Z, new=benchdiff/b0ea57c/artifacts/out.2025-01-14T12_06_34Z
name                                   old time/op    new time/op    delta
Sysbench/SQL/3node/oltp_read_write-24    12.1ms ± 1%    12.2ms ± 1%  +1.16%  (p=0.000 n=20+20)

name                                   old alloc/op   new alloc/op   delta
Sysbench/SQL/3node/oltp_read_write-24    2.20MB ± 3%    2.20MB ± 3%    ~     (p=0.799 n=20+20)

name                                   old allocs/op  new allocs/op  delta
Sysbench/SQL/3node/oltp_read_write-24     10.9k ± 2%     10.9k ± 2%    ~     (p=0.417 n=20+20)

Labels
o-perf-efficiency Related to performance efficiency