
rpc: consider adopting drpc (12% improvement on sysbench oltp_read_write) #137252

@tbg

In the wake of disappointing gRPC performance (#136278 (comment)), we've started looking at alternatives.

drpc performed much better than grpc across the board in microbenchmarks[^1].

We then prototyped drpc for our critical BatchRequest-BatchResponse path in #136794. This prototype does not port the gRPC interceptors and custom codec to drpc (so we don't authorize drpc requests or use snappy compression), so the drpc numbers should be slightly discounted. In light of the definite win in the microbenchmarks though, it is likely that the comparison is generally valid.

The screenshot below compares the performance of grpc vs drpc on CockroachDB running ten-minute 3x8vcpu sysbench oltp_read_write workloads (100m rows total across 10 tables, with a number of cluster settings tuned to reduce variance).

grpc-pool is a recent, significant gRPC performance improvement achieved by re-using streams (#136648) and is now CockroachDB's default. drpc uses drpc instead of grpc but without stream pooling; drpc-pool additionally adopts the optimization from #136648. We can see that adopting drpc (including the streaming optimization) can be expected to boost qps by up to 12% on this workload.

[screenshot: sysbench oltp_read_write qps for grpc, grpc-pool, drpc, and drpc-pool]

This is a very significant improvement, so we should seriously consider moving off gRPC to a more performant option, with drpc as the current lead candidate. Despite drpc showing promise, we should also evaluate other options: frpc seemed "too alpha" and not as good a fit; connect tries hard to be a drop-in replacement for grpc but doesn't seem focused on performance either; and forking grpc would likely be more work than extending drpc as needed.

DRPC

drpc is a project by Storj but in practice appears to be primarily driven by a single author, meaning there's a much less robust community behind it (vs grpc). It's <4000 lines of code in an approachable, high-quality codebase, with only infrequent changes. There isn't much downside risk to adopting this codebase, unless we find that we need features that add significant complexity. One notable limitation is the absence of TCP connection sharing in drpc (i.e. it can't multiplex multiple clients over a single TCP conn). I'm not sure this is actually a downside or an immediate deal-breaker, but it is certainly something we have to evaluate, as it would lead to many more TCP connections being held active by any individual node.

The prototype hosts drpc and grpc side by side, so there is a clear migration path.

We'd need to find out what to do about grpc-gateway - likely we would want to remove it; drpc has a simple http-serving facility as well, see https://github.com/storj/drpc/blob/main/examples/drpc_and_http/server/main.go.

List of known limitations in the prototype/items that would be required to remove grpc:

  • codegen is manual; need to properly integrate drpc codegen into our bazel setup (the current generated code was copy-pasted from rpctoy and edited; see api_drpc_hacky.go in the prototype PR)
  • drpc only registers BatchRequest server (should be straightforward to fix after codegen)
  • extend drpc's gogoproto codec with MarshalAppend as done in https://github.com/cockroachlabs/perf-efficiency-team/pull/36#issuecomment-2547916166.
  • drpc conn pool needs to scale down more smartly.
  • drpc not fully integrated in rpc heartbeats; these currently use grpc always. (Need to make it dependent on cluster setting, i.e. achieve parity between the two).
  • need to set tlsConfig.InsecureSkipVerify; check that this is intentional (see TestTestClusterDRPC and (*peer).dial)
  • grpc server interceptors not implemented for drpc server[^2]
    • need to switch to drpc's metadata mechanism[^3]
  • grpc client interceptors not implemented (need to wrap the created clients directly)
  • port the existing metrics (including some cross-region metrics; maybe not all can be ported verbatim).
  • migration path to remove grpc-gateway
  • drpc connections are not torn down on client write failures[^4]; we may need to improve this for failure detection
  • need to evaluate whether drpc's one-tcp-conn-per-stream model is acceptable or whether conn sharing needs to be implemented
  • there's likely a bit of annoying work around grpc errors and error codes, i.e. what are the corresponding drpc versions of this, and maybe this will change too; need to look in more detail, but it's expected that something will break around rpc error detection, since a lot of it was customized for what we saw with grpc.
  • context deadline/cancellation propagation across drpc boundaries
  • the failover roachtest suite should not regress.
  • add a cluster setting that disables the grpc server for testing
  • the analysis has focused on microbenchmarks (across various payload sizes) and sysbench-on-CRDB (which has fairly small payloads). A wider round of benchmarking is advisable.
  • TestDRPCSelectQuery (introduced in all: host a DRPC server and use it for node-node BatchRequests #138368) doesn't check that drpc is actually used.
  • the supposed performance regression on the PR was not explained.
  • attemptJoinTo and maybe kvconnector code that directly uses gRPC will need to be made more flexible.

Epic: CRDB-42584

Jira issue: CRDB-45491

Footnotes

[^1]: https://github.com/cockroachlabs/perf-efficiency-team/pull/19#issue-2716024819

[^2]: see https://github.com/cockroachdb/cockroach/blob/fb2b661c8a15215c330b079a0a8e421184be29da/pkg/rpc/context.go#L94

[^3]: https://github.com/storj/drpc/blob/0075ac8716613e0869c41714a92cdaacf761416d/drpcmetadata/README.md

[^4]: https://github.com/cockroachlabs/perf-efficiency-team/blob/100ca12e298d6b92d062005bba8cdeecb212d65a/rpctoy/drpc_test.go#L102-L118; we can wrap the conn in something that closes the underlying tcp conn when RPCs return an error, so this could be fixed without a fork.
