
rpc: consider adopting drpc (12% improvement on sysbench oltp_read_write) #137252

@tbg

In the wake of disappointing gRPC performance (#136278 (comment)), we've started looking at alternatives.

drpc performed much better than grpc across the board in microbenchmarks[^1].

We then prototyped drpc for our critical BatchRequest-BatchResponse path in #136794. This prototype does not port the gRPC interceptors and custom codec to drpc (so we don't authorize drpc requests or use snappy compression), so the drpc numbers should be slightly discounted. In light of the definite win in the microbenchmarks though, it is likely that the comparison is generally valid.

The screenshot below compares the performance of grpc vs drpc on CockroachDB running ten-minute 3x8vcpu sysbench oltp_read_write workloads (100m rows total across 10 tables, with a number of cluster settings tuned to reduce variance).

grpc-pool is a recent, significant gRPC performance improvement achieved by re-using streams (#136648) and is now CockroachDB's default. drpc uses drpc instead of grpc but without stream pooling; drpc-pool additionally adopts the optimization from #136648. We can see that adopting drpc (including the streaming optimization) can be expected to boost qps by up to 12% on this workload.

[screenshot: sysbench oltp_read_write qps for grpc, grpc-pool, drpc, and drpc-pool]

This is a very significant improvement, so we should seriously consider moving off gRPC to a more performant option, with drpc as the current lead candidate. Despite drpc showing promise, we should also evaluate other options: frpc seemed "too alpha" and not as good a fit; connect tries hard to be a drop-in replacement for grpc but doesn't seem focused on performance either; and forking grpc would likely be more work than extending drpc as needed.

DRPC

drpc is a project by Storj but in practice appears to be primarily driven by a single author, meaning there's a much less robust community behind it (vs grpc). It's <4000 lines of code in an approachable, high-quality codebase, with only infrequent changes. There isn't much downside risk to adopting this codebase, unless we find that we need features that add significant complexity. One notable limitation is the absence of TCP connection sharing in drpc (i.e. it can't multiplex multiple clients over a single TCP conn). I'm not sure this is actually a downside or an immediate deal-breaker, but it is certainly something we have to evaluate, as it would lead to many more TCP connections being held active by any individual node.

The prototype hosts drpc and grpc side by side, so there is a clear migration path.

We'd need to find out what to do about grpc-gateway - likely we would want to remove it; drpc has a simple http-serving facility as well, see https://github.com/storj/drpc/blob/main/examples/drpc_and_http/server/main.go.

List of known limitations in the prototype/items that would be required to remove grpc:

  • codegen is manual; need to properly integrate drpc codegen into our bazel setup (the current generated code was copy-pasted from rpctoy and edited; see api_drpc_hacky.go in the prototype PR)
  • drpc only registers BatchRequest server (should be straightforward to fix after codegen)
  • extend drpc's gogoproto codec with MarshalAppend as done in https://github.com/cockroachlabs/perf-efficiency-team/pull/36#issuecomment-2547916166.
  • drpc conn pool needs to scale down more smartly.
  • drpc not fully integrated in rpc heartbeats; these currently use grpc always. (Need to make it dependent on cluster setting, i.e. achieve parity between the two).
  • need to set tlsConfig.InsecureSkipVerify; check that this is intentional (see TestTestClusterDRPC and (*peer).dial)
  • grpc server interceptors not implemented for drpc server[^2]
    • need to switch to drpc's metadata mechanism[^3]
  • grpc client interceptors not implemented (need to wrap the created clients directly)
  • port the existing metrics (including some cross-region metrics; maybe not all can be ported verbatim).
  • migration path to remove grpc-gateway
  • drpc connections are not torn down on client write failures[^4]; we may need to improve this for failure detection
  • need to evaluate whether drpc's one-tcp-conn-per-stream model is acceptable or whether conn sharing needs to be implemented
  • there's likely a bit of annoying work around grpc errors and error codes, i.e. what are the corresponding drpc versions of this, and maybe this will change too; need to look in more detail, but it's expected that something will break around rpc error detection, since a lot of it was customized for what we saw with grpc.
  • context deadline/cancellation propagation across drpc boundaries
  • the failover roachtest suite should not regress.
  • add a cluster setting that disables the grpc server for testing
  • the analysis has focused on microbenchmarks (across various payload sizes) and sysbench-on-CRDB (which has fairly small payloads). A wider round of benchmarking is advisable.
  • TestDRPCSelectQuery (introduced in all: host a DRPC server and use it for node-node BatchRequests #138368) doesn't check that drpc is actually used.
  • the supposed performance regression on the PR was not explained.
  • attemptJoinTo and maybe kvconnector code that directly uses gRPC will need to be made more flexible.

Epic: CRDB-42584

Jira issue: CRDB-45491

Footnotes

[^1]: https://github.com/cockroachlabs/perf-efficiency-team/pull/19#issue-2716024819

[^2]: see https://github.com/cockroachdb/cockroach/blob/fb2b661c8a15215c330b079a0a8e421184be29da/pkg/rpc/context.go#L94

[^3]: https://github.com/storj/drpc/blob/0075ac8716613e0869c41714a92cdaacf761416d/drpcmetadata/README.md

[^4]: https://github.com/cockroachlabs/perf-efficiency-team/blob/100ca12e298d6b92d062005bba8cdeecb212d65a/rpctoy/drpc_test.go#L102-L118; we can wrap the conn in something that closes the underlying tcp conn when RPCs return an error, so this could be fixed without a fork.
