runtime: aggressive gc assist with many goroutines #56966

This is an extraction of a private thread with @mknyszek.

While working with a customer running CockroachDB, we found a class of workloads that is severely (and reproducibly) impacted by GC assist. The situation occurs when an app opens many (O(10k)) SQL connections to CockroachDB and applies a moderately memory-intensive workload (O(1KB) reads and writes @ 30k qps). We've found that this leads to severe tail latency blips (p50=2ms, p99=70ms) and have pinpointed GC assist as the cause.

This effect is present with go1.17.11 and go1.19.1. It is also present on master at commit 4179552. However, the degraded tail latency disappears if GC assist is disabled by commenting out this line.

name \ p99(ms)  17.txt     17-noassist.txt  19.txt     19-noassist.txt  19-4179552.txt
kv95            54.5 ±46%         2.5 ± 0%  73.4 ±43%         3.9 ±96%       41.9 ± 5%

Increasing GOGC does improve tail latency, but the improvement comes from running fewer GCs; while a GC is running, the impact on tail latency appears to be about the same.

name \ p99(ms)  17-gogc-300.txt  19-gogc-300.txt  19-4179552-gogc-300.txt
kv95                  44.2 ±47%        18.8 ±24%               22.6 ±131%
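
For anyone repeating the GOGC experiments: GOGC can be raised via the environment (GOGC=300) or at runtime through the standard runtime/debug API. A minimal sketch of the latter (generic Go, not CockroachDB code):

package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent to starting the process with GOGC=300: the heap may
	// grow to roughly 4x the live set before the next GC is triggered.
	// SetGCPercent returns the previous setting (100 by default).
	prev := debug.SetGCPercent(300)
	fmt.Printf("GOGC changed from %d to 300\n", prev)
}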

Go execution traces show GC assist kicking in across workload goroutines almost immediately (within a few ms) after the background GC starts. It then consumes the majority of on-CPU time on these goroutines for the remainder of the background GC cycle.

[Execution trace screenshot (2022-11-28): GC assist dominating on-CPU time on workload goroutines while the background GC runs]
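
The screenshot above is from an execution trace; the same view can be produced for any Go program with the standard runtime/trace package and inspected with go tool trace, where assist shows up as mark-assist slices on the goroutine timelines. A minimal sketch (output filename and traced window are arbitrary):

package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

func main() {
	f, err := os.Create("trace.out") // arbitrary output path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// trace.Start records scheduler and GC events, including the
	// mark-assist work performed by individual goroutines.
	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	time.Sleep(10 * time.Second) // stand-in for the workload of interest
}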

Here is a collection of gctraces from different variants of the test, collected with GODEBUG=gctrace=1,gcpacertrace=1. In these traces, assist time is the first of the three mark/scan terms in each gc line's CPU breakdown; a programmatic alternative for collecting GC timings is sketched after the list:

go1.17_gogc_100.txt
go1.17_gogc_300.txt
go1.17_gogc_900.txt
go1.19_gogc_100.txt
go1.19_gogc_300.txt
go1.19_gogc_900.txt
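
As a complement to the gctrace logs, GC cycle counts and pause timings can also be read programmatically; note this does not break out assist time, which only appears in the gctrace CPU breakdown and in execution traces. A minimal sketch using runtime/debug.ReadGCStats (generic Go, not part of the reproduction):

package main

import (
	"fmt"
	"runtime/debug"
	"time"
)

func main() {
	var stats debug.GCStats
	for range time.Tick(5 * time.Second) {
		// ReadGCStats fills in the number of completed GC cycles and
		// recent pause durations; compare against the gctrace output.
		debug.ReadGCStats(&stats)
		fmt.Printf("cycles=%d lastGC=%s pauseTotal=%s\n",
			stats.NumGC, stats.LastGC.Format(time.RFC3339), stats.PauseTotal)
	}
}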

An interesting note: the investigation began when we noticed higher tail latency when the same workload (30k qps) was split across more SQL connections. CockroachDB maintains two goroutines per active connection. An early test found the following:

vCPUs           Go GC    SQL connections  p99 latency (ms)
30 (1 socket)   Default  512              2.0
30 (1 socket)   Default  10,000           28.3
60 (2 sockets)  Default  512              3.1
60 (2 sockets)  Default  10,000           67.1
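
Since the goroutine count appears central, a standalone approximation of the allocation pattern may help runtime investigation. A hedged sketch, with goroutine count, block size, and rate taken from the numbers above (this is not the CockroachDB reproduction, just a stand-in workload):

package main

import (
	"sync"
	"time"
)

// Approximates the reported allocation pattern: O(10k) goroutines each
// allocating ~1KB blocks, collectively bounded to ~30k ops/sec. Runs
// until killed; observe with GODEBUG=gctrace=1 or an execution trace.
func main() {
	const (
		workers   = 10000
		blockSize = 1024
		totalRate = 30000                            // ops/sec across all workers
		interval  = time.Second * workers / totalRate // per-goroutine pacing (~333ms)
	)

	var mu sync.Mutex
	sink := make([][]byte, 0, 1<<18) // retain blocks so the heap has a live set

	for i := 0; i < workers; i++ {
		go func() {
			t := time.NewTicker(interval)
			defer t.Stop()
			for range t.C {
				b := make([]byte, blockSize)
				mu.Lock()
				if len(sink) == cap(sink) {
					sink = sink[:0] // release old blocks as garbage
				}
				sink = append(sink, b)
				mu.Unlock()
			}
		}()
	}
	select {} // block forever
}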

To reproduce

The easiest way to reproduce is with CockroachDB's internal cluster orchestration tool, roachprod. With that tool, the reproduction steps are:

export CLUSTER="${USER}-test"
roachprod create $CLUSTER -n2 --gce-machine-type='c2-standard-30' --local-ssd=false
roachprod stage $CLUSTER release v22.2.0-rc.3
roachprod start $CLUSTER:1
roachprod run   $CLUSTER:2 -- "./cockroach workload run kv --init --read-percent=95 --min-block-bytes=1024 --max-block-bytes=1024 --concurrency=10000 --max-rate=30000 --ramp=1m --duration=5m {pgurl:1}"

If roachprod is not an option, then the steps are:

  1. create a pair of c2-standard-30 VM instances
  2. stage a CockroachDB binary on each
  3. start CockroachDB on the first VM
  4. run the following from the second VM:
./cockroach workload run kv --init --read-percent=95 --min-block-bytes=1024 --max-block-bytes=1024 --concurrency=10000 --max-rate=30000 --ramp=1m --duration=5m 'postgresql://root@<INSERT VM1 HOSTNAME HERE>:26257?sslmode=disable'

I'm happy to help get these reproduction steps working in other environments.
