Description
This is an extraction of a private thread with @mknyszek.
While working with a customer that is running CockroachDB, we found a class of workloads that are severely (and reproducibly) impacted by GC assist. The situation occurs when an app opens many (O(10k)) SQL connections to CockroachDB and applies a moderately memory-intensive workload (O(1KB) reads and writes @ 30k qps). We've found that this leads to severe tail latency blips (p50=2ms, p99=70ms) and have pinpointed the effect to GC assist.
This effect is present with go1.17.11 and go1.19.1. It is also present on master with 4179552. However, the degraded tail latency disappears if GC assist is disabled by commenting out this line.
| name \ p99(ms) | 17.txt | 17-noassist.txt | 19.txt | 19-noassist.txt | 19-4179552.txt |
|---|---|---|---|---|---|
| kv95 | 54.5 ±46% | 2.5 ± 0% | 73.4 ±43% | 3.9 ±96% | 41.9 ± 5% |
Increasing GOGC does improve tail latency, but the improvement comes from running fewer GCs. While a GC is running, the impact on tail latency appears to be about the same.
| name \ p99(ms) | 17-gogc-300.txt | 19-gogc-300.txt | 19-4179552-gogc-300.txt |
|---|---|---|---|
| kv95 | 44.2 ±47% | 18.8 ±24% | 22.6 ±131% |
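For anyone re-running these experiments: the GOGC=300 variants above were presumably configured via the environment variable when launching the server; the in-process equivalent, useful for flipping the setting without a restart, is `runtime/debug.SetGCPercent`. A minimal sketch (not taken from the actual test harness):

```go
// Minimal sketch (not from the original test harness): adjust the GC target
// in-process. Equivalent to launching the binary with GOGC=300.
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// SetGCPercent returns the previous value, so the change can be logged
	// or reverted later.
	old := debug.SetGCPercent(300)
	fmt.Printf("GOGC changed from %d to 300\n", old)
}
```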
Go execution traces show GC assist kicking in across workload goroutines almost immediately (within a few ms) after the background GC starts. It then consumes the majority of on-CPU time on these goroutines for the duration of the background GC cycle.
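The report does not spell out how the traces were captured, so the following is just one minimal way to do it with `runtime/trace` (a hypothetical sketch; the `net/http/pprof` trace endpoint is another common route for long-running servers):

```go
// Hypothetical sketch of capturing a Go execution trace in-process; view the
// result with `go tool trace trace.out`.
package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// Run the workload of interest here; a few seconds is enough to span at
	// least one GC cycle under the load described above.
	time.Sleep(5 * time.Second)
}
```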
Here is a collection of gctraces from different variants of the test using `GODEBUG=gctrace=1,gcpacertrace=1`:
- go1.17_gogc_100.txt
- go1.17_gogc_300.txt
- go1.17_gogc_900.txt
- go1.19_gogc_100.txt
- go1.19_gogc_300.txt
- go1.19_gogc_900.txt
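Beyond gctrace, newer Go releases can report assist cost directly through `runtime/metrics` (for example `/cpu/classes/gc/mark/assist:cpu-seconds`); that metric is not available in the go1.17/go1.19 builds tested above, so this sketch simply enumerates whatever assist-related metrics the running toolchain supports and samples them once per second:

```go
// Hedged sketch: periodically sample any assist-related runtime/metrics the
// current toolchain exposes. On older releases the set may be empty.
package main

import (
	"fmt"
	"runtime/metrics"
	"strings"
	"time"
)

func main() {
	var samples []metrics.Sample
	for _, d := range metrics.All() {
		if strings.Contains(d.Name, "assist") {
			samples = append(samples, metrics.Sample{Name: d.Name})
		}
	}
	if len(samples) == 0 {
		fmt.Println("no assist-related metrics on this Go version")
		return
	}
	for {
		metrics.Read(samples)
		for _, s := range samples {
			switch s.Value.Kind() {
			case metrics.KindFloat64:
				fmt.Printf("%s = %f\n", s.Name, s.Value.Float64())
			case metrics.KindUint64:
				fmt.Printf("%s = %d\n", s.Name, s.Value.Uint64())
			}
		}
		time.Sleep(time.Second)
	}
}
```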
An interesting note is that the investigation began when we noticed higher tail latency when the same workload (30k qps) was split across more SQL connections. CockroachDB maintains two goroutines per active connection. An early test found the following:
| vCPUs | Go GC | SQL connections | p99 latency (ms) |
|---|---|---|---|
| 30 (1 socket) | Default | 512 | 2.0 |
| 30 (1 socket) | Default | 10,000 | 28.3 |
| 60 (2 sockets) | Default | 512 | 3.1 |
| 60 (2 sockets) | Default | 10,000 | 67.1 |
To reproduce
The easiest way to reproduce is with CockroachDB's internal cluster orchestration tool, roachprod. With that tool, the reproduction steps are:
export CLUSTER="${USER}-test"
roachprod create $CLUSTER -n2 --gce-machine-type='c2-standard-30' --local-ssd=false
roachprod stage $CLUSTER release v22.2.0-rc.3
roachprod start $CLUSTER:1
roachprod run $CLUSTER:2 -- "./cockroach workload run kv --init --read-percent=95 --min-block-bytes=1024 --max-block-bytes=1024 --concurrency=10000 --max-rate=30000 --ramp=1m --duration=5m {pgurl:1}"
If roachprod is not an option, then the steps are:
- create a pair of `c2-standard-30` VM instances
- stage a CockroachDB binary on each
- start CockroachDB on the first VM
- run the following from the second VM:
./cockroach workload run kv --init --read-percent=95 --min-block-bytes=1024 --max-block-bytes=1024 --concurrency=10000 --max-rate=30000 --ramp=1m --duration=5m 'postgresql://root@<INSERT VM1 HOSTNAME HERE>:26257?sslmode=disable'
I'm happy to help get these reproduction steps working in other environments.
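If it helps, here is a rough pure-Go sketch of the allocation pattern (hypothetical, not part of the original reproduction): ~10k goroutines sharing a ~30k/sec rate of 1KB allocations against a few hundred MB of live heap, reporting p50/p99 of the per-request critical section. It only approximates the CockroachDB workload, but it may be a starting point for a self-contained reproducer.

```go
// Hypothetical sketch, not from the original report: approximate the kv
// workload's allocation pattern in pure Go and report tail latency of the
// per-request critical section, which GC assist directly inflates.
package main

import (
	"fmt"
	"math/rand"
	"runtime"
	"sort"
	"sync"
	"time"
)

const (
	workers    = 10000            // roughly the 10k connection goroutines
	totalQPS   = 30000            // aggregate request rate
	blockBytes = 1024             // ~1KB payload per request
	runFor     = 30 * time.Second // measurement window
)

// Fixed-size pool of live blocks: each new allocation retires a random old one,
// keeping the live heap roughly constant while still generating garbage.
var (
	liveMu sync.Mutex
	live   = make([][]byte, 1<<18) // ~256 MB of live 1KB blocks at steady state
)

func main() {
	var (
		latMu     sync.Mutex
		latencies []time.Duration
	)
	perWorker := time.Duration(int64(time.Second) * workers / totalQPS)
	deadline := time.Now().Add(runFor)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ticker := time.NewTicker(perWorker)
			defer ticker.Stop()
			for time.Now().Before(deadline) {
				<-ticker.C
				start := time.Now()
				buf := make([]byte, blockBytes) // allocation that may be charged assist work
				buf[0] = 1
				liveMu.Lock()
				live[rand.Intn(len(live))] = buf
				liveMu.Unlock()
				lat := time.Since(start)
				latMu.Lock()
				latencies = append(latencies, lat)
				latMu.Unlock()
			}
		}()
	}
	wg.Wait()

	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	fmt.Printf("GOMAXPROCS=%d samples=%d p50=%v p99=%v\n",
		runtime.GOMAXPROCS(0), len(latencies),
		latencies[len(latencies)/2], latencies[len(latencies)*99/100])
}
```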