Description
What version of Go are you using (`go version`)?
```
go version devel +41908a5 Thu Dec 1 02:54:21 2016 +0000 darwin/amd64
```

a.k.a. go1.8beta1
What operating system and processor architecture are you using (`go env`)?
```
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/pmattis/Development/go"
GORACE=""
GOROOT="/Users/pmattis/Development/go-1.8"
GOTOOLDIR="/Users/pmattis/Development/go-1.8/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/qc/fpqpgdqd167c70dtc6840xxh0000gn/T/go-build385423377=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
```
What did you do?
A recent change to github.com/cockroachdb/cockroach replaced a synchronous call with one wrapped in a goroutine. This small change resulted in a significant slowdown in some benchmarks. The slowdown was traced to additional time being spent in `runtime.morestack`. The problematic goroutines are all hitting a single gRPC entrypoint, `Server.Batch`, and the code paths that fan out from this entrypoint tend to use an excessive amount of stack due to an over-reliance on passing and returning by value instead of using pointers. Typical calls use 16-32 KB of stack.

The expensive part of `runtime.morestack` is the adjustment of existing pointers on the stack. And due to the incremental nature of the stack growth, I can see the stack growing in 4 steps from 2 KB to 32 KB. So we experimented with a hack to pre-grow the stack. Voila, the performance penalty of the change disappeared:
```
name                old time/op  new time/op  delta
KVInsert1_SQL-8      339µs ± 2%   312µs ± 1%   -7.89%  (p=0.000 n=10+10)
KVInsert10_SQL-8     485µs ± 2%   471µs ± 1%   -2.81%  (p=0.000 n=10+10)
KVInsert100_SQL-8   1.36ms ± 0%  1.35ms ± 0%   -0.95%  (p=0.000 n=10+10)
KVUpdate1_SQL-8      535µs ± 1%   487µs ± 1%   -9.02%  (p=0.000 n=10+9)
KVUpdate10_SQL-8     777µs ± 1%   730µs ± 1%   -6.03%  (p=0.000 n=10+9)
KVUpdate100_SQL-8   2.69ms ± 1%  2.66ms ± 1%   -1.16%  (p=0.000 n=10+10)
KVDelete1_SQL-8      479µs ± 1%   429µs ± 2%  -10.43%  (p=0.000 n=9+10)
KVDelete10_SQL-8     676µs ± 1%   637µs ± 1%   -5.80%  (p=0.000 n=9+9)
KVDelete100_SQL-8   2.23ms ± 5%  2.18ms ± 4%      ~    (p=0.105 n=10+10)
KVScan1_SQL-8        216µs ± 5%   179µs ± 1%  -17.12%  (p=0.000 n=10+10)
KVScan10_SQL-8       233µs ± 1%   201µs ± 1%  -13.76%  (p=0.000 n=10+10)
KVScan100_SQL-8      463µs ± 1%   437µs ± 0%   -5.64%  (p=0.000 n=10+8)
```
`old` are benchmarks gathered using go1.8beta1 and `new` are on go1.8beta1 with the hack to pre-grow the stack. The hack is a call at the beginning of `server.Batch` to a `growStack` method:
```go
var growStackGlobal = false

//go:noinline
func growStack() {
	// Goroutine stacks currently start at 2 KB in size. The code paths through
	// the storage package often need a stack that is 32 KB in size. The stack
	// growth is mildly expensive making it useful to trick the runtime into
	// growing the stack early. Since goroutine stacks grow in multiples of 2 and
	// start at 2 KB in size, by placing a 16 KB object on the stack early in the
	// lifetime of a goroutine we force the runtime to use a 32 KB stack for the
	// goroutine.
	var buf [16 << 10] /* 16 KB */ byte
	if growStackGlobal {
		// Make sure the compiler doesn't optimize away buf.
		for i := range buf {
			buf[i] = byte(i)
		}
	}
}
```
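For context, the call site is simply the first statement of the handler. A minimal sketch of the wiring, with placeholder request/response types standing in for the real gRPC message types:

```go
package server

import "context"

// BatchRequest and BatchResponse are placeholders for the real gRPC
// message types accepted by the cockroach Server.Batch entrypoint.
type BatchRequest struct{}
type BatchResponse struct{}

type Server struct{}

// Batch is the gRPC entrypoint. Growing the stack once here, while the
// goroutine's stack is still shallow, covers all of the deep call paths
// that fan out below it.
func (s *Server) Batch(ctx context.Context, req *BatchRequest) (*BatchResponse, error) {
	growStack()
	return s.batchInternal(ctx, req)
}

func (s *Server) batchInternal(ctx context.Context, req *BatchRequest) (*BatchResponse, error) {
	// ... the real work, which typically uses 16-32 KB of stack ...
	return &BatchResponse{}, nil
}
```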
The question here is whether this is copacetic, and also to alert the runtime folks that there is a performance opportunity here. Note that the `growStackGlobal` check is not currently necessary, but I wanted to future-proof against the compiler deciding that `buf` is unused.
Longer term, the stack usage under `server.Batch` should be reduced on our side. I'm guessing that we could get the stack usage down to 8-16 KB without too many contortions. But even with such reductions, being able to pre-grow the stack for a goroutine looks beneficial.
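For anyone who wants to reproduce the effect outside of cockroach, here is a standalone sketch (it assumes the `growStack` above is in scope; `useStack` is just a stand-in for the deep storage call paths, not real cockroach code). Each fresh goroutine starts at 2 KB and must grow its stack while frames and pointers are live; pre-growing moves that cost to the start of the goroutine's life, when there is almost nothing on the stack to copy and adjust:

```go
package server // in a _test.go file alongside growStack

import (
	"sync"
	"testing"
)

// useStack recursively burns roughly 1 KB of stack per frame, so
// useStack(24) needs ~24 KB, standing in for the call paths under
// Server.Batch.
//go:noinline
func useStack(depth int) byte {
	var buf [1 << 10]byte
	buf[0] = byte(depth) // touch buf so the frame isn't eliminated
	if depth > 0 {
		return useStack(depth-1) + buf[0]
	}
	return buf[0]
}

// BenchmarkNewGoroutine runs each unit of work on a fresh goroutine, so
// every iteration grows the stack from 2 KB upward while frames are live.
func BenchmarkNewGoroutine(b *testing.B) {
	var wg sync.WaitGroup
	for i := 0; i < b.N; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			useStack(24)
		}()
		wg.Wait()
	}
}

// BenchmarkPreGrown is identical except that growStack runs first, so the
// growth happens while the stack is nearly empty and copying is cheap.
func BenchmarkPreGrown(b *testing.B) {
	var wg sync.WaitGroup
	for i := 0; i < b.N; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			growStack()
			useStack(24)
		}()
		wg.Wait()
	}
}
```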