
runtime: new goroutines can spend excessive time in morestack #18138

Open
@petermattis

Description

What version of Go are you using (go version)?

go version devel +41908a5 Thu Dec 1 02:54:21 2016 +0000 darwin/amd64 a.k.a go1.8beta1

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/pmattis/Development/go"
GORACE=""
GOROOT="/Users/pmattis/Development/go-1.8"
GOTOOLDIR="/Users/pmattis/Development/go-1.8/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/qc/fpqpgdqd167c70dtc6840xxh0000gn/T/go-build385423377=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"

What did you do?

A recent change to github.com/cockroachdb/cockroach replaced a synchronous call with one wrapped in a goroutine. This small change caused a significant slowdown in some benchmarks, which we traced to additional time spent in runtime.morestack. The problematic goroutines all hit a single gRPC entry point, Server.Batch, and the code paths that fan out from this entry point tend to use an excessive amount of stack due to an over-reliance on passing and returning by value instead of using pointers. Typical calls use 16-32 KB of stack.

The expensive part of runtime.morestack is adjusting the existing pointers on the stack. And because stack growth is incremental, I can see the stack growing in 4 steps from 2 KB to 32 KB. So we experimented with a hack to pre-grow the stack. Voila, the performance penalty of the change disappeared:

name               old time/op  new time/op  delta
KVInsert1_SQL-8     339µs ± 2%   312µs ± 1%   -7.89%  (p=0.000 n=10+10)
KVInsert10_SQL-8    485µs ± 2%   471µs ± 1%   -2.81%  (p=0.000 n=10+10)
KVInsert100_SQL-8  1.36ms ± 0%  1.35ms ± 0%   -0.95%  (p=0.000 n=10+10)
KVUpdate1_SQL-8     535µs ± 1%   487µs ± 1%   -9.02%   (p=0.000 n=10+9)
KVUpdate10_SQL-8    777µs ± 1%   730µs ± 1%   -6.03%   (p=0.000 n=10+9)
KVUpdate100_SQL-8  2.69ms ± 1%  2.66ms ± 1%   -1.16%  (p=0.000 n=10+10)
KVDelete1_SQL-8     479µs ± 1%   429µs ± 2%  -10.43%   (p=0.000 n=9+10)
KVDelete10_SQL-8    676µs ± 1%   637µs ± 1%   -5.80%    (p=0.000 n=9+9)
KVDelete100_SQL-8  2.23ms ± 5%  2.18ms ± 4%     ~     (p=0.105 n=10+10)
KVScan1_SQL-8       216µs ± 5%   179µs ± 1%  -17.12%  (p=0.000 n=10+10)
KVScan10_SQL-8      233µs ± 1%   201µs ± 1%  -13.76%  (p=0.000 n=10+10)
KVScan100_SQL-8     463µs ± 1%   437µs ± 0%   -5.64%   (p=0.000 n=10+8)

old is go1.8beta1 and new is go1.8beta1 plus the hack to pre-grow the stack. The hack is a call to a growStack function at the beginning of server.Batch:

var growStackGlobal = false

//go:noinline
func growStack() {
	// Goroutine stacks currently start at 2 KB in size. The code paths through
	// the storage package often need a stack that is 32 KB in size. The stack
	// growth is mildly expensive making it useful to trick the runtime into
	// growing the stack early. Since goroutine stacks grow in multiples of 2 and
	// start at 2 KB in size, by placing a 16 KB object on the stack early in the
	// lifetime of a goroutine we force the runtime to use a 32 KB stack for the
	// goroutine.
	var buf [16 << 10] /* 16 KB */ byte
	if growStackGlobal {
		// Make sure the compiler doesn't optimize away buf.
		for i := range buf {
			buf[i] = byte(i)
		}
	}
}

The question here is whether this is copacetic, and also to alert the runtime folks that there is a performance opportunity here. Note that growStackGlobal is not currently necessary, but I wanted to future-proof against the compiler deciding that buf is unnecessary.

Longer term, the stack usage under server.Batch should be reduced on our side. I'm guessing that we could get the stack usage down to 8-16 KB without too many contortions. But even with such reductions, being able to pre-grow the stack for a goroutine looks beneficial.

Metadata

Labels: NeedsDecision (feedback is required from experts, contributors, and/or the community before a change can be made), Performance
