Description
What version of Go are you using (`go version`)?
```
go version devel +41908a5 Thu Dec 1 02:54:21 2016 +0000 darwin/amd64
```

a.k.a. go1.8beta1
What operating system and processor architecture are you using (`go env`)?
```
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/pmattis/Development/go"
GORACE=""
GOROOT="/Users/pmattis/Development/go-1.8"
GOTOOLDIR="/Users/pmattis/Development/go-1.8/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/qc/fpqpgdqd167c70dtc6840xxh0000gn/T/go-build385423377=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
```
What did you do?
A recent change to github.com/cockroachdb/cockroach replaced a synchronous call with one wrapped in a goroutine. This small change resulted in a significant slowdown in some benchmarks. The slowdown was traced to additional time being spent in `runtime.morestack`. The problematic goroutines are all hitting a single gRPC entrypoint, `Server.Batch`, and the code paths that fan out from this entrypoint tend to use an excessive amount of stack due to an over-reliance on passing and returning by value instead of using pointers. Typical calls use 16-32 KB of stack.

The expensive part of `runtime.morestack` is the adjustment of existing pointers on the stack. And due to the incremental nature of the stack growth, I can see the stack growing in 4 steps from 2 KB to 32 KB. So we experimented with a hack to pre-grow the stack. Voila, the performance penalty of the change disappeared:
```
name                old time/op  new time/op  delta
KVInsert1_SQL-8      339µs ± 2%   312µs ± 1%   -7.89%  (p=0.000 n=10+10)
KVInsert10_SQL-8     485µs ± 2%   471µs ± 1%   -2.81%  (p=0.000 n=10+10)
KVInsert100_SQL-8   1.36ms ± 0%  1.35ms ± 0%   -0.95%  (p=0.000 n=10+10)
KVUpdate1_SQL-8      535µs ± 1%   487µs ± 1%   -9.02%  (p=0.000 n=10+9)
KVUpdate10_SQL-8     777µs ± 1%   730µs ± 1%   -6.03%  (p=0.000 n=10+9)
KVUpdate100_SQL-8   2.69ms ± 1%  2.66ms ± 1%   -1.16%  (p=0.000 n=10+10)
KVDelete1_SQL-8      479µs ± 1%   429µs ± 2%  -10.43%  (p=0.000 n=9+10)
KVDelete10_SQL-8     676µs ± 1%   637µs ± 1%   -5.80%  (p=0.000 n=9+9)
KVDelete100_SQL-8   2.23ms ± 5%  2.18ms ± 4%      ~    (p=0.105 n=10+10)
KVScan1_SQL-8        216µs ± 5%   179µs ± 1%  -17.12%  (p=0.000 n=10+10)
KVScan10_SQL-8       233µs ± 1%   201µs ± 1%  -13.76%  (p=0.000 n=10+10)
KVScan100_SQL-8      463µs ± 1%   437µs ± 0%   -5.64%  (p=0.000 n=10+8)
```
`old` are benchmarks gathered using go1.8beta1 and `new` are on go1.8beta1 with the hack to pre-grow the stack. The hack is a call at the beginning of `server.Batch` to a `growStack` method:
```go
var growStackGlobal = false

//go:noinline
func growStack() {
	// Goroutine stacks currently start at 2 KB in size. The code paths through
	// the storage package often need a stack that is 32 KB in size. The stack
	// growth is mildly expensive making it useful to trick the runtime into
	// growing the stack early. Since goroutine stacks grow in multiples of 2 and
	// start at 2 KB in size, by placing a 16 KB object on the stack early in the
	// lifetime of a goroutine we force the runtime to use a 32 KB stack for the
	// goroutine.
	var buf [16 << 10] /* 16 KB */ byte
	if growStackGlobal {
		// Make sure the compiler doesn't optimize away buf.
		for i := range buf {
			buf[i] = byte(i)
		}
	}
}
```
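For context, the call site is simply the first statement of the handler. A minimal sketch of the wiring, with placeholder request/response types standing in for the real gRPC message types:

```go
package server

import "context"

// BatchRequest and BatchResponse are placeholders for the real gRPC
// message types accepted by the cockroach Server.Batch entrypoint.
type BatchRequest struct{}
type BatchResponse struct{}

type Server struct{}

// Batch is the gRPC entrypoint. Growing the stack once here, while the
// goroutine's stack is still shallow, covers all of the deep call paths
// that fan out below it.
func (s *Server) Batch(ctx context.Context, req *BatchRequest) (*BatchResponse, error) {
	growStack()
	return s.batchInternal(ctx, req)
}

func (s *Server) batchInternal(ctx context.Context, req *BatchRequest) (*BatchResponse, error) {
	// ... the real work, which typically uses 16-32 KB of stack ...
	return &BatchResponse{}, nil
}
```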
The question here is whether this is copacetic, and also to alert the runtime folks that there is a performance opportunity here. Note that the `growStackGlobal` check is not currently necessary, but I wanted to future-proof against the compiler deciding that `buf` is unused.
Longer term, the stack usage under `server.Batch` should be reduced on our side. I'm guessing that we could get the stack usage down to 8-16 KB without too many contortions. But even with such reductions, being able to pre-grow the stack for a goroutine looks beneficial.
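For anyone who wants to reproduce the effect outside of cockroach, here is a standalone sketch (it assumes the `growStack` above is in scope; `useStack` is just a stand-in for the deep storage call paths, not real cockroach code). Each fresh goroutine starts at 2 KB and must grow its stack while frames and pointers are live; pre-growing moves that cost to the start of the goroutine's life, when there is almost nothing on the stack to copy and adjust:

```go
package server // in a _test.go file alongside growStack

import (
	"sync"
	"testing"
)

// useStack recursively burns roughly 1 KB of stack per frame, so
// useStack(24) needs ~24 KB, standing in for the call paths under
// Server.Batch.
//go:noinline
func useStack(depth int) byte {
	var buf [1 << 10]byte
	buf[0] = byte(depth) // touch buf so the frame isn't eliminated
	if depth > 0 {
		return useStack(depth-1) + buf[0]
	}
	return buf[0]
}

// BenchmarkNewGoroutine runs each unit of work on a fresh goroutine, so
// every iteration grows the stack from 2 KB upward while frames are live.
func BenchmarkNewGoroutine(b *testing.B) {
	var wg sync.WaitGroup
	for i := 0; i < b.N; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			useStack(24)
		}()
		wg.Wait()
	}
}

// BenchmarkPreGrown is identical except that growStack runs first, so the
// growth happens while the stack is nearly empty and copying is cheap.
func BenchmarkPreGrown(b *testing.B) {
	var wg sync.WaitGroup
	for i := 0; i < b.N; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			growStack()
			useStack(24)
		}()
		wg.Wait()
	}
}
```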