Go version
go version go1.22.5 linux/amd64
Output of go env in your module/workspace:
GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/zeroshade/.cache/go-build'
GOENV='/home/zeroshade/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/zeroshade/go/pkg/mod'
GONOPROXY='github.com/voltrondata/*'
GONOSUMDB='github.com/voltrondata/*'
GOOS='linux'
GOPATH='/home/zeroshade/go'
GOPRIVATE='github.com/voltrondata/*'
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/lib/go-1.22'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/lib/go-1.22/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.22.5'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/home/zeroshade/Projects/GitHub/arrow-adbc/go/adbc/go.mod'
GOWORK='/home/zeroshade/Projects/GitHub/go.work'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build279063330=/tmp/go-build -gno-record-gcc-switches'

What did you do?
Building a shared library with Go that is loaded and called from C. I was able to reproduce the problem with the following trivial example:

package main

import "C"

//export trivialFunctionCall
func trivialFunctionCall(v uintptr) C.int {
    return C.int(v)
}

func main() {}

Build with go build -o libtrivial.so -buildmode=c-shared main.go, then call it from the following C program (trivial.c):
#include "libtrivial.h"
#include <stdio.h>
int main() {
int a = 0;
for (int i = 0; i < 500; ++i) {
a = trivialFunctionCall((size_t)i);
}
printf("Done %d\n", a);
}Build with gcc -o trivial trivial.c -ltrivial -L. -Wl,-rpath,..
Then run it and record the performance with perf record --call-graph lbr -- ./trivial, and look at the report with perf report -g.
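As a rough cross-check that doesn't need perf, a simple timing harness can quantify the per-call overhead. This is a hypothetical variant of trivial.c that I'm sketching here (same libtrivial.h), not part of the original reproduction:

#include "libtrivial.h"
#include <stdio.h>
#include <time.h>

int main() {
    struct timespec start, end;
    int a = 0;
    // Time the same 500-iteration loop used in the reproduction.
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < 500; ++i) {
        a = trivialFunctionCall((size_t)i);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("Done %d, %.0f ns/call\n", a, ns / 500);
}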
What did you see happen?
The overwhelming majority of the time in the cgo call is spent in __isoc99_sscanf, as seen here:
+ 97.82% 0.00% trivial trivial [.] _start
+ 97.82% 0.00% trivial libc.so.6 [.] __libc_start_main
+ 97.82% 0.00% trivial libc.so.6 [.] 0x0000729379829d8e
+ 97.82% 0.00% trivial trivial [.] main
- 97.82% 0.00% trivial libtrivial.so [.] trivialFunctionCall
trivialFunctionCall
crosscall2
runtime.cgocallback.abi0
runtime.cgocallbackg.abi0
- runtime.cgocallbackg
- 96.56% runtime.callbackUpdateSystemStack
runtime.asmcgocall.abi0
- x_cgo_getstackbound
- 94.01% pthread_getattr_np
+ 76.00% __isoc99_sscanf
+ 9.14% __getdelim
2.96% getline
+ 2.06% fopen
+ 1.97% fclose
+ 1.88% cfree
2.56% pthread_getattr_np@plt
+ 1.25% runtime.exitsyscall
+ 97.82% 0.00% trivial libtrivial.so [.] crosscall2
+ 97.82% 0.00% trivial libtrivial.so [.] runtime.cgocallback.abi0
+ 97.82% 0.00% trivial libtrivial.so [.] runtime.cgocallbackg.abi0
+ 97.82% 0.00% trivial libtrivial.so [.] runtime.cgocallbackg
+ 96.56% 2.56% trivial libc.so.6 [.] pthread_getattr_np
+ 96.56% 0.00% trivial libtrivial.so [.] runtime.callbackUpdateSystemStack
+ 96.56% 0.00% trivial libtrivial.so [.] runtime.asmcgocall.abi0
+ 96.56% 0.00% trivial libtrivial.so [.] x_cgo_getstackbound
+ 76.00% 0.00% trivial libc.so.6 [.] __isoc99_sscanf
+ 42.52% 0.00% trivial libc.so.6 [.] 0x00007293798629e0
+ 16.45% 0.00% trivial libc.so.6 [.] 0x0000729379865265
+ 16.45% 0.00% trivial libc.so.6 [.] __strtoull_internal
+ 10.59% 3.14% trivial libc.so.6 [.] __getdelim
If we instead build with GOTOOLCHAIN=go1.22.4 go build -o libtrivial.so -buildmode=c-shared main.go and re-run, callbackUpdateSystemStack and pthread_getattr_np don't even register in the performance trace:
+ 92.66% 92.66% trivial libtrivial.so [.] runtime.execute
+ 92.66% 0.00% trivial libtrivial.so [.] runtime.mallocgc
+ 92.66% 0.00% trivial libtrivial.so [.] runtime.gcTrigger.test
+ 92.66% 0.00% trivial libtrivial.so [.] runtime.morestack.abi0
+ 92.66% 0.00% trivial libtrivial.so [.] runtime.newstack
+ 92.66% 0.00% trivial libtrivial.so [.] runtime.goschedImpl
+ 92.66% 0.00% trivial libtrivial.so [.] runtime.schedule
+ 7.34% 0.00% trivial libtrivial.so [.] runtime.sigtramp.abi0
+ 7.34% 0.00% trivial libtrivial.so [.] runtime.sigtrampgo
+ 6.78% 6.78% trivial libtrivial.so [.] runtime.sighandler
+ 0.56% 0.56% trivial libtrivial.so [.] runtime.sigfwdgo
0.00% 0.00% trivial libtrivial.so [.] runtime.cgoSigtramp.abi0
The problem was introduced by 3560cf0afb, which forces every cgo callback into Go to update the system stack bounds. Ultimately the problem traces down to the glibc implementation of pthread_getattr_np, which is called from src/runtime/cgo/gcc_stack_unix.c (line 28 at 792a261):

pthread_getattr_np(pthread_self(), &attr); // GNU extension
In glibc, pthread_getattr_np ends up directly parsing /proc/self/maps using sscanf, hence the high cost; see https://github.com/lattera/glibc/blob/master/nptl/pthread_getattr_np.c#L81-L124
It looks like musl's implementation of pthread_getattr_np isn't as expensive as glibc's version though.
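To confirm the cost comes from glibc itself rather than anything on the Go side, a standalone micro-benchmark (my own sketch, not from the original report) that just calls pthread_getattr_np on the main thread exhibits the same /proc/self/maps parsing cost:

// getattr_bench.c -- build with: gcc -O2 -pthread -o getattr_bench getattr_bench.c
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <time.h>

int main() {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < 500; ++i) {
        // On glibc, for the initial thread this ends up parsing
        // /proc/self/maps to discover the stack mapping.
        pthread_attr_t attr;
        pthread_getattr_np(pthread_self(), &attr);
        pthread_attr_destroy(&attr);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("%.0f ns per pthread_getattr_np call\n", ns / 500);
}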
For the full, in-context explanation see apache/arrow-adbc#2021. In that context we're seeing a test run go from ~2s to over 5.5s due to this issue, and we're concerned it may end up affecting consumers of the library.
What did you expect to see?
Moving from go1.22.4 to go1.22.5 shouldn't come with such a significant drop in cgo's FFI performance when calling a Go shared-library function from C: unless the underlying Go code is itself expensive, the cost ends up dominated by calls to sscanf. While the example used to reproduce the problem is deliberately trivial, in the context of the actual library the cgo calls are non-trivial and involve multiple network connection initializations, data processing, and more, yet several of the calls still end up completely dominated by the sscanf call inside pthread_getattr_np.
Either we need to find more situations where we can avoid fully updating the stack bounds via x_cgo_getstackbound, or there needs to be a less expensive way to get those bounds on glibc than pthread_getattr_np, given the high cost of it parsing /proc/self/maps.
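As an illustration of the second option, here is a sketch of the caching idea only (not the Go runtime's actual code), assuming a thread's stack bounds don't change between callbacks, which is precisely the assumption 3560cf0afb stopped making; the expensive query would then run at most once per thread:

// Illustrative sketch only: cache the stack bounds in a thread-local so
// pthread_getattr_np (and its /proc/self/maps parse on glibc) runs at most
// once per thread rather than on every cgo callback.
#define _GNU_SOURCE
#include <pthread.h>
#include <stddef.h>

static __thread void  *cached_stack_addr;
static __thread size_t cached_stack_size;

static void get_stack_bounds(void **addr, size_t *size) {
    if (cached_stack_addr == NULL) {
        pthread_attr_t attr;
        pthread_getattr_np(pthread_self(), &attr);   // expensive on glibc (main thread)
        pthread_attr_getstack(&attr, &cached_stack_addr, &cached_stack_size);
        pthread_attr_destroy(&attr);
    }
    *addr = cached_stack_addr;
    *size = cached_stack_size;
}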