Skip to content

runtime: severe performance drop for cgo calls in go1.22.5 #68587

Closed
@zeroshade

Description

@zeroshade

Go version

go version go1.22.5 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/zeroshade/.cache/go-build'
GOENV='/home/zeroshade/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/zeroshade/go/pkg/mod'
GONOPROXY='github.com/voltrondata/*'
GONOSUMDB='github.com/voltrondata/*'
GOOS='linux'
GOPATH='/home/zeroshade/go'
GOPRIVATE='github.com/voltrondata/*'
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/lib/go-1.22'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/lib/go-1.22/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.22.5'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/home/zeroshade/Projects/GitHub/arrow-adbc/go/adbc/go.mod'
GOWORK='/home/zeroshade/Projects/GitHub/go.work'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build279063330=/tmp/go-build -gno-record-gcc-switches'

What did you do?

Building a shared library with Go that is loaded and called from C. I was able to reproduce a trivial example with the following:

package main

import "C"

//export trivialFunctionCall
func trivialFunctionCall(v uintptr) C.int {
    return C.int(v)
}

func main() {}

Build with go build -o libtrivial.so -buildmode=c-shared main.go

#include "libtrivial.h"
#include <stdio.h>

int main() {
    int a = 0;
    for (int i = 0; i < 500; ++i) {
        a = trivialFunctionCall((size_t)i);
    }
    printf("Done %d\n", a);
}

Build with gcc -o trivial trivial.c -ltrivial -L. -Wl,-rpath,..

Then we run and record the performance: perf record --call-graph lbr -- ./trivial and look at the report with perf report -g.

What did you see happen?

The overwhelming amount of time in the cgocall is spent in __isoc99_sscanf as seen here:

+   97.82%     0.00%  trivial  trivial               [.] _start
+   97.82%     0.00%  trivial  libc.so.6             [.] __libc_start_main
+   97.82%     0.00%  trivial  libc.so.6             [.] 0x0000729379829d8e
+   97.82%     0.00%  trivial  trivial               [.] main
-   97.82%     0.00%  trivial  libtrivial.so         [.] trivialFunctionCall
     trivialFunctionCall
     crosscall2
     runtime.cgocallback.abi0
     runtime.cgocallbackg.abi0
   - runtime.cgocallbackg
      - 96.56% runtime.callbackUpdateSystemStack
           runtime.asmcgocall.abi0
         - x_cgo_getstackbound
            - 94.01% pthread_getattr_np
               + 76.00% __isoc99_sscanf
               + 9.14% __getdelim
                 2.96% getline
               + 2.06% fopen
               + 1.97% fclose
               + 1.88% cfree
              2.56% pthread_getattr_np@plt
      + 1.25% runtime.exitsyscall
+   97.82%     0.00%  trivial  libtrivial.so         [.] crosscall2
+   97.82%     0.00%  trivial  libtrivial.so         [.] runtime.cgocallback.abi0
+   97.82%     0.00%  trivial  libtrivial.so         [.] runtime.cgocallbackg.abi0
+   97.82%     0.00%  trivial  libtrivial.so         [.] runtime.cgocallbackg
+   96.56%     2.56%  trivial  libc.so.6             [.] pthread_getattr_np
+   96.56%     0.00%  trivial  libtrivial.so         [.] runtime.callbackUpdateSystemStack
+   96.56%     0.00%  trivial  libtrivial.so         [.] runtime.asmcgocall.abi0
+   96.56%     0.00%  trivial  libtrivial.so         [.] x_cgo_getstackbound
+   76.00%     0.00%  trivial  libc.so.6             [.] __isoc99_sscanf
+   42.52%     0.00%  trivial  libc.so.6             [.] 0x00007293798629e0
+   16.45%     0.00%  trivial  libc.so.6             [.] 0x0000729379865265
+   16.45%     0.00%  trivial  libc.so.6             [.] __strtoull_internal
+   10.59%     3.14%  trivial  libc.so.6             [.] __getdelim

If we instead build using GOTOOLCHAIN=go1.22.4 go build -o libtrivial.so -buildmode=c-shared main.go and re-run it, you see that the cost of callbackUpdateSystemStack and pthread_getattr_np don't even register in the performance trace:

+   92.66%    92.66%  trivial  libtrivial.so  [.] runtime.execute
+   92.66%     0.00%  trivial  libtrivial.so  [.] runtime.mallocgc
+   92.66%     0.00%  trivial  libtrivial.so  [.] runtime.gcTrigger.test
+   92.66%     0.00%  trivial  libtrivial.so  [.] runtime.morestack.abi0
+   92.66%     0.00%  trivial  libtrivial.so  [.] runtime.newstack
+   92.66%     0.00%  trivial  libtrivial.so  [.] runtime.goschedImpl
+   92.66%     0.00%  trivial  libtrivial.so  [.] runtime.schedule
+    7.34%     0.00%  trivial  libtrivial.so  [.] runtime.sigtramp.abi0
+    7.34%     0.00%  trivial  libtrivial.so  [.] runtime.sigtrampgo
+    6.78%     6.78%  trivial  libtrivial.so  [.] runtime.sighandler
+    0.56%     0.56%  trivial  libtrivial.so  [.] runtime.sigfwdgo
     0.00%     0.00%  trivial  libtrivial.so  [.] runtime.cgoSigtramp.abi0

The problem was introduced by 3560cf0afb which forces all calls to cgo to update the system stack bounds. Ultimately the problem traces down to the glibc implementation of pthread_getattr_np which is called by

pthread_getattr_np(pthread_self(), &attr); // GNU extension

In glibc, pthread_getattr_np ends up directly parsing /proc/self/maps by using sscanf, hence the high cost as seen here https://github.com/lattera/glibc/blob/master/nptl/pthread_getattr_np.c#L81-L124

It looks like musl's implementation of pthread_getattr_np isn't as expensive as glibc's version though.

For the full, in-context explanation see apache/arrow-adbc#2021 in context we're seeing a test run go from ~2s to now taking over 5.5s due to this issue and are concerned it may end up affecting consumers of the library

What did you expect to see?

Shifting from go1.22.4 to go1.22.5 shouldn't come with such a significant drop in cgo's FFI performance when calling a Go shared library function from C that the cost is dominated by calls to sscanf if the underlying Go code itself isn't expensive enough. While the example used to reproduce is a trivial one to exemplify the problem, in the context of the actual library, the cgo calls are non-trivial and involve multiple network connection initializations, data processing and more, yet several of the calls end up completely dominated by the sscanf call in pthread_getattr_np.

Either we need find more situations where we can avoid having to fully update the stack bounds via x_cgo_getstackbound or there needs to be a less expensive way to get those bounds than using pthread_getattr_np when using glibc due to the high cost of it just parsing /proc/self/maps.

Activity

zeroshade

zeroshade commented on Jul 25, 2024

@zeroshade
Author

It also looks like the Rust devs are running into this problem too: https://internals.rust-lang.org/t/who-is-doing-read-proc-self-maps-1024-at-startup/17348

added
NeedsInvestigationSomeone must examine and confirm this is a valid issue and not a duplicate of an existing one.
on Jul 25, 2024
dmitshur

dmitshur commented on Jul 25, 2024

@dmitshur
Contributor

CC @prattmic (FYI the report reports this is introduced with CL 585935), @golang/runtime.

cherrymui

cherrymui commented on Jul 25, 2024

@cherrymui
Member

Thanks for the report. From the Rust post it sounds like it only does the read if it is called for the main thread. Could you try calling Go from a different thread, a pthread-created one?

zeroshade

zeroshade commented on Jul 25, 2024

@zeroshade
Author

@cherrymui I only included the rust post because it was referring to the same scenario of calling pthread_getattr_np in hindsight it was probably not really great to include as the problem here is frequency (which is not the issue for the Rust post).

In the context of the original issue it's being called through Python loading up the shared library and then making calls to it. As this is a client library that we provide we don't have control over whether consumers call from the main thread or other threads. I can try calling from another thread instead and seeing if that mitigates the issue a bit, but that wouldn't be a viable solution IMO as we can't expect all calls to the shared library to be funneled through a separate thread instead of being able to use the main thread. Right?

cherrymui

cherrymui commented on Jul 25, 2024

@cherrymui
Member

Yes, trying calling from a different thread is for validating the assumption. Calling on the main thread should not be slow, either. So if that's the case we'll look into a targeted solution. Thanks.

zeroshade

zeroshade commented on Jul 25, 2024

@zeroshade
Author

Looks like you're right!

I updated the code from the reproducer to the following:

#include "libtrivial.h"
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

void* f(void*) {
    int a = 0;
    for (int i = 0; i < 500; ++i) {
        a = trivialFunctionCall((size_t)i);
    }

    printf("Done %d\n", a);
    pthread_exit(NULL);
}

int main() {
    pthread_t thr;
    int rc;
    if ((rc = pthread_create(&thr, NULL, &f, NULL)) {
        fprintf(stderr, "error: pthread_create: rc: %d\n", rc);
        return EXIT_FAILURE;
    }

    pthread_join(thr, NULL);
}

After running the result using perf record -g and checking the report, sscanf is not showing up at all in the report anymore.

So this just needs to be handled for the main thread as it appears to not cause slowdowns on other threads, only if you call from the main thread.

added a commit that references this issue on Jul 26, 2024
changed the title runtime/cgocall: severe performance drop for cgo calls in go1.22.5 runtime: severe performance drop for cgo calls in go1.22.5 on Jul 26, 2024
mknyszek

mknyszek commented on Jul 31, 2024

@mknyszek
Contributor

In triage, we're thinking we're probably stuck with having to work around this by caching the stack bounds one way or another.

19 remaining items

added this to the Go1.24 milestone on Oct 22, 2024
gopherbot

gopherbot commented on Oct 24, 2024

@gopherbot
Contributor

Change https://go.dev/cl/600296 mentions this issue: runtime: update and restore g0 stack bounds at cgocallback

cherrymui

cherrymui commented on Oct 24, 2024

@cherrymui
Member

I updated CL 600296 to cover this, implemented basically #68587 (comment). If the stack moved, we use a new bounds of [SP, SP+32KB), which I think is not that different from [0, ∞).

I tried on a Linux glibc machine and that gets mostly of the performance back:

name                     old time/op  new time/op  delta
CgoCallbackMainThread-8  96.4µs ± 3%   0.1µs ± 2%  -99.92%  (p=0.000 n=10+9)

Feel free to try the CL, and let me know if there is any issue. Thanks!

jakebailey

jakebailey commented on Oct 24, 2024

@jakebailey
Contributor

I'm happy to say that CL 600296 does fix the performance issue I was observing.

(It appears to be slightly slower than a straight revert of CL 584597, but there is of course new code there to fix the original bug, and it's really, really close, so I of course cannot complain.)

added
NeedsFixThe path to resolution is known, but the work has not been done.
and removed
NeedsInvestigationSomeone must examine and confirm this is a valid issue and not a duplicate of an existing one.
on Oct 30, 2024
added a commit that references this issue on Oct 30, 2024
76a8409
added
CriticalA critical problem that affects the availability or correctness of production systems built using Go
on Nov 6, 2024
gopherbot

gopherbot commented on Dec 12, 2024

@gopherbot
Contributor

Change https://go.dev/cl/635775 mentions this issue: [release-branch.go1.23] runtime: update and restore g0 stack bounds at cgocallback

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

Labels

CriticalA critical problem that affects the availability or correctness of production systems built using GoNeedsFixThe path to resolution is known, but the work has not been done.Performancecompiler/runtimeIssues related to the Go compiler and/or runtime.release-blocker

Type

No type

Projects

  • Status

    Done

Relationships

None yet

Development

No branches or pull requests

    Participants

    @zeroshade@mknyszek@dmitshur@dr2chase@Splizard

    Issue actions

      runtime: severe performance drop for cgo calls in go1.22.5 · Issue #68587 · golang/go