Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

auth: Significant performance degradation with new authentication library in Spanner client (100x slower) #10927

Open
toga4 opened this issue Sep 27, 2024 · 5 comments
Assignees
Labels
api: spanner Issues related to the Spanner API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@toga4
Copy link
Contributor

toga4 commented Sep 27, 2024

Client

Spanner. probably all clients which uses new auth lib.

Environment

Apple M3

% go version
go version go1.23.1 darwin/arm64
% go env
GO111MODULE=''
GOARCH='arm64'
GOBIN=''
GOCACHE='/Users/user/Library/Caches/go-build'
GOENV='/Users/user/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/user/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='darwin'
GOPATH='/Users/user/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/opt/homebrew/Cellar/go/1.23.1/libexec'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/opt/homebrew/Cellar/go/1.23.1/libexec/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.23.1'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/Users/user/Library/Application Support/go/telemetry'
GCCGO='gccgo'
GOARM64='v8.0'
AR='ar'
CC='cc'
CXX='c++'
CGO_ENABLED='1'
GOMOD='/Users/user/work/20240927_spanner-bench/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/09/jr07h_m17kvbbwjgzdckhrfc0000gn/T/go-build4164472710=/tmp/go-build -gno-record-gcc-switches -fno-common'

Code and Dependencies

package main

import (
	"context"
	"log"
	"os"
	"testing"

	"cloud.google.com/go/spanner"
)

func init() {
	os.Setenv("SPANNER_EMULATOR_HOST", "localhost:9010")
}

func Benchmark_WithAuth(b *testing.B) {
	ctx := context.Background()

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		test(ctx)
	}
}

func Benchmark_WithoutAuth(b *testing.B) {
	ctx := context.Background()

	b.Setenv("GOOGLE_API_GO_EXPERIMENTAL_DISABLE_NEW_AUTH_LIB", "true")
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		test(ctx)
	}
}

func test(ctx context.Context) {
	client, err := spanner.NewClient(ctx, "projects/foo/instances/bar/databases/baz") // Doesn't matter if it doesn't exist.
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	_, _ = client.Apply(ctx, nil)
}
go.mod
module m

go 1.23.1

require cloud.google.com/go/spanner v1.68.0

require (
	cel.dev/expr v0.16.0 // indirect
	cloud.google.com/go v0.115.1 // indirect
	cloud.google.com/go/auth v0.9.3 // indirect
	cloud.google.com/go/auth/oauth2adapt v0.2.4 // indirect
	cloud.google.com/go/compute/metadata v0.5.0 // indirect
	cloud.google.com/go/monitoring v1.21.0 // indirect
	github.com/GoogleCloudPlatform/grpc-gcp-go/grpcgcp v1.5.0 // indirect
	github.com/GoogleCloudPlatform/opentelemetry-operations-go/detectors/gcp v1.24.1 // indirect
	github.com/census-instrumentation/opencensus-proto v0.4.1 // indirect
	github.com/cespare/xxhash/v2 v2.3.0 // indirect
	github.com/cncf/xds/go v0.0.0-20240822171458-6449f94b4d59 // indirect
	github.com/envoyproxy/go-control-plane v0.13.0 // indirect
	github.com/envoyproxy/protoc-gen-validate v1.1.0 // indirect
	github.com/felixge/httpsnoop v1.0.4 // indirect
	github.com/go-logr/logr v1.4.2 // indirect
	github.com/go-logr/stdr v1.2.2 // indirect
	github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da // indirect
	github.com/google/s2a-go v0.1.8 // indirect
	github.com/google/uuid v1.6.0 // indirect
	github.com/googleapis/enterprise-certificate-proxy v0.3.4 // indirect
	github.com/googleapis/gax-go/v2 v2.13.0 // indirect
	github.com/planetscale/vtprotobuf v0.6.1-0.20240319094008-0393e58bdf10 // indirect
	go.opencensus.io v0.24.0 // indirect
	go.opentelemetry.io/contrib/detectors/gcp v1.29.0 // indirect
	go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc v0.54.0 // indirect
	go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.54.0 // indirect
	go.opentelemetry.io/otel v1.29.0 // indirect
	go.opentelemetry.io/otel/metric v1.29.0 // indirect
	go.opentelemetry.io/otel/sdk v1.29.0 // indirect
	go.opentelemetry.io/otel/sdk/metric v1.29.0 // indirect
	go.opentelemetry.io/otel/trace v1.29.0 // indirect
	golang.org/x/crypto v0.27.0 // indirect
	golang.org/x/net v0.29.0 // indirect
	golang.org/x/oauth2 v0.23.0 // indirect
	golang.org/x/sync v0.8.0 // indirect
	golang.org/x/sys v0.25.0 // indirect
	golang.org/x/text v0.18.0 // indirect
	golang.org/x/time v0.6.0 // indirect
	google.golang.org/api v0.197.0 // indirect
	google.golang.org/genproto v0.0.0-20240903143218-8af14fe29dc1 // indirect
	google.golang.org/genproto/googleapis/api v0.0.0-20240903143218-8af14fe29dc1 // indirect
	google.golang.org/genproto/googleapis/rpc v0.0.0-20240903143218-8af14fe29dc1 // indirect
	google.golang.org/grpc v1.66.2 // indirect
	google.golang.org/protobuf v1.34.2 // indirect
)

Expected behavior

equivalent performance.

Actual behavior

% go test -bench . -benchtime 3s
goos: darwin
goarch: arm64
pkg: m
cpu: Apple M3
Benchmark_WithAuth-8                  19        1008749257 ns/op
Benchmark_WithoutAuth-8             3176           1136428 ns/op
PASS
ok      m       26.054s

Additional context

After upgrading to a recent version of the Google Cloud Go SDK, we observed a significant slowdown in the execution time of our end-to-end tests using Cloud Spanner Emulator. The slowdown occurs at the point of the first communication after creating the Client, which led us to suspect that the issue might be related to authentication.

Upon investigating the differences between versions, we discovered that a new authentication library is being used in the updated SDK. The same issue occurs with the DatabaseAdminClient and InstanceAdminClient. We think that the problem likely affects all clients that rely on the new authentication library.

By switching back to the previous authentication library via environment variables GOOGLE_API_GO_EXPERIMENTAL_DISABLE_NEW_AUTH_LIB, we were able to restore the previous execution times.

While we understand that in actual Google Cloud environments, factors like network latency and the frequency of authentication may make this performance difference less impactful, the issue becomes significant in cases where multiple Client instances are generated, such as in test code using the Emulator. In such scenarios, the slowdown can have a substantial impact, and we believe addressing this would be a meaningful improvement.

@toga4 toga4 added the triage me I really want to be triaged. label Sep 27, 2024
@product-auto-label product-auto-label bot added the api: spanner Issues related to the Spanner API. label Sep 27, 2024
@rahul2393
Copy link
Contributor

cc: @quartzmo

@toga4
Copy link
Contributor Author

toga4 commented Sep 27, 2024

Upon further investigation, we found that the latency is due to differences in how the grpc.ClientConn is created between the old and new authentication libraries.

In the previous authentication library, grpc.DialContext is used to create the connection:
https://github.com/googleapis/google-api-go-client/blob/1cd11321a77d6d42ce745edc35c1b9b81a28381f/transport/grpc/dial.go#L354

On the other hand, in the new authentication library, grpc.NewClient is used:

return grpc.NewClient(endpoint, grpcOpts...)

The key difference between grpc.DialContext and grpc.NewClient is default schemes of resolver.
The grpc.DialContext sets the default scheme to passthrough, the grpc.NewClient uses dns as the default.

By analyzing the runtime/trace tool, we discovered that in the new authentication library, takes longer time for resolving DNS. When we tried setting the default scheme to passthrough before running the benchmarks using grpc/resolver.SetDefaultScheme("passthrough"), the latency improved, even with the new authentication library.

I hope this additional investigation is helpful for addressing the issue.

@quartzmo
Copy link
Member

@toga4 Thank you for reporting this issue, and for your thorough and detailed analysis. We will prioritize a fix.

@quartzmo quartzmo changed the title spanner: Significant performance degradation with new authentication library in Spanner client (100x slower) auth: Significant performance degradation with new authentication library in Spanner client (100x slower) Sep 27, 2024
@quartzmo quartzmo added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. and removed triage me I really want to be triaged. labels Sep 27, 2024
@quartzmo
Copy link
Member

@toga4 After investigating this issue a bit, I am not certain that the current implementation default (using the dns scheme instead of passthrough) is detrimental to all users in all environments. As I understand it, the passthrough scheme, being quite literal, prevents users from benefiting from client-side load balancing across all DNS results for an address. The dns scheme, despite the initial slowdown observed, supports such load balancing.

Will it work for you to use grpc/resolver.SetDefaultScheme("passthrough") to improve latency in your test (emulator) environment, and take a wait-and-see approach to evaluating performance in production environments such as GKE?

@toga4
Copy link
Contributor Author

toga4 commented Sep 28, 2024

@quartzmo Thank you for considering about this issue.

You’re absolutely right that some environments could benefit from the client-side load balancing benefits using the dns scheme.

We can take the approach you suggested. Additionaly, I’ve confirmed that using passthrough:///localhost:9010 instead of localhost:9010 in the SPANNER_EMULATOR_HOST environment variable ensures that the passthrough scheme is applied. We’ll proceed with this solution for now.

When using the emulator, I think the benefits of client-side load balancing are limited in most cases. It might be worth considering a change where the passthrough scheme is used by default when the emulator is detected and no scheme is specified in the endpoint configuration.

if emulatorAddr := os.Getenv("SPANNER_EMULATOR_HOST"); emulatorAddr != "" {
emulatorOpts := []option.ClientOption{
option.WithEndpoint(emulatorAddr),
option.WithGRPCDialOption(grpc.WithTransportCredentials(insecure.NewCredentials())),
option.WithoutAuthentication(),
internaloption.SkipDialSettingsValidation(),
}
opts = append(emulatorOpts, opts...)
}

That said, I understand that there are several places in the SDK source code where the emulator is detected, so implementing this change might require updates across multiple locations...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: spanner Issues related to the Spanner API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

No branches or pull requests

4 participants