Description
What version of gRPC are you using?
v1.60.1
What version of Go are you using (go version)?
1.21.6
What operating system (Linux, Windows, …) and version?
Ubuntu 22.04.3, macOS Ventura
What did you do/what did you see?
See reproduction here: https://github.com/sunjayBhatia/grpc-go-issue-repro/tree/main/shutdown-cleanup
Clone the repo above, change to the shutdown-cleanup directory, and run TEST_COUNT=10 make test or TEST_COUNT=10 make test-race (using a higher number of test runs ensures the issue shows up; it usually doesn't take more than a couple of runs, however).
The repro starts a simple server with a streaming RPC, starts a client connection, sends/receives a message, closes the client connection, stops the server, and exits. The server handler logs when streams are opened/closed. It is hooked up to the test logger, which panics if logs are written after the test ends. In this test we can see that the streaming handler is still running after the test ends: the panics come from the log lines StreamEcho error in Recv ... and StreamEcho ended.
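For reference, the streaming handler in the repro looks roughly like the sketch below. The authoritative code is in the linked repository; the EchoServer fields, the generated echo package identifiers, and the exact log wiring here are assumptions inferred from the log lines and stack traces in this report.

```go
package shutdowncleanup_test

import (
	"testing"

	"github.com/sunjayBhatia/grpc-go-issue-repro/shutdown-cleanup/echo"
)

// EchoServer hooks the handler's logging up to the test logger, so any log
// written after the test finishes triggers the testing package's
// "Log in goroutine after ... has completed" panic.
type EchoServer struct {
	echo.UnimplementedEchoServiceServer
	t *testing.T
}

// StreamEcho echoes each received message back and logs when the stream
// starts, when Recv fails, and when the handler returns. With the behavior
// described in this issue, the last two logs can fire after the test has
// already completed.
func (s *EchoServer) StreamEcho(stream echo.EchoService_StreamEchoServer) error {
	s.t.Log("StreamEcho started")
	defer s.t.Log("StreamEcho ended")
	for {
		msg, err := stream.Recv()
		if err != nil {
			s.t.Logf("StreamEcho error in Recv error: %v", err)
			return err
		}
		if err := stream.Send(msg); err != nil {
			return err
		}
	}
}
```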
Leaving the repro code as-is, you should see a failure like:
% TEST_COUNT=10 make test-race
go test -v -race -count=10 .
=== RUN TestShutdownCleanupStream
repro_test.go:64: test started
repro_test.go:72: server starting
repro_test.go:88: sending request
repro_test.go:20: StreamEcho started
repro_test.go:93: receiving response
repro_test.go:98: response: text:"foo"
repro_test.go:100: closing client connection
repro_test.go:103: stopping server
repro_test.go:106: server stop finished
repro_test.go:77: server exited
repro_test.go:109: test ended
--- PASS: TestShutdownCleanupStream (0.01s)
=== RUN TestShutdownCleanupStream
panic: Log in goroutine after TestShutdownCleanupStream has completed: StreamEcho error in Recv error: rpc error: code = Canceled desc = context canceled
panic: Log in goroutine after TestShutdownCleanupStream has completed: StreamEcho ended
goroutine 50 [running]:
testing.(*common).logDepth(0xc000218680, {0xc0000f0078, 0x11}, 0x3)
/opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1022 +0x4f8
testing.(*common).log(...)
/opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1004
testing.(*common).Log(0xc000218680, {0xc0000f93d8, 0x1, 0x1})
/opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1045 +0x74
panic({0x105068640?, 0xc0000b0580?})
/opt/homebrew/Cellar/go/1.21.6/libexec/src/runtime/panic.go:920 +0x26c
testing.(*common).logDepth(0xc000218680, {0xc0000a0780, 0x52}, 0x3)
/opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1022 +0x4f8
testing.(*common).log(...)
/opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1004
testing.(*common).Logf(0xc000218680, {0x104f36bc7, 0xc}, {0xc0000f9748, 0x2, 0x2})
/opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1055 +0x84
github.com/sunjayBhatia/grpc-go-issue-repro/shutdown-cleanup_test.(*EchoServer).StreamEcho.func1(...)
/Users/sunjayb/workspace/grpc-go-issue-repro/shutdown-cleanup/repro_test.go:24
github.com/sunjayBhatia/grpc-go-issue-repro/shutdown-cleanup_test.(*EchoServer).StreamEcho(0xc00003ae40, {0x105135ed8, 0xc0002861c0})
/Users/sunjayb/workspace/grpc-go-issue-repro/shutdown-cleanup/repro_test.go:37 +0x330
github.com/sunjayBhatia/grpc-go-issue-repro/shutdown-cleanup/echo._EchoService_StreamEcho_Handler({0x10509a540?, 0xc00003ae40}, {0x1051343c0?, 0xc0002be000})
/Users/sunjayb/workspace/grpc-go-issue-repro/shutdown-cleanup/echo/echo_grpc.pb.go:100 +0xb0
google.golang.org/grpc.(*Server).processStreamingRPC(0xc000174200, {0x1051333e8, 0xc000282270}, {0x1051360a8, 0xc0003081a0}, 0xc0002ae000, 0xc00021a8d0, 0x1054ab720, 0x0)
/Users/sunjayb/go/pkg/mod/google.golang.org/grpc@v1.60.1/server.go:1666 +0x16f4
google.golang.org/grpc.(*Server).handleStream(0xc000174200, {0x1051360a8, 0xc0003081a0}, 0xc0002ae000)
/Users/sunjayb/go/pkg/mod/google.golang.org/grpc@v1.60.1/server.go:1787 +0x12f0
google.golang.org/grpc.(*Server).serveStreams.func2.1()
/Users/sunjayb/go/pkg/mod/google.golang.org/grpc@v1.60.1/server.go:1016 +0xa0
created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 37
/Users/sunjayb/go/pkg/mod/google.golang.org/grpc@v1.60.1/server.go:1027 +0x1d8
FAIL github.com/sunjayBhatia/grpc-go-issue-repro/shutdown-cleanup 0.663s
FAIL
make: *** [test-race] Error 1
Looks like the goroutine spawned to run the stream handler is not waited for, so it may still be running even after we think the server has shut down (server.go line 1027 at commit 953d12a).
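As a possible test-side workaround (not part of the repro; the names below are hypothetical), a stream interceptor can track handler goroutines with a WaitGroup so the test can wait for them itself after Stop() returns:

```go
package shutdowncleanup_test

import (
	"sync"

	"google.golang.org/grpc"
)

// handlerWG counts stream handlers that are still running.
var handlerWG sync.WaitGroup

// trackStreamHandlers is a hypothetical stream interceptor: every handler
// invocation is wrapped in the WaitGroup, so the test can wait for handler
// goroutines itself after Stop()/GracefulStop() returns.
func trackStreamHandlers(srv interface{}, ss grpc.ServerStream, info *grpc.StreamServerInfo, handler grpc.StreamHandler) error {
	handlerWG.Add(1)
	defer handlerWG.Done()
	return handler(srv, ss)
}

// Usage in the test (sketch):
//
//	s := grpc.NewServer(grpc.StreamInterceptor(trackStreamHandlers))
//	// ... run the RPC, close the client connection ...
//	s.Stop()         // or s.GracefulStop()
//	handlerWG.Wait() // handlers (and their log calls) cannot outlive this
```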
Similarly, when using a fixed number of server stream workers (uncomment/comment the relevant lines in the repro code; a configuration sketch follows the analysis of this second trace):
% TEST_COUNT=10 make test-race
go test -v -race -count=10 .
=== RUN TestShutdownCleanupStream
repro_test.go:64: test started
repro_test.go:72: server starting
repro_test.go:88: sending request
repro_test.go:93: receiving response
repro_test.go:20: StreamEcho started
repro_test.go:98: response: text:"foo"
repro_test.go:100: closing client connection
repro_test.go:103: stopping server
repro_test.go:106: server stop finished
repro_test.go:109: test ended
--- PASS: TestShutdownCleanupStream (0.00s)
=== RUN TestShutdownCleanupStream
repro_test.go:64: test started
panic: Log in goroutine after TestShutdownCleanupStream has completed: StreamEcho error in Recv error: rpc error: code = Canceled desc = context canceled
panic: Log in goroutine after TestShutdownCleanupStream has completed: StreamEcho ended
goroutine 20 [running]:
testing.(*common).logDepth(0xc000183520, {0xc00028e120, 0x11}, 0x3)
/opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1022 +0x4f8
testing.(*common).log(...)
/opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1004
testing.(*common).Log(0xc000183520, {0xc00022d378, 0x1, 0x1})
/opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1045 +0x74
panic({0x104d5c660?, 0xc00028a8b0?})
/opt/homebrew/Cellar/go/1.21.6/libexec/src/runtime/panic.go:920 +0x26c
testing.(*common).logDepth(0xc000183520, {0xc000296960, 0x52}, 0x3)
/opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1022 +0x4f8
testing.(*common).log(...)
/opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1004
testing.(*common).Logf(0xc000183520, {0x104c2acc7, 0xc}, {0xc00022d6e8, 0x2, 0x2})
/opt/homebrew/Cellar/go/1.21.6/libexec/src/testing/testing.go:1055 +0x84
github.com/sunjayBhatia/grpc-go-issue-repro/shutdown-cleanup_test.(*EchoServer).StreamEcho.func1(...)
/Users/sunjayb/workspace/grpc-go-issue-repro/shutdown-cleanup/repro_test.go:24
github.com/sunjayBhatia/grpc-go-issue-repro/shutdown-cleanup_test.(*EchoServer).StreamEcho(0xc0001a2e20, {0x104e29f78, 0xc0003821c0})
/Users/sunjayb/workspace/grpc-go-issue-repro/shutdown-cleanup/repro_test.go:37 +0x330
github.com/sunjayBhatia/grpc-go-issue-repro/shutdown-cleanup/echo._EchoService_StreamEcho_Handler({0x104d8e560?, 0xc0001a2e20}, {0x104e28460?, 0xc0003b4000})
/Users/sunjayb/workspace/grpc-go-issue-repro/shutdown-cleanup/echo/echo_grpc.pb.go:100 +0xb0
google.golang.org/grpc.(*Server).processStreamingRPC(0xc00025c200, {0x104e27488, 0xc00038c210}, {0x104e2a148, 0xc00008a9c0}, 0xc0003ae000, 0xc0002688d0, 0x10519f720, 0x0)
/Users/sunjayb/go/pkg/mod/google.golang.org/grpc@v1.60.1/server.go:1666 +0x16f4
google.golang.org/grpc.(*Server).handleStream(0xc00025c200, {0x104e2a148, 0xc00008a9c0}, 0xc0003ae000)
/Users/sunjayb/go/pkg/mod/google.golang.org/grpc@v1.60.1/server.go:1787 +0x12f0
google.golang.org/grpc.(*Server).serveStreams.func2.1()
/Users/sunjayb/go/pkg/mod/google.golang.org/grpc@v1.60.1/server.go:1016 +0xa0
google.golang.org/grpc.(*Server).serverWorker(0xc00025c200)
/Users/sunjayb/go/pkg/mod/google.golang.org/grpc@v1.60.1/server.go:618 +0x38
created by google.golang.org/grpc.(*Server).initServerWorkers in goroutine 19
/Users/sunjayb/go/pkg/mod/google.golang.org/grpc@v1.60.1/server.go:631 +0x1f0
FAIL github.com/sunjayBhatia/grpc-go-issue-repro/shutdown-cleanup 0.511s
FAIL
make: *** [test-race] Error 1
It looks like the server worker goroutines are also left running without full coordination on shutdown (server.go line 631 at commit 953d12a).
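For context, the fixed-worker variant is presumably enabled with the experimental NumStreamWorkers server option, something like the sketch below; the worker count here is an arbitrary assumption and not necessarily what the repro uses.

```go
package shutdowncleanup_test

import (
	"runtime"

	"google.golang.org/grpc"
)

// newServerWithWorkers sketches the fixed-worker configuration: with the
// experimental NumStreamWorkers option, stream handlers run on the long-lived
// goroutines created by initServerWorkers (the serverWorker frames in the
// trace above) instead of one goroutine per stream.
func newServerWithWorkers(opts ...grpc.ServerOption) *grpc.Server {
	opts = append(opts, grpc.NumStreamWorkers(uint32(runtime.NumCPU())))
	return grpc.NewServer(opts...)
}
```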
This issue showed up in Contour CI after we bumped from grpc-go v1.59.0 to v1.60.0+.
I've tracked this issue via git bisect down to #6489, which looks like it changed how coordination of stream handlers is set up. Connection draining/closing appears to work as expected, but stream handlers are not waited for.
What did you expect to see?
When the server is stopped (gracefully or otherwise), it should wait for all stream handlers to finish before returning.
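Sketched against the repro's test flow (setup elided; the names are placeholders, not the repro's actual code), this is the ordering I'd expect to hold:

```go
package shutdowncleanup_test

import (
	"testing"

	"google.golang.org/grpc"
)

// expectedShutdown sketches the ordering this issue asks for: by the time
// Stop or GracefulStop returns, every stream handler has returned, so no
// handler can log via t after the test body finishes.
func expectedShutdown(t *testing.T, conn *grpc.ClientConn, srv *grpc.Server) {
	_ = conn.Close()
	srv.Stop()          // or srv.GracefulStop()
	t.Log("test ended") // nothing from StreamEcho should log after this line
}
```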