Description
What version of gRPC are you using?
1.46.2
What version of Go are you using (go version)?
go version go1.18 darwin/amd64
What operating system (Linux, Windows, …) and version?
Client is running macOS:
ProductName: macOS
ProductVersion: 11.6
BuildVersion: 20G165
Server is running Linux (CentOS 7)
What did you do?
Note: I have posted a small repo that can be used to reproduce this bug. Please check it out here:
https://github.com/brianneville/grpcbug
For high-latency connections, I noticed that RPCs were having their connection terminated with an internal error:
code = Internal desc = unexpected EOF
This hits on both unary and streaming RPCs, and the ease with which this can be reproduced seems to vary with the client-server latency.
For high-latency connections (such as 150ms ping), this error hits pretty much every time an RPC is called.
For low-latency connections (<2ms ping) this error is much more infrequent, and hundreds of thousands of messages may be streamed over this connection before the error is hit even once.
I did a bit of digging through the grpc library code, and found two ways that the error can be prevented:
1. Client side - configure window sizes
The error can be prevented from the client side by disabling the dynamic window and BDP estimation for flow control when dialing the grpc server.
That is, setting the DialOptions on the client side to use:
opts := []grpc.DialOption{
    grpc.WithInitialWindowSize(largerWindowSize),
    grpc.WithInitialConnWindowSize(largerWindowSize),
}
Where both:
largerWindowSize is greater than 65535 (so that dynamic window/flow estimation is turned off)
largerWindowSize is greater than the size of the largest RPC response messages (with a bit of overhead for some reason).
2. Server-side - delay ending the stream
The error can be prevented by delaying the calls that write the StatusOk into the transport. Specifically, the error goes away if the END_STREAM header is delayed from being put into the controlBuf at google.golang.org/grpc/internal/transport/http2_server.go#finishStream.
That is, you can make either of the following changes to *http2Server.finishStream, and the error will not be present:
time.Sleep(1 * time.Second) // <-- sleeping before put = EOF error is prevented
t.controlBuf.put(hdr)
or
go func() { // <-- allowing finishStream to progress and sleep before put = EOF error is prevented
    time.Sleep(1 * time.Second)
    t.controlBuf.put(hdr)
}()
When the t.controlBuf.put(hdr) line is delayed in this way, the RPC is allowed to complete normally, and the client will see the response as intended.
Note that if you add the sleep after the t.controlBuf.put(hdr) call, the error will still be present (i.e. delaying the finishStream function itself is not what prevents the error):
t.controlBuf.put(hdr)
time.Sleep(1 * time.Second) // <-- sleeping after put = EOF error still occurs
Would anyone know what might be going on here, or be able to give me some insight/advice for continuing to debug this issue?
What did you expect to see?
Connection latency does not affect RPCs
What did you see instead?
High-latency connections (such as ~150ms ping) reliably hit the error: code = Internal desc = unexpected EOF