gRPC connection closing with unexpected EOF on high-latency connections #5358

@brianneville

What version of gRPC are you using?

1.46.2

What version of Go are you using (go version)?

go version go1.18 darwin/amd64

What operating system (Linux, Windows, …) and version?

Client is running macOS:
ProductName: macOS
ProductVersion: 11.6
BuildVersion: 20G165

Server is running Linux (CentOS 7)

What did you do?

Note: I have posted a small repo that can be used to reproduce this bug, please check it out here:
https://github.com/brianneville/grpcbug


For high-latency connections, I noticed that RPCs were having their connection terminated with an internal error:
code = Internal desc = unexpected EOF

This affects both unary and streaming RPCs, and how easily it can be reproduced seems to vary with the client-server latency.
For high-latency connections (such as a 150ms ping), the error occurs pretty much every time an RPC is called.
For low-latency connections (<2ms ping) the error is much less frequent, and hundreds of thousands of messages may be streamed over the connection before it occurs even once.

I did a bit of digging through the grpc library code, and found two ways that the error can be prevented:

1. Client side - configure window sizes

The error can be prevented from the client side by disabling the dynamic window and BDP estimation for flow control when dialing the grpc server.
That is, setting the DialOptions on the client side to use:

    opts := []grpc.DialOption{
        grpc.WithInitialWindowSize(largerWindowSize),
        grpc.WithInitialConnWindowSize(largerWindowSize),
    }

Where both:

  • largerWindowSize is greater than 65535 (so that the dynamic window and BDP estimation are turned off)
  • largerWindowSize is greater than the size of the largest RPC response message (plus a bit of overhead, for some reason).
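
The two conditions above can be sketched as a small helper. This is only an illustration: pickWindowSize and headroomBytes are hypothetical names, and the amount of headroom needed is an assumption, since the exact overhead is not known. The 65535 constant is the HTTP/2 default initial window size.

```go
package main

import (
	"fmt"
	"math"
)

// defaultWindowSize is the HTTP/2 default initial flow-control window.
const defaultWindowSize = 65535

// pickWindowSize returns a window size that is strictly greater than both
// the HTTP/2 default and maxResponseBytes+headroomBytes.
// headroomBytes is a hypothetical safety margin, not a value taken from grpc-go.
func pickWindowSize(maxResponseBytes, headroomBytes int64) (int32, error) {
	need := maxResponseBytes + headroomBytes + 1
	if need <= defaultWindowSize {
		need = defaultWindowSize + 1
	}
	// HTTP/2 windows cannot exceed 2^31-1, which is also the int32 limit.
	if need > math.MaxInt32 {
		return 0, fmt.Errorf("window size %d exceeds the HTTP/2 maximum", need)
	}
	return int32(need), nil
}

func main() {
	// Assumed example: 1 MiB largest response, 4 KiB of headroom.
	size, err := pickWindowSize(1<<20, 4096)
	if err != nil {
		panic(err)
	}
	fmt.Println("largerWindowSize:", size)
}
```

The returned int32 could then be passed as largerWindowSize to both grpc.WithInitialWindowSize and grpc.WithInitialConnWindowSize.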

2. Server-side - delay ending the stream

The error can be prevented by delaying the calls that write the StatusOk into the transport. Specifically, the error does not occur if the END_STREAM header is put into the controlBuf later than it is now, at google.golang.org/grpc/internal/transport/http2_server.go#finishStream.
That is, if you make either of the following changes to *http2Server.finishStream, the error will not be present:

    time.Sleep(1 * time.Second) // <-- sleeping before put = EOF error is prevented
    t.controlBuf.put(hdr)

or

    go func() {         // <-- allowing finishStream to progress and sleep before put = EOF error is prevented
      time.Sleep(1 * time.Second)
      t.controlBuf.put(hdr)
    }()

When the t.controlBuf.put(hdr) line is delayed in this way, the RPC is allowed to complete normally, and the client will see the response as intended.

Note that if you add the sleep after the t.controlBuf.put(hdr), the error will still be present (i.e. delaying the finishStream function itself is not what prevents the error):

    t.controlBuf.put(hdr)
    time.Sleep(1 * time.Second) // <-- sleeping after put = EOF error still occurs 

Would anyone know what might be going on here, or be able to give me some insight/advice for continuing to debug this issue?

What did you expect to see?

Connection latency does not affect RPCs

What did you see instead?

High-latency connections (such as ~150ms ping) reliably hit the error code = Internal desc = unexpected EOF
