Description
What version of gRPC are you using?
1.54.0
What version of Go are you using (go version)?
go1.20.2 linux/amd64
What operating system (Linux, Windows, …) and version?
Linux 6.2
Bug Report
Setting TCP_USER_TIMEOUT to 20 seconds (see #2307 and #5219), combined with Go's default TCP keepalive interval of just 15 seconds, results in a TCP connection being reset after a single lost keepalive packet.
Lost packets on a TCP connection are normally retransmitted after a short time, well within the 20-second timeout. TCP keepalive packets, however, are not retransmitted (ACK segments that contain no data are not reliably transmitted by TCP). The timeout is therefore reached after just a single lost keepalive packet.
Not retransmitting TCP keepalive packets is normally fine, because the connection is only reset after TCP_KEEPCNT (default 9) keepalive packets have been lost.
Test
tcpdump of a test gRPC connection (gRPC keepalives are disabled on the client to reduce the packet count; the same issue can be reproduced with default keepalives on both the server and the client):
- After packet 199, I dropped all traffic to port 8090:
    sudo iptables -I INPUT -p tcp --dport 8090 -j DROP
- Packet 200 is received by the client and answered in packet 201
- Packet 201 is visible in the tcpdump but is never received by the server
- Packet 202 is the server resetting the connection at the moment the next keepalive would have been sent
Increasing TCP_USER_TIMEOUT to 50 seconds results in the connection being reset only after 3 lost keepalive packets:
    grpc.KeepaliveParams(
        keepalive.ServerParameters{
            Timeout: 50 * time.Second,
        },
    ),
Note: in the snippet above, "keepalive" refers to the gRPC/HTTP2 keepalive mechanism and its timeout configuration, not to TCP keepalives.