Description
NOTE: if you are reporting is a potential security vulnerability or a crash,
please follow our CVE process at
https://github.com/grpc/proposal/blob/master/P4-grpc-cve-process.md instead of
filing an issue here.
Please see the FAQ in our main README.md, then answer the questions below
before submitting your issue.
What version of gRPC are you using?
1.60.0
What version of Go are you using (go version
)?
1.22
What operating system (Linux, Windows, …) and version?
Ubuntu 22.04
What did you do?
Upgrade from v1.59.0 to v1.60.0 for grpc client service.
Clients now take 30 seconds to re-resolve upstream services on connection error (server shutting down).
What did you expect to see?
During deploy of upstream grpc server I expect to see no request failing due to delayed DNS resolve in the grpc client. New IP should instantly be resolved after first connection refused error: transport
.
Example (from v1.59.0):
2024/05/02 13:45:19 INFO: [core] [Channel #1 SubChannel #4] Subchannel Connectivity change to IDLE
2024/05/02 13:45:19 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[]}
2024/05/02 13:45:19 INFO: [core] [Channel #1] Channel Connectivity change to IDLE
2024/05/02 13:45:19 INFO: [core] [Channel #1 SubChannel #4] Subchannel Connectivity change to CONNECTING
2024/05/02 13:45:19 INFO: [core] [Channel #1 SubChannel #4] Subchannel picks a new address "10.0.0.20:8000" to connect
2024/05/02 13:45:19 INFO: [core] [Channel #1] Channel Connectivity change to CONNECTING
2024/05/02 13:45:19 WARNING: [core] [Channel #1 SubChannel #4] grpc: addrConn.createTransport failed to connect to {Addr: "10.0.0.20:8000", ServerName: "our.interal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.20:8000: connect: connection refused"
2024/05/02 13:45:19 INFO: [core] [Channel #1 SubChannel #4] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.20:8000: connect: connection refused"
2024/05/02 13:45:19 INFO: [core] [Channel #1] Channel Connectivity change to TRANSIENT_FAILURE
2024/05/02 13:45:19 INFO: [core] [Channel #1] Resolver state updated: {
"Addresses": [
{
"Addr": "10.0.0.40:8000",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Endpoints": [
{
"Addresses": [
{
"Addr": "10.0.0.40:8000",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Attributes": null
}
],
"ServiceConfig": null,
"Attributes": null
} ()
2024/05/02 13:45:19 INFO: [core] [Channel #1 SubChannel #7] Subchannel created
2024/05/02 13:45:19 INFO: [core] [Channel #1 SubChannel #4] Subchannel Connectivity change to SHUTDOWN
2024/05/02 13:45:19 INFO: [core] [Channel #1 SubChannel #4] Subchannel deleted
2024/05/02 13:45:19 INFO: [core] [Channel #1] Channel Connectivity change to IDLE
2024/05/02 13:45:19 INFO: [core] [Channel #1 SubChannel #7] Subchannel Connectivity change to CONNECTING
2024/05/02 13:45:19 INFO: [core] [Channel #1 SubChannel #7] Subchannel picks a new address "10.0.0.40:8000" to connect
2024/05/02 13:45:19 INFO: [core] [Channel #1] Channel Connectivity change to CONNECTING
2024/05/02 13:45:19 INFO: [core] [Channel #1 SubChannel #7] Subchannel Connectivity change to READY
2024/05/02 13:45:19 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[SubConn(id:7):{{Addr: "10.0.0.40:8000", ServerName: "", }}]}
2024/05/02 13:45:19 INFO: [core] [Channel #1] Channel Connectivity change to READY
What did you see instead?
Tested with v1.60.0
and v1.63.2
.
During deploy of upstream grpc server it's a 30 second delay between the first failing request until the new IP is resolved (see logs below).
I've tested this with both single and multiple instance setup for both grpc client and server. We've seen this error for services using a grpc client.
By testing each commit in v1.60.0
I found that the patch from PR #6668 was the first commit that caused this behavior.
Logs output from during deploy of upstream grpc server (v1.60.0
):
2024/05/02 12:42:42 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to IDLE
2024/05/02 12:42:42 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[]}
2024/05/02 12:42:42 INFO: [core] [Channel #1] Channel Connectivity change to IDLE
2024/05/02 12:42:42 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING
2024/05/02 12:42:42 INFO: [core] [Channel #1 SubChannel #2] Subchannel picks a new address "10.0.0.140:8000" to connect
2024/05/02 12:42:42 INFO: [core] [Channel #1] Channel Connectivity change to CONNECTING
2024/05/02 12:42:42 WARNING: [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "10.0.0.140:8000", ServerName: "our.interal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:42 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:42 INFO: [core] [Channel #1] Channel Connectivity change to TRANSIENT_FAILURE
2024/05/02 12:42:43 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:43 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING
2024/05/02 12:42:43 INFO: [core] [Channel #1 SubChannel #2] Subchannel picks a new address "10.0.0.140:8000" to connect
2024/05/02 12:42:43 WARNING: [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "10.0.0.140:8000", ServerName: "our.interal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:43 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:44 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:44 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING
2024/05/02 12:42:44 INFO: [core] [Channel #1 SubChannel #2] Subchannel picks a new address "10.0.0.140:8000" to connect
2024/05/02 12:42:44 WARNING: [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "10.0.0.140:8000", ServerName: "our.interal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:44 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:47 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:47 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING
2024/05/02 12:42:47 INFO: [core] [Channel #1 SubChannel #2] Subchannel picks a new address "10.0.0.140:8000" to connect
2024/05/02 12:42:47 WARNING: [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "10.0.0.140:8000", ServerName: "our.interal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:47 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:51 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:51 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING
2024/05/02 12:42:51 INFO: [core] [Channel #1 SubChannel #2] Subchannel picks a new address "10.0.0.140:8000" to connect
2024/05/02 12:42:51 WARNING: [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "10.0.0.140:8000", ServerName: "our.interal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:51 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:56 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:56 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING
2024/05/02 12:42:56 INFO: [core] [Channel #1 SubChannel #2] Subchannel picks a new address "10.0.0.140:8000" to connect
2024/05/02 12:43:12 INFO: [core] [Channel #1] Resolver state updated: {
"Addresses": [
{
"Addr": "10.0.0.160:8000",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Endpoints": [
{
"Addresses": [
{
"Addr": "10.0.0.160:8000",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Attributes": null
}
],
"ServiceConfig": null,
"Attributes": null
} ()
2024/05/02 12:43:12 INFO: [core] [Channel #1 SubChannel #4] Subchannel created
2024/05/02 12:43:12 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to SHUTDOWN
2024/05/02 12:43:12 INFO: [core] [Channel #1 SubChannel #2] Subchannel deleted
2024/05/02 12:43:12 INFO: [core] [Channel #1] Channel Connectivity change to IDLE
2024/05/02 12:43:12 INFO: [core] [Channel #1 SubChannel #4] Subchannel Connectivity change to CONNECTING
2024/05/02 12:43:12 INFO: [core] [Channel #1 SubChannel #4] Subchannel picks a new address "10.0.0.160:8000" to connect
2024/05/02 12:43:12 INFO: [core] [Channel #1] Channel Connectivity change to CONNECTING
2024/05/02 12:43:12 WARNING: [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "10.0.0.140:8000", ServerName: "our.interal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: operation was canceled"
2024/05/02 12:43:12 INFO: [core] [Channel #1 SubChannel #4] Subchannel Connectivity change to READY
2024/05/02 12:43:12 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[SubConn(id:4):{{Addr: "10.0.0.160:8000", ServerName: "", }}]}
2024/05/02 12:43:12 INFO: [core] [Channel #1] Channel Connectivity change to READY
Common client setup for reference:
keepAliveParameters := keepalive.ClientParameters{
Time: 10 * time.Second,
Timeout: time.Second,
PermitWithoutStream: true,
}
defaultServiceConfig := `{"loadBalancingConfig": [{"round_robin":{}}]}`
resolver.SetDefaultScheme("dns")
ctx := context.Background()
ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
defer cancel()
conn, err := grpc.DialContext(ctx,
"our.interal.domain",
grpc.WithTransportCredentials(insecure.NewCredentials()),
grpc.WithBlock(),
grpc.WithKeepaliveParams(grpchelper.keepAliveParameters),
grpc.WithDefaultServiceConfig(defaultServiceConfig),
)