Use a lower value for rpc_hold_timeout
in DNS server
#5929
Labels
thinking
More time is needed to research by the Consul Contributors
type/enhancement
Proposed improvement or new feature
Recently we were made aware of an incident involving a badly-behaved DNS client that got into a hot loop and hammered all the Consul servers (directly not via a client) taking the cluster down.
While this load is problematic in general, a particularly nasty failure mode in this case was that once the servers were configured with
max_stale = 5s
. Since the leader was also under the same onslaught it eventually lost leadership and so after 5 seconds, the other two followers start attempting to forward their requests to the leader.Since there is now no leader, the RPC gets stuck in our retry loop which retries for
rpc_hold_timeout
defaulting to 7 seconds (retries every ~200ms). As soon as this happened memory growth on all the servers ballooned even quicker since each single UDP packet sent as DNS was waiting 7 seconds (or often much more because of CPU saturation) holding goroutines and message buffers until it could eventually fail.The cluster was already down since all servers were overwhelmed with DNS requests. This also won't stop the memory growth since investigation shows that even rate limiting in our own DNS handler doesn't stop the UDP server from allocating for every inbound request which still grows memory faster than the rate limit can shed it and GC can reclaim it. The full fix would require deep changes inside
miekg/dns.Server
to allow rate limiting before goroutines or buffers are spawned.But it seems easy to improve the exacerbating issue of the 7 second timeouts. That retry loop is there to try to mask errors on client requests across leadership changes. At least in the case of DNS requests though it seems unlikely that any DNS client is going to wait that long and probably will have dropped the request already and issued a new one, so the resources are just wasted.
We should consider augmenting the RPC retry logic such that we can fail fast on DNS queries at least.
The text was updated successfully, but these errors were encountered: