Use a lower value for `rpc_hold_timeout` in DNS server #5929

banks · 2019-06-06T15:23:15Z

Recently we were made aware of an incident in which a badly-behaved DNS client got into a hot loop and hammered all the Consul servers (directly, not via a client agent), taking the cluster down.

While this load is problematic in general, a particularly nasty failure mode in this case arose because the servers were configured with `max_stale = 5s`. Since the leader was under the same onslaught, it eventually lost leadership, and so after 5 seconds the two followers started attempting to forward their requests to the leader.

Since there was now no leader, each RPC got stuck in our retry loop, which retries every ~200ms for `rpc_hold_timeout` (7 seconds by default). As soon as this happened, memory growth on all the servers ballooned even more quickly, since every single UDP packet sent as a DNS query waited 7 seconds (often much longer because of CPU saturation), holding goroutines and message buffers until it could eventually fail.

To be clear, the cluster was already down at this point since all servers were overwhelmed with DNS requests. Lowering `rpc_hold_timeout` also won't stop the memory growth by itself: investigation shows that even rate limiting in our own DNS handler doesn't stop the UDP server from allocating for every inbound request, which still grows memory faster than the rate limit can shed load and GC can reclaim it. The full fix would require deep changes inside miekg/dns.Server to allow rate limiting before goroutines are spawned or buffers allocated.

But it seems easy to improve the exacerbating issue of the 7 second timeouts. That retry loop is there to mask errors on client requests across leadership changes. For DNS requests at least, though, it seems unlikely that any DNS client is going to wait that long; it will probably have dropped the request already and issued a new one, so the resources are just wasted.

We should consider augmenting the RPC retry logic such that we can fail fast on DNS queries at least.


banks added the type/enhancement (Proposed improvement or new feature) and thinking (More time is needed to research by the Consul Contributors) labels on Jun 6, 2019
