Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a lower value for rpc_hold_timeout in DNS server #5929

Open
banks opened this issue Jun 6, 2019 · 0 comments
Open

Use a lower value for rpc_hold_timeout in DNS server #5929

banks opened this issue Jun 6, 2019 · 0 comments
Labels
thinking More time is needed to research by the Consul Contributors type/enhancement Proposed improvement or new feature

Comments

@banks
Copy link
Member

banks commented Jun 6, 2019

Recently we were made aware of an incident involving a badly-behaved DNS client that got into a hot loop and hammered all the Consul servers (directly not via a client) taking the cluster down.

While this load is problematic in general, a particularly nasty failure mode in this case was that once the servers were configured with max_stale = 5s. Since the leader was also under the same onslaught it eventually lost leadership and so after 5 seconds, the other two followers start attempting to forward their requests to the leader.

Since there is now no leader, the RPC gets stuck in our retry loop which retries for rpc_hold_timeout defaulting to 7 seconds (retries every ~200ms). As soon as this happened memory growth on all the servers ballooned even quicker since each single UDP packet sent as DNS was waiting 7 seconds (or often much more because of CPU saturation) holding goroutines and message buffers until it could eventually fail.

The cluster was already down since all servers were overwhelmed with DNS requests. This also won't stop the memory growth since investigation shows that even rate limiting in our own DNS handler doesn't stop the UDP server from allocating for every inbound request which still grows memory faster than the rate limit can shed it and GC can reclaim it. The full fix would require deep changes inside miekg/dns.Server to allow rate limiting before goroutines or buffers are spawned.

But it seems easy to improve the exacerbating issue of the 7 second timeouts. That retry loop is there to try to mask errors on client requests across leadership changes. At least in the case of DNS requests though it seems unlikely that any DNS client is going to wait that long and probably will have dropped the request already and issued a new one, so the resources are just wasted.

We should consider augmenting the RPC retry logic such that we can fail fast on DNS queries at least.

@banks banks added type/enhancement Proposed improvement or new feature thinking More time is needed to research by the Consul Contributors labels Jun 6, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
thinking More time is needed to research by the Consul Contributors type/enhancement Proposed improvement or new feature
Projects
None yet
Development

No branches or pull requests

1 participant