-
Notifications
You must be signed in to change notification settings - Fork 851
DNS: Fix lack of nameserver failover in low use circumstances. #7843
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DNS: Fix lack of nameserver failover in low use circumstances. #7843
Conversation
|
Something is still not quite right. Further testing reveals some oddities I need to track down. At this point I think the problem is that in the case where the nameservers are not round robin, a down nameserver is never actually marked down. |
|
Also, timeouts don't seem to be working. The DNS layer never times out the request and the HostDB layer just keeps stacking up the queue for that host. AFAICT it's the generic inactivity timeout that terminates the connection. |
7364d1e to
d491f50
Compare
d491f50 to
9ad33fd
Compare
|
I think I've tracked down all the required changes. Fundamentally the issue was
As minor other cleanups
|
| cur_con.num = icon; | ||
| Debug("dns", "opening connection %s SUCCEEDED for %d", ip_text, icon); | ||
| cur_con.num = icon; | ||
| ns_down[icon] = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note - don't clear the down flag if this nameserver connection isn't in the epoll.
|
@randall is going to take a look at this. |
|
@SolidWallOfCode is going to pull this into our internal tree on the 9.1.x branch. |
|
Looks some one else might be seeing this - Nir Finkel on the chat. He's currently going to do some tests to hopefully verify this is the problem and then maybe try out this fix. |
|
[approve ci] |
1 similar comment
|
[approve ci] |
9ad33fd to
cb76fa5
Compare
|
[approve ci] |
brbzull0
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
apache#7843)" This reverts commit 2fbec37. I'm reverting this to test whether this commit introduced CI regression test instability.
(cherry picked from commit 2fbec37)
|
Cherry-picked to v9.2.x |
|
This was reverted in 92d238a |
…mstances. (apache#7843)" (apache#8663)" This reverts commit 92d238a.
* asf/9.2.x: Revert "DNS: Fix lack of nameserver failover in low use circumstances. (apache#7843)" (apache#8663) Fix strategies to initialize scheme (apache#8650) DNS: Fix lack of nameserver failover in low use circumstances. (apache#7843) Cleanup strategy debug logs (apache#8656) Making 9.2.x backwards compatible with 9.1.x (apache#8661) Adds two overridable config variables to control parent mark downs. (apache#8595) Fix plugin parent_select missing hostname len (apache#8649) Ports apache#7925 apache#8365 core to parent_select plugin (apache#8590) Ports apache#7897 from core strategies to parent_select plugin. (apache#8580) Adding clangd language server files to .gitignore (apache#8640) Make TsSharedMutex.h compile on MacOS. (apache#8645) In TsSharedMutex.h, make error reporting thread-safe. (apache#8636) Revert "body factory does not respect runroot (apache#8388)" (apache#8654) doc: Convert miscased Traffic Server references to |TS| macro (apache#8543) Add a new --enable-event-tracker configure option (apache#8179) Add parent_select plugin strategy caching (apache#8651) TLS Session Resumption: fix timed out session (apache#8667) Fix to allow running from outside top_srcdir (apache#8673) Send diags output to stderr when running regression tests. (apache#8662) Default proxy.config.http.strict_uri_parsing to "2" (apache#8632)
In situations where DNS is lightly used, or when there is only one FQDN requested, if a nameserver fails there is no failover to another nameserver.
The underlying cause is recovery depends on actions triggered from
DNSHandler::mainEvent. This means if a nameserver fails on a request, it takes another request for a different FQDN to trigger recovery. In a sidecar situation where only one FQDN is every requested, new requests for that FQDN stack up on the collapsing queue for the failed nameserver and ATS appears to become unresponsive. This fix makes sure that if there is a nameserver failure,DNSHandler::mainEventis invoked at least once after the failure. If that happens, recovery becomes self sustaining recovery actions cause subsequent invocations ofDNSHandler::mainEvent.This is probably a fix for #4514.