Skip to content

Conversation

@SolidWallOfCode
Copy link
Member

@SolidWallOfCode SolidWallOfCode commented May 15, 2021

In situations where DNS is lightly used, or when there is only one FQDN requested, if a nameserver fails there is no failover to another nameserver.

The underlying cause is recovery depends on actions triggered from DNSHandler::mainEvent. This means if a nameserver fails on a request, it takes another request for a different FQDN to trigger recovery. In a sidecar situation where only one FQDN is every requested, new requests for that FQDN stack up on the collapsing queue for the failed nameserver and ATS appears to become unresponsive. This fix makes sure that if there is a nameserver failure, DNSHandler::mainEvent is invoked at least once after the failure. If that happens, recovery becomes self sustaining recovery actions cause subsequent invocations of DNSHandler::mainEvent.

This is probably a fix for #4514.

@SolidWallOfCode SolidWallOfCode added this to the 10.0.0 milestone May 15, 2021
@SolidWallOfCode SolidWallOfCode self-assigned this May 15, 2021
@SolidWallOfCode SolidWallOfCode marked this pull request as draft May 17, 2021 21:51
@SolidWallOfCode
Copy link
Member Author

Something is still not quite right. Further testing reveals some oddities I need to track down. At this point I think the problem is that in the case where the nameservers are not round robin, a down nameserver is never actually marked down.

@SolidWallOfCode
Copy link
Member Author

Also, timeouts don't seem to be working. The DNS layer never times out the request and the HostDB layer just keeps stacking up the queue for that host. AFAICT it's the generic inactivity timeout that terminates the connection.

@SolidWallOfCode SolidWallOfCode marked this pull request as ready for review June 1, 2021 14:50
@SolidWallOfCode
Copy link
Member Author

I think I've tracked down all the required changes. Fundamentally the issue was

  1. In the non-RR case, the failed nameserver was not marked as down, and so HostDB requests for the same FQDN would pile up without triggering nameserver failover.
  2. The failover checks would indicate no need for failover even if the the namserver was marked down.

As minor other cleanups

  • The incorrect use of schedule_at was replaced by schedule_in.
  • The down flag is cleared after successfully adding the nameserver connection to epoll, not before.
  • Rate limiting for rescheduling mainEvent for retries was added, otherwise the number of scheduled events grew gradually but without apparent bound.

cur_con.num = icon;
Debug("dns", "opening connection %s SUCCEEDED for %d", ip_text, icon);
cur_con.num = icon;
ns_down[icon] = 0;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note - don't clear the down flag if this nameserver connection isn't in the epoll.

@bryancall bryancall added the Bug label Jun 25, 2021
@bryancall bryancall requested a review from randall June 25, 2021 17:01
@bryancall
Copy link
Contributor

@randall is going to take a look at this.

@bryancall
Copy link
Contributor

@SolidWallOfCode is going to pull this into our internal tree on the 9.1.x branch.

@SolidWallOfCode
Copy link
Member Author

Looks some one else might be seeing this - Nir Finkel on the chat. He's currently going to do some tests to hopefully verify this is the problem and then maybe try out this fix.

@bryancall
Copy link
Contributor

[approve ci]

1 similar comment
@randall
Copy link
Contributor

randall commented Nov 11, 2021

[approve ci]

@ezelkow1
Copy link
Member

ezelkow1 commented Feb 8, 2022

[approve ci]

@apache apache deleted a comment from bryancall Feb 8, 2022
Copy link
Contributor

@brbzull0 brbzull0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@SolidWallOfCode SolidWallOfCode merged commit 2fbec37 into apache:master Feb 8, 2022
bneradt pushed a commit to bneradt/trafficserver that referenced this pull request Feb 10, 2022
apache#7843)"

This reverts commit 2fbec37.

I'm reverting this to test whether this commit introduced CI regression
test instability.
zwoop pushed a commit that referenced this pull request Feb 17, 2022
@zwoop
Copy link
Contributor

zwoop commented Feb 17, 2022

Cherry-picked to v9.2.x

zwoop pushed a commit that referenced this pull request Feb 17, 2022
@zwoop zwoop removed this from the 10.0.0 milestone Feb 17, 2022
@zwoop
Copy link
Contributor

zwoop commented Feb 17, 2022

This was reverted in 92d238a

bneradt pushed a commit to bneradt/trafficserver that referenced this pull request Feb 19, 2022
@SolidWallOfCode SolidWallOfCode added this to the 9.2.0 milestone Feb 24, 2022
moonchen pushed a commit to moonchen/trafficserver that referenced this pull request Mar 17, 2022
* asf/9.2.x:
  Revert "DNS: Fix lack of nameserver failover in low use circumstances. (apache#7843)" (apache#8663)
  Fix strategies to initialize scheme (apache#8650)
  DNS: Fix lack of nameserver failover in low use circumstances. (apache#7843)
  Cleanup strategy debug logs (apache#8656)
  Making 9.2.x backwards compatible with 9.1.x (apache#8661)
  Adds two overridable config variables to control parent mark downs. (apache#8595)
  Fix plugin parent_select missing hostname len (apache#8649)
  Ports apache#7925 apache#8365 core to parent_select plugin (apache#8590)
  Ports apache#7897 from core strategies to parent_select plugin. (apache#8580)
  Adding clangd language server files to .gitignore (apache#8640)
  Make TsSharedMutex.h compile on MacOS. (apache#8645)
  In TsSharedMutex.h, make error reporting thread-safe. (apache#8636)
  Revert "body factory does not respect runroot (apache#8388)" (apache#8654)
  doc: Convert miscased Traffic Server references to |TS| macro (apache#8543)
  Add a new --enable-event-tracker configure option (apache#8179)
  Add parent_select plugin strategy caching (apache#8651)
  TLS Session Resumption: fix timed out session (apache#8667)
  Fix to allow running  from outside top_srcdir (apache#8673)
  Send diags output to stderr when running regression tests. (apache#8662)
  Default proxy.config.http.strict_uri_parsing to "2" (apache#8632)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants