DNS timeouts cause no healthy upstream #9927

chrisgoffinet · 2020-02-04T19:04:49Z

Title: DNS timeouts cause no healthy upstream

[optional Relevant Links:]
I have a test case that reproduces this (see the repro steps below). I have verified that this happens on the master branch too.

Bug Template

Description:
In the event that the DNS resolver has transient errors (e.g., a timeout), Envoy currently doesn't check that the c-ares status is SUCCESS before overwriting the address_list with an empty array. We have observed cases in production where we have healthy hosts, the next DNS query times out, and our service ends up going down.
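The failure mode described above can be illustrated with a minimal sketch. The names `QueryStatus`, `Cluster`, `onDnsResult`, and `hostsLeftAfterTimeout` are hypothetical and are not Envoy's or c-ares's actual types; the sketch only assumes that a failed query hands the receiver an empty address list.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the resolver's result code (not the real c-ares constants).
enum class QueryStatus { Success, Timeout };

// Hypothetical cluster that mirrors whatever the last DNS query returned.
struct Cluster {
  std::vector<std::string> hosts;

  // Buggy receiver: the status is ignored, so a timeout (which carries an
  // empty address list) replaces the healthy hosts with nothing.
  void onDnsResult(QueryStatus /*status*/, std::vector<std::string> addresses) {
    hosts = std::move(addresses);
  }
};

// Replays the reported sequence and returns how many hosts survive it.
std::size_t hostsLeftAfterTimeout() {
  Cluster cluster;
  cluster.onDnsResult(QueryStatus::Success, {"10.0.0.1", "10.0.0.2"});
  assert(cluster.hosts.size() == 2);  // healthy hosts are present
  cluster.onDnsResult(QueryStatus::Timeout, {});
  return cluster.hosts.size();
}
```

In this sketch `hostsLeftAfterTimeout()` returns 0: one transient timeout is enough to leave the cluster with no upstream hosts.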

Repro steps:
Test case to reproduce; a simple iptables block on the DNS resolver will do.

https://github.com/chrisgoffinet/envoy-dns
Patch: chrisgoffinet@950c734


mattklein123 · 2020-02-04T19:43:32Z

This has come up before, and agreed we should fix this to hold the previous values. cc @junr03 who is looking at this code right now to fix a similar issue.

thedebugger · 2020-02-04T19:52:45Z

Thanks @chrisgoffinet for reporting this. We have run into this multiple times at Credit Karma when traffic to our kubedns instances is unbalanced. I was curious why no one ran into this earlier.

junr03 · 2020-02-05T18:35:33Z

> agreed we should fix this to hold the previous values

Yeah, we should definitely fix this. I have some mixed feelings about how to do so. @mattklein123 do you think we should fix it by having the PendingQuery not call its callback_ if the address list is empty and the status code is not success? Or should this be handled on the callback receivers' side, i.e., the cluster would not go from N hosts to 0 when the address list it resolves is empty?

This relates to my line of thinking in my currently open PR #9899 (comment)
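The two options raised in this comment can be sketched side by side. This is only an illustration under assumptions: `finishQueryResolverSide`, `Cluster`, `onResolve`, and the callback shape are hypothetical and do not reflect the real `PendingQuery` or cluster code.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using Addresses = std::vector<std::string>;
using Callback = std::function<void(Addresses)>;

// Option A (hypothetical): the resolver swallows failed queries, so the
// receiver is never told that anything happened.
void finishQueryResolverSide(bool success, Addresses addresses, const Callback& cb) {
  if (!success && addresses.empty()) {
    return;  // callback suppressed
  }
  cb(std::move(addresses));
}

// Option B (hypothetical): the resolver always calls back, and the receiver
// refuses to go from N hosts to 0 when the resolved list is empty.
struct Cluster {
  Addresses hosts;

  void onResolve(Addresses addresses) {
    if (addresses.empty() && !hosts.empty()) {
      return;  // keep the previous hosts
    }
    hosts = std::move(addresses);
  }
};
```

Both variants keep the previous hosts across a timeout. In option A the caller gets no signal that the query failed, and in option B as sketched a failed query and a genuinely empty answer look the same to the receiver.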

junr03 · 2020-02-05T18:36:19Z

I can take care of fixing once we decide on approach given I am now familiar with the code.

mattklein123 · 2020-02-05T20:25:48Z

> Or should this be handled on the callback receivers' side, i.e., the cluster would not go from N hosts to 0 when the address list it resolves is empty.

IMO we should fix the callbacks to indicate that a timeout/error happened and let the caller deal with it, because otherwise we have to start handling retries within the DNS impl code itself, right? WDYT? This is the fix that I had wanted to do for quite some time, and I'm really surprised this has not been raised until now. As an aside, this will almost definitely end up being a long tail issue on Envoy Mobile so definitely worth fixing for your use case anyway.

junr03 · 2020-02-05T20:44:06Z

Yeah, that is the direction I was leaning toward in #9899 (comment). It leaves the DnsImpl code simpler, it tells the caller clearly what happened so the caller can decide what to do, and it lets us write easier tests. I see those as wins.

Ok. In that case we can work on landing #9899, and then I can do a subsequent PR that exposes status in the ResolveCb, and updates use cases to deal with it.

Signed-off-by: Jose Nino <jnino@lyft.com>
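The agreed direction, a callback that reports status and leaves the decision to the caller, might look roughly like the sketch below. The names `ResolutionStatus` and `ResolveCb` come from this thread; the enumerators, the callback signature, and the `Cluster` type with its `makeCallback` helper are assumptions made for illustration and should not be read as the merged implementation.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Name taken from the thread; the enumerators are assumed.
enum class ResolutionStatus { Success, Failure };

using Addresses = std::vector<std::string>;
// Assumed callback shape: status first, then whatever addresses were resolved.
using ResolveCb = std::function<void(ResolutionStatus, Addresses)>;

// Hypothetical caller that decides for itself how to treat a failure.
struct Cluster {
  Addresses hosts;
  int failed_resolutions = 0;

  ResolveCb makeCallback() {
    return [this](ResolutionStatus status, Addresses addresses) {
      if (status != ResolutionStatus::Success) {
        ++failed_resolutions;  // e.g. retry sooner; the hosts stay as they are
        return;
      }
      hosts = std::move(addresses);  // a successful empty answer is honored
    };
  }
};
```

With the status passed through, the resolver needs no retry logic of its own, and the caller can tell a timeout apart from a successful lookup that returned no addresses.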

mattklein123 added area/dns bug help wanted Needs help! labels Feb 4, 2020

junr03 self-assigned this Feb 5, 2020

junr03 removed the help wanted Needs help! label Feb 5, 2020

mattklein123 added this to the 1.14.0 milestone Feb 5, 2020

junr03 mentioned this issue Feb 10, 2020

dns: faster recovery after no connectivity envoyproxy/envoy-mobile#673

Closed

junr03 mentioned this issue Feb 21, 2020

dns: introduce ResolutionStatus for ResolveCb and fix #9927 #10137

Merged

mattklein123 closed this as completed in #10137 Feb 28, 2020

mattklein123 pushed a commit that referenced this issue Feb 28, 2020

dns: introduce ResolutionStatus for ResolveCb and fix #9927 (#10137)

64d91b2

Signed-off-by: Jose Nino <jnino@lyft.com>
