Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DNS timeouts cause no healthy upstream #9927

Closed
chrisgoffinet opened this issue Feb 4, 2020 · 6 comments · Fixed by #10137
Closed

DNS timeouts cause no healthy upstream #9927

chrisgoffinet opened this issue Feb 4, 2020 · 6 comments · Fixed by #10137
Assignees
Milestone

Comments

@chrisgoffinet
Copy link

Title: DNS timeouts cause no healthy upstream

[optional Relevant Links:]
I have a test case outlined here on how to reproduce this. I have verified this happens on master branch too.

Bug Template

Description:
In the event that DNS resolver has transient errors (i.e timeout) envoy currently doesn't check the c-ares status of SUCCESS before overriding the address_list with an empty array. We have observed in production cases where we have healthy hosts, and the next time a DNS query happens if it were to timeout, our service ends up going down.

Repro steps:
Test case to reproduce. a simple iptables block on the DNS resolver will do.

https://github.com/chrisgoffinet/envoy-dns
Patch: chrisgoffinet@950c734

@mattklein123
Copy link
Member

This has come up before, and agreed we should fix this to hold the previous values. cc @junr03 who is looking at this code right now to fix a similar issue.

@thedebugger
Copy link

Thanks @chrisgoffinet for reporting this. We have run into this multiple times at Credit Karma when traffic to our kubedns instances is unbalanced. I was curious why no one ran into this earlier.

@junr03
Copy link
Member

junr03 commented Feb 5, 2020

agreed we should fix this to hold the previous values

Yeah definitely should fix. I have some mixed feelings about how to do so. @mattklein123 do you think we should fix by having the PendingQuery not call its callback_ if the address list is empty and the status code is not success? Or should this be handled on the callback receivers' side, i.e the cluster would not go from N hosts to 0 when the address list it resolves is empty.

This relates to my line of thinking in my currently open PR #9899 (comment)

@junr03
Copy link
Member

junr03 commented Feb 5, 2020

I can take care of fixing once we decide on approach given I am now familiar with the code.

@junr03 junr03 self-assigned this Feb 5, 2020
@junr03 junr03 removed the help wanted Needs help! label Feb 5, 2020
@mattklein123 mattklein123 added this to the 1.14.0 milestone Feb 5, 2020
@mattklein123
Copy link
Member

Or should this be handled on the callback receivers' side, i.e the cluster would not go from N hosts to 0 when the address list it resolves is empty.

IMO we should fix the callbacks to indicate that a timeout/error happened and let the caller deal with it, because otherwise we have to start handling retries within the DNS impl code itself, right? WDYT? This is the fix that I had wanted to do for quite some time, and I'm really surprised this has not been raised until now. As an aside, this will almost definitely end up being a long tail issue on Envoy Mobile so definitely worth fixing for your use case anyway.

@junr03
Copy link
Member

junr03 commented Feb 5, 2020

Yeah, that is the direction I was leaning in favor of in #9899 (comment). It leaves the DnsImpl code simpler, it informs the caller with clarity about what happened so the caller can decide to do, and it lets us write easier tests. I see those as wins.

Ok. In that case we can work on landing #9899, and then I can do a subsequent PR that exposes status in the ResolveCb, and updates use cases to deal with it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants