-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DNS timeouts cause no healthy upstream #9927
Comments
This has come up before, and agreed we should fix this to hold the previous values. cc @junr03 who is looking at this code right now to fix a similar issue. |
Thanks @chrisgoffinet for reporting this. We have run into this multiple times at Credit Karma when traffic to our kubedns instances is unbalanced. I was curious why no one ran into this earlier. |
Yeah definitely should fix. I have some mixed feelings about how to do so. @mattklein123 do you think we should fix by having the PendingQuery not call its callback_ if the address list is empty and the status code is not success? Or should this be handled on the callback receivers' side, i.e the cluster would not go from N hosts to 0 when the address list it resolves is empty. This relates to my line of thinking in my currently open PR #9899 (comment) |
I can take care of fixing once we decide on approach given I am now familiar with the code. |
IMO we should fix the callbacks to indicate that a timeout/error happened and let the caller deal with it, because otherwise we have to start handling retries within the DNS impl code itself, right? WDYT? This is the fix that I had wanted to do for quite some time, and I'm really surprised this has not been raised until now. As an aside, this will almost definitely end up being a long tail issue on Envoy Mobile so definitely worth fixing for your use case anyway. |
Yeah, that is the direction I was leaning in favor of in #9899 (comment). It leaves the DnsImpl code simpler, it informs the caller with clarity about what happened so the caller can decide to do, and it lets us write easier tests. I see those as wins. Ok. In that case we can work on landing #9899, and then I can do a subsequent PR that exposes status in the ResolveCb, and updates use cases to deal with it. |
Title: DNS timeouts cause no healthy upstream
[optional Relevant Links:]
I have a test case outlined here on how to reproduce this. I have verified this happens on master branch too.
Bug Template
Description:
In the event that DNS resolver has transient errors (i.e timeout) envoy currently doesn't check the c-ares status of
SUCCESS
before overriding the address_list with an empty array. We have observed in production cases where we have healthy hosts, and the next time a DNS query happens if it were to timeout, our service ends up going down.Repro steps:
Test case to reproduce. a simple iptables block on the DNS resolver will do.
https://github.com/chrisgoffinet/envoy-dns
Patch: chrisgoffinet@950c734
The text was updated successfully, but these errors were encountered: