-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dns: destroy/reinitialize c-ares channel on ARES_ECONNREFUSED #9899
Conversation
opening as draft until I write tests monday. cc @htuch as the original writer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for investigating and fixing this issue. Very excited to see this finally understood and resolved.
Whoops did not mean for you to review my half baked code. Thanks for the comments though, I agree with all of them @mattklein123. Will update and write tests Monday. |
Signed-off-by: Jose Nino <jnino@lyft.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mattklein123 updated with your comments
// The channel cannot be destroyed and reinitialized here because that leads to a c-ares | ||
// segfault. | ||
if (status == ARES_ECONNREFUSED) { | ||
parent_.dirty_channel_ = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this for a bit and debated between returning here, vs. allowing execution to continue and eventually call the callback_.
I ended up going for allowing the callback for a couple reasons:
- Ease of testing, as having the callback allows for nice event loop triggers and a way to test assertions on the address_list.
- It preserves previous behavior where every time resolution completed, whether successfully or not (with the exception of destruction), the callback_ was called. Obviously after fallbacks were exhausted, and if the query was not cancelled.
wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I agree we need to call the callback. As you already pointed out this is related to the other issue, so let's just fix the callbacks to expose errors in a future change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work investigating this and fixing! Very excited to see this finally understood.
// The channel cannot be destroyed and reinitialized here because that leads to a c-ares | ||
// segfault. | ||
if (status == ARES_ECONNREFUSED) { | ||
parent_.dirty_channel_ = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I agree we need to call the callback. As you already pointed out this is related to the other issue, so let's just fix the callbacks to expose errors in a future change.
// If that flag needs to be set, or c-ares changes its handling this test will need to be updated | ||
// to create another condition where c-ares invokes onAresGetAddrInfoCallback with status == | ||
// ARES_ECONNREFUSED. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed this is a little fragile but I think it's OK for now. Nice work figuring out a way to test this!
Bumping to include: - `dns: destroy/reinitialize c-ares channel on ARES_ECONNREFUSED`: envoyproxy/envoy#9899. This should resolve the issues we've been seeing with Envoy Mobile hanging on startup and never properly issuing requests if started in the offline state on iOS - `gzip: add force load factory declaration`: envoyproxy/envoy#9942. This will allow us to use the gzip filter with Envoy Mobile Signed-off-by: Michael Rebello <me@michaelrebello.com>
Bumping to include the following fixes: - `dns: destroy/reinitialize c-ares channel on ARES_ECONNREFUSED`: envoyproxy/envoy#9899. This should resolve the issues we've been seeing with Envoy Mobile hanging on startup and never properly issuing requests if started in the offline state on iOS. Fixes #672, though more improvements to DNS resolution will be investigated in #673 - `gzip: add force load factory declaration`: envoyproxy/envoy#9942. This will allow us to use the gzip filter with Envoy Mobile - `api listener: add shutdown method and call during server termination`: envoyproxy/envoy#9959. Fixes #667. Fixes #674 Includes the following PRs to fix iOS liveliness tests as well: - `metrics service: force link v2 config`: envoyproxy/envoy#9875 - `config: remove ApiTypeOracle assert`: envoyproxy/envoy#9973 Also contains necessary updates to accommodate [this change](envoyproxy/envoy@dea4eb0). Signed-off-by: Michael Rebello <me@michaelrebello.com>
Description:PendingResolutions get destroyed when complete or when c-ares sent ARES_EDESTRUCTION. Prior to #9899 ARES_EDESTRUCTION only happened when the resolver was destroyed. However, after #9899 the channel, and thus PendingResolutions can be destroyed, without the callback target being aware. This leads to potential use after free issues. This PR calls the resolve callback when the status received is ARES_EDESTRUCTION. Risk Level: med Testing: added test that reproduced segfault and passed after fix. Signed-off-by: Jose Nino <jnino@lyft.com>
Bumping to include the following fixes: - `dns: destroy/reinitialize c-ares channel on ARES_ECONNREFUSED`: #9899. This should resolve the issues we've been seeing with Envoy Mobile hanging on startup and never properly issuing requests if started in the offline state on iOS. Fixes envoyproxy/envoy-mobile#672, though more improvements to DNS resolution will be investigated in envoyproxy/envoy-mobile#673 - `gzip: add force load factory declaration`: #9942. This will allow us to use the gzip filter with Envoy Mobile - `api listener: add shutdown method and call during server termination`: #9959. Fixes envoyproxy/envoy-mobile#667. Fixes envoyproxy/envoy-mobile#674 Includes the following PRs to fix iOS liveliness tests as well: - `metrics service: force link v2 config`: #9875 - `config: remove ApiTypeOracle assert`: #9973 Also contains necessary updates to accommodate [this change](dea4eb0). Signed-off-by: Michael Rebello <me@michaelrebello.com> Signed-off-by: JP Simard <jp@jpsim.com>
Bumping to include the following fixes: - `dns: destroy/reinitialize c-ares channel on ARES_ECONNREFUSED`: #9899. This should resolve the issues we've been seeing with Envoy Mobile hanging on startup and never properly issuing requests if started in the offline state on iOS. Fixes envoyproxy/envoy-mobile#672, though more improvements to DNS resolution will be investigated in envoyproxy/envoy-mobile#673 - `gzip: add force load factory declaration`: #9942. This will allow us to use the gzip filter with Envoy Mobile - `api listener: add shutdown method and call during server termination`: #9959. Fixes envoyproxy/envoy-mobile#667. Fixes envoyproxy/envoy-mobile#674 Includes the following PRs to fix iOS liveliness tests as well: - `metrics service: force link v2 config`: #9875 - `config: remove ApiTypeOracle assert`: #9973 Also contains necessary updates to accommodate [this change](dea4eb0). Signed-off-by: Michael Rebello <me@michaelrebello.com> Signed-off-by: JP Simard <jp@jpsim.com>
Description: this PR adds logic to the DnsResolverImpl to destroy and re-initialize its c-ares channel under certain circumstances. A better option would require work in c-ares c-ares/c-ares#301.
Risk Level: med changes in low-level DNS resolution.
Testing: unit tests
Fixes #4543
Signed-off-by: Jose Nino jnino@lyft.com