Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dns: introduce ResolutionStatus for ResolveCb and fix #9927 #10137

Merged
merged 15 commits into from
Feb 28, 2020
3 changes: 0 additions & 3 deletions api/envoy/api/v2/cluster.proto
Original file line number Diff line number Diff line change
Expand Up @@ -643,9 +643,6 @@ message Cluster {
// other than :ref:`STRICT_DNS<envoy_api_enum_value_Cluster.DiscoveryType.STRICT_DNS>` and
// :ref:`LOGICAL_DNS<envoy_api_enum_value_Cluster.DiscoveryType.LOGICAL_DNS>` this setting is
// ignored.
//
// Note: Currently, DNS failures and empty DNS responses are not treated differently and this
// configuration is applied in both situations.
RefreshRate dns_failure_refresh_rate = 44;

// Optional configuration for setting cluster's DNS refresh rate. If the value is set to true,
Expand Down
3 changes: 0 additions & 3 deletions api/envoy/config/cluster/v3/cluster.proto
Original file line number Diff line number Diff line change
Expand Up @@ -645,9 +645,6 @@ message Cluster {
// other than :ref:`STRICT_DNS<envoy_api_enum_value_config.cluster.v3.Cluster.DiscoveryType.STRICT_DNS>` and
// :ref:`LOGICAL_DNS<envoy_api_enum_value_config.cluster.v3.Cluster.DiscoveryType.LOGICAL_DNS>` this setting is
// ignored.
//
// Note: Currently, DNS failures and empty DNS responses are not treated differently and this
// configuration is applied in both situations.
RefreshRate dns_failure_refresh_rate = 44;

// Optional configuration for setting cluster's DNS refresh rate. If the value is set to true,
Expand Down
33 changes: 19 additions & 14 deletions docs/root/intro/arch_overview/upstream/service_discovery.rst
Original file line number Diff line number Diff line change
Expand Up @@ -29,8 +29,10 @@ specified DNS targets. Each returned IP address in the DNS result will be consid
host in the upstream cluster. This means that if the query returns three IP addresses, Envoy will
assume the cluster has three hosts, and all three should be load balanced to. If a host is removed
from the result Envoy assumes it no longer exists and will drain traffic from any existing
connection pools. Note that Envoy never synchronously resolves DNS in the forwarding path. At the
expense of eventual consistency, there is never a worry of blocking on a long running DNS query.
connection pools. Consequently, if a successful DNS resolution returns 0 hosts, Envoy will assume
that the cluster does not have any hosts. Note that Envoy never synchronously resolves DNS in the
forwarding path. At the expense of eventual consistency, there is never a worry of blocking on a
long running DNS query.

If a single DNS name resolves to the same IP multiple times, these IPs will be de-duplicated.

Expand All @@ -39,10 +41,10 @@ This means that care should be taken if active health checking is used with DNS
to the same IPs: if an IP is repeated many times between DNS names it might cause undue load on the
upstream host.

If :ref:`respect_dns_ttl <envoy_api_field_Cluster.respect_dns_ttl>` is enabled, DNS record TTLs and
:ref:`dns_refresh_rate <envoy_api_field_Cluster.dns_refresh_rate>` are used to control DNS refresh rate.
For strict DNS cluster, if the minimum of all record TTLs is 0, :ref:`dns_refresh_rate <envoy_api_field_Cluster.dns_refresh_rate>`
will be used as the cluster's DNS refresh rate. :ref:`dns_refresh_rate <envoy_api_field_Cluster.dns_refresh_rate>`
If :ref:`respect_dns_ttl <envoy_api_field_Cluster.respect_dns_ttl>` is enabled, DNS record TTLs and
:ref:`dns_refresh_rate <envoy_api_field_Cluster.dns_refresh_rate>` are used to control DNS refresh rate.
For strict DNS cluster, if the minimum of all record TTLs is 0, :ref:`dns_refresh_rate <envoy_api_field_Cluster.dns_refresh_rate>`
will be used as the cluster's DNS refresh rate. :ref:`dns_refresh_rate <envoy_api_field_Cluster.dns_refresh_rate>`
defaults to 5000ms if not specified. The :ref:`dns_failure_refresh_rate <envoy_api_field_Cluster.dns_failure_refresh_rate>`
controls the refresh frequency during failures, and, if not configured, the DNS refresh rate will be used.

Expand All @@ -55,7 +57,10 @@ Logical DNS uses a similar asynchronous resolution mechanism to strict DNS. Howe
strictly taking the results of the DNS query and assuming that they comprise the entire upstream
cluster, a logical DNS cluster only uses the first IP address returned *when a new connection needs
to be initiated*. Thus, a single logical connection pool may contain physical connections to a
variety of different upstream hosts. Connections are never drained. This service discovery type is
variety of different upstream hosts. Connections are never drained,
including on a successful DNS resolution that returns 0 hosts.

This service discovery type is
optimal for large scale web services that must be accessed via DNS. Such services typically use
round robin DNS to return many different IP addresses. Typically a different result is returned for
each query. If strict DNS were used in this scenario, Envoy would assume that the cluster’s members
Expand All @@ -65,10 +70,10 @@ When interacting with large scale web services, this is the best of all possible
asynchronous/eventually consistent DNS resolution, long lived connections, and zero blocking in the
forwarding path.

If :ref:`respect_dns_ttl <envoy_api_field_Cluster.respect_dns_ttl>` is enabled, DNS record TTLs and
:ref:`dns_refresh_rate <envoy_api_field_Cluster.dns_refresh_rate>` are used to control DNS refresh rate.
For logical DNS cluster, if the TTL of first record is 0, :ref:`dns_refresh_rate <envoy_api_field_Cluster.dns_refresh_rate>`
will be used as the cluster's DNS refresh rate. :ref:`dns_refresh_rate <envoy_api_field_Cluster.dns_refresh_rate>`
If :ref:`respect_dns_ttl <envoy_api_field_Cluster.respect_dns_ttl>` is enabled, DNS record TTLs and
:ref:`dns_refresh_rate <envoy_api_field_Cluster.dns_refresh_rate>` are used to control DNS refresh rate.
For logical DNS cluster, if the TTL of first record is 0, :ref:`dns_refresh_rate <envoy_api_field_Cluster.dns_refresh_rate>`
will be used as the cluster's DNS refresh rate. :ref:`dns_refresh_rate <envoy_api_field_Cluster.dns_refresh_rate>`
defaults to 5000ms if not specified. The :ref:`dns_failure_refresh_rate <envoy_api_field_Cluster.dns_failure_refresh_rate>`
controls the refresh frequency during failures, and, if not configured, the DNS refresh rate will be used.

Expand All @@ -80,14 +85,14 @@ Original destination
Original destination cluster can be used when incoming connections are redirected to Envoy either
via an iptables REDIRECT or TPROXY target or with Proxy Protocol. In these cases requests routed
to an original destination cluster are forwarded to upstream hosts as addressed by the redirection
metadata, without any explicit host configuration or upstream host discovery.
metadata, without any explicit host configuration or upstream host discovery.
Connections to upstream hosts are pooled and unused hosts are flushed out when they have been idle longer than
:ref:`cleanup_interval <envoy_api_field_Cluster.cleanup_interval>`, which defaults to
5000ms. If the original destination address is not available, no upstream connection is opened.
Envoy can also pickup the original destination from a :ref:`HTTP header
Envoy can also pickup the original destination from a :ref:`HTTP header
<arch_overview_load_balancing_types_original_destination_request_header>`.
Original destination service discovery must be used with the original destination :ref:`load
balancer <arch_overview_load_balancing_types_original_destination>`.
balancer <arch_overview_load_balancing_types_original_destination>`.

.. _arch_overview_service_discovery_types_eds:

Expand Down
1 change: 1 addition & 0 deletions docs/root/intro/version_history.rst
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ Version history
* adaptive concurrency: fixed bug that allowed concurrency limits to drop below the configured
minimum.
* config: use type URL to select an extension whenever the config type URL (or its previous versions) uniquely identify a typed extension, see :ref:`extension configuration <config_overview_extension_configuration>`.
* dns: the STRICT_DNS cluster now only resolves to 0 hosts if DNS resolution successfully returns 0 hosts.
* http filters: http filter extensions use the "envoy.filters.http" name space. A mapping
of extension names is available in the :ref:`deprecated <deprecated>` documentation.
* ext_authz: disabled the use of lowercase string matcher for headers matching in HTTP-based `ext_authz`.
Expand Down
3 changes: 0 additions & 3 deletions generated_api_shadow/envoy/api/v2/cluster.proto

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

3 changes: 0 additions & 3 deletions generated_api_shadow/envoy/config/cluster/v3/cluster.proto

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

11 changes: 8 additions & 3 deletions include/envoy/network/dns.h
Original file line number Diff line number Diff line change
Expand Up @@ -45,12 +45,17 @@ class DnsResolver {
public:
virtual ~DnsResolver() = default;

/**
* Final status for a DNS resolution.
*/
enum class ResolutionStatus { Success, Failure };

/**
* Called when a resolution attempt is complete.
* @param response supplies the list of resolved IP addresses and TTLs. The list will be empty if
junr03 marked this conversation as resolved.
Show resolved Hide resolved
* the resolution failed.
* @param status supplies the final status of the resolution.
* @param response supplies the list of resolved IP addresses and TTLs.
*/
using ResolveCb = std::function<void(std::list<DnsResponse>&& response)>;
using ResolveCb = std::function<void(ResolutionStatus status, std::list<DnsResponse>&& response)>;

/**
* Initiate an async DNS resolution.
Expand Down
8 changes: 6 additions & 2 deletions source/common/network/dns_impl.cc
Original file line number Diff line number Diff line change
Expand Up @@ -89,7 +89,7 @@ void DnsResolverImpl::PendingResolution::onAresGetAddrInfoCallback(int status, i
// callback_ target _should_ still be around. In that case, raise the callback_ so the target
// can be done with this query and initiate a new one.
if (!cancelled_) {
callback_({});
callback_(ResolutionStatus::Failure, {});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should you have another status like cancelled/destructing/etc.? This is not quite a failure? I'm not sure it matters to higher layer code but it might?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this when I first introduced the destruction behavior. My first opinion is that higher layers only pivot behavior based on success/failure so I was going to keep it at that level of complexity for this change, and the subsequent PR where I fix #9927.

Then I was going to explore the need to add a Destruction value that allow us to set tighter timers than on "regular failure" when fixing for envoyproxy/envoy-mobile#673. But I am not sure if we need different behavior based on what particular failure scenario the resolver experienced. By the same token, people might want even more granularity in the future?

In short: I did not want to commit to more granular status in this initial PR, and I don't think it is going to be that bad to add more granularity, if desired, in subsequent PRs.

}
delete this;
return;
Expand All @@ -112,7 +112,9 @@ void DnsResolverImpl::PendingResolution::onAresGetAddrInfoCallback(int status, i
}

std::list<DnsResponse> address_list;
ResolutionStatus resolution_status;
if (status == ARES_SUCCESS) {
resolution_status = ResolutionStatus::Success;
if (addrinfo != nullptr && addrinfo->nodes != nullptr) {
if (addrinfo->nodes->ai_family == AF_INET) {
for (const ares_addrinfo_node* ai = addrinfo->nodes; ai != nullptr; ai = ai->ai_next) {
Expand Down Expand Up @@ -146,6 +148,8 @@ void DnsResolverImpl::PendingResolution::onAresGetAddrInfoCallback(int status, i

ASSERT(addrinfo != nullptr);
ares_freeaddrinfo(addrinfo);
} else {
resolution_status = ResolutionStatus::Failure;
}

if (timeouts > 0) {
Expand All @@ -155,7 +159,7 @@ void DnsResolverImpl::PendingResolution::onAresGetAddrInfoCallback(int status, i
if (completed_) {
if (!cancelled_) {
try {
callback_(std::move(address_list));
callback_(resolution_status, std::move(address_list));
} catch (const EnvoyException& e) {
ENVOY_LOG(critical, "EnvoyException in c-ares callback");
dispatcher_.post([s = std::string(e.what())] { throw EnvoyException(s); });
Expand Down
32 changes: 22 additions & 10 deletions source/common/upstream/logical_dns_cluster.cc
Original file line number Diff line number Diff line change
Expand Up @@ -100,23 +100,24 @@ void LogicalDnsCluster::startResolve() {

active_dns_query_ = dns_resolver_->resolve(
dns_address, dns_lookup_family_,
[this, dns_address](std::list<Network::DnsResponse>&& response) -> void {
[this, dns_address](Network::DnsResolver::ResolutionStatus status,
std::list<Network::DnsResponse>&& response) -> void {
active_dns_query_ = nullptr;
ENVOY_LOG(debug, "async DNS resolution complete for {}", dns_address);
info_->stats().update_success_.inc();

std::chrono::milliseconds refresh_rate = dns_refresh_rate_ms_;
if (!response.empty()) {
std::chrono::milliseconds final_refresh_rate = dns_refresh_rate_ms_;

// If the DNS resolver successfully resolved with an empty response list, the logical DNS
// cluster does not update. This ensures that a potentially previously resolved address does
// not stabilize back to 0 hosts.
if (status == Network::DnsResolver::ResolutionStatus::Success && !response.empty()) {
info_->stats().update_success_.inc();
// TODO(mattklein123): Move port handling into the DNS interface.
ASSERT(response.front().address_ != nullptr);
Network::Address::InstanceConstSharedPtr new_address =
Network::Utility::getAddressWithPort(*(response.front().address_),
Network::Utility::portFromTcpUrl(dns_url_));

if (respect_dns_ttl_ && response.front().ttl_ != std::chrono::seconds(0)) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved this for consistency with the strict dns cluster code. But happy to move back up.

refresh_rate = response.front().ttl_;
}

if (!logical_host_) {
logical_host_.reset(new LogicalHost(info_, hostname_, new_address, localityLbEndpoint(),
lbEndpoint(), nullptr));
Expand All @@ -140,13 +141,24 @@ void LogicalDnsCluster::startResolve() {
logical_host_->setNewAddress(new_address, lbEndpoint());
}

// reset failure backoff strategy because there was a success.
failure_backoff_strategy_->reset();

if (respect_dns_ttl_ && response.front().ttl_ != std::chrono::seconds(0)) {
final_refresh_rate = response.front().ttl_;
}
ENVOY_LOG(debug, "DNS refresh rate reset for {}, refresh rate {} ms", dns_address,
final_refresh_rate.count());
} else {
refresh_rate = std::chrono::milliseconds(failure_backoff_strategy_->nextBackOffMs());
info_->stats().update_failure_.inc();
final_refresh_rate =
std::chrono::milliseconds(failure_backoff_strategy_->nextBackOffMs());
ENVOY_LOG(debug, "DNS refresh rate reset for {}, (failure) refresh rate {} ms",
dns_address, final_refresh_rate.count());
}

onPreInitComplete();
resolve_timer_->enableTimer(refresh_rate);
resolve_timer_->enableTimer(final_refresh_rate);
});
}

Expand Down
Loading