Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upstream: add failure percentage-based outlier detection #8130

Merged
merged 13 commits into from
Sep 12, 2019
27 changes: 27 additions & 0 deletions api/envoy/api/v2/cluster/outlier_detection.proto
Original file line number Diff line number Diff line change
Expand Up @@ -114,4 +114,31 @@ message OutlierDetection {
// is set to true.
google.protobuf.UInt32Value enforcing_local_origin_success_rate = 15
[(validate.rules).uint32.lte = 100];

// The failure percentage to use when determining failure percentage-based outlier detection. If
// the failure percentage of a given host is greater than or equal to this value, it will be
// ejected. Defaults to 85.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a reasonable default? I picked this entirely arbitrarily, so I have no attachment to it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, my instinct would be to say if the type were OutlierEjectionType::FAILURE_PERCENTAGE and has_failure_percentage_threshold were not set, we'd just fail to accept the config.
Looks like that's not terribly consistent with the rest of outlier config, so yeah, I think any well documented value is fine.
Maybe we can tag this for some v4 cleanup (@htuch)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, just to clarify here - alright to leave it as is now and tag with a [#next-major-version] comment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

definitely fine to leave as-is. Bonus points if you file a tech debt issue, and mention it should be tagged with v4 (I don't think you can add tags)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on filing a tech debt issue on cleaning up all of these messages, agreed it is a big mess as it has evolved quite a bit trying to not break backwards compat. I can tag it correctly once it is opened.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've filed #8219 for this one.

google.protobuf.UInt32Value failure_percentage_threshold = 16 [(validate.rules).uint32.lte = 100];

// The % chance that a host will be actually ejected when an outlier status is detected through
// failure percentage statistics. This setting can be used to disable ejection or to ramp it up
// slowly. Defaults to 0.
google.protobuf.UInt32Value enforcing_failure_percentage = 17 [(validate.rules).uint32.lte = 100];

// The % chance that a host will be actually ejected when an outlier status is detected through
// local-origin failure percentage statistics. This setting can be used to disable ejection or to
// ramp it up slowly. Defaults to 0.
google.protobuf.UInt32Value enforcing_failure_percentage_local_origin = 18
[(validate.rules).uint32.lte = 100];

// The minimum number of hosts in a cluster in order to perform failure percentage-based ejection.
// If the total number of hosts in the cluster is less than this value, failure percentage-based
// ejection will not be performed. Defaults to 5.
google.protobuf.UInt32Value failure_percentage_minimum_hosts = 19;

// The minimum number of total requests that must be collected in one interval (as defined by the
// interval duration above) to perform failure percentage-based ejection for this host. If the
// volume is lower than this setting, failure percentage-based ejection will not be performed for
// this host. Defaults to 50.
google.protobuf.UInt32Value failure_percentage_request_volume = 20;
}
12 changes: 12 additions & 0 deletions api/envoy/data/cluster/v2alpha/outlier_detection_event.proto
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,7 @@ message OutlierDetectionEvent {
option (validate.required) = true;
OutlierEjectSuccessRate eject_success_rate_event = 9;
OutlierEjectConsecutive eject_consecutive_event = 10;
OutlierEjectFailurePercentage eject_failure_percentage_event = 11;
}
}

Expand Down Expand Up @@ -78,6 +79,12 @@ enum OutlierEjectionType {
// is set to *true*.
// See :ref:`Cluster outlier detection <arch_overview_outlier_detection>` documentation for
SUCCESS_RATE_LOCAL_ORIGIN = 4;
// Runs over aggregated success rate statistics from every host in cluster and selects hosts for
// which ratio of failed replies is above configured value.
FAILURE_PERCENTAGE = 5;
// Runs over aggregated success rate statistics for local origin failures from every host in
// cluster and selects hosts for which ratio of failed replies is above configured value.
FAILURE_PERCENTAGE_LOCAL_ORIGIN = 6;
}

// Represents possible action applied to upstream host
Expand All @@ -100,3 +107,8 @@ message OutlierEjectSuccessRate {

message OutlierEjectConsecutive {
}

message OutlierEjectFailurePercentage {
// Host's success rate at the time of the ejection event on a 0-100 range.
uint32 host_success_rate = 1 [(validate.rules).uint32.lte = 100];
}
Original file line number Diff line number Diff line change
Expand Up @@ -102,6 +102,31 @@ outlier_detection.success_rate_stdev_factor
<envoy_api_field_cluster.OutlierDetection.success_rate_stdev_factor>`
setting in outlier detection

outlier_detection.enforcing_failure_percentage
:ref:`enforcing_failure_percentage
<envoy_api_field_cluster.OutlierDetection.enforcing_failure_percentage>`
setting in outlier detection

outlier_detection.enforcing_failure_percentage_local_origin
:ref:`enforcing_failure_percentage_local_origin
<envoy_api_field_cluster.OutlierDetection.enforcing_failure_percentage_local_origin>`
setting in outlier detection

outlier_detection.failure_percentage_request_volume
:ref:`failure_percentage_request_volume
<envoy_api_field_cluster.OutlierDetection.failure_percentage_request_volume>`
setting in outlier detection

outlier_detection.failure_percentage_minimum_hosts
:ref:`failure_percentage_minimum_hosts
<envoy_api_field_cluster.OutlierDetection.failure_percentage_minimum_hosts>`
setting in outlier detection

outlier_detection.failure_percentage_threshold
:ref:`failure_percentage_threshold
<envoy_api_field_cluster.OutlierDetection.failure_percentage_threshold>`
setting in outlier detection

Core
----

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -141,6 +141,10 @@ statistics will be rooted at *cluster.<name>.outlier_detection.* and contain the
ejections_detected_consecutive_local_origin_failure, Counter, Number of detected consecutive local origin failure ejections (even if unenforced)
ejections_enforced_local_origin_success_rate, Counter, Number of enforced success rate outlier ejections for locally originated failures
ejections_detected_local_origin_success_rate, Counter, Number of detected success rate outlier ejections for locally originated failures (even if unenforced)
ejections_enforced_failure_percentage, Counter, Number of enforced failure percentage outlier ejections. Exact meaning of this counter depends on :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>` config item. Refer to :ref:`Outlier Detection documentation<arch_overview_outlier_detection>` for details.
ejections_detected_failure_percentage, Counter, Number of detected failure percentage outlier ejections (even if unenforced). Exact meaning of this counter depends on :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>` config item. Refer to :ref:`Outlier Detection documentation<arch_overview_outlier_detection>` for details.
ejections_enforced_failure_percentage_local_origin, Counter, Number of enforced failure percentage outlier ejections for locally originated failures
ejections_detected_failure_percentage_local_origin, Counter, Number of detected failure percentage outlier ejections for locally originated failures (even if unenforced)
ejections_total, Counter, Deprecated. Number of ejections due to any outlier type (even if unenforced)
ejections_consecutive_5xx, Counter, Deprecated. Number of consecutive 5xx ejections (even if unenforced)

Expand Down
27 changes: 27 additions & 0 deletions docs/root/intro/arch_overview/upstream/outlier.rst
Original file line number Diff line number Diff line change
Expand Up @@ -145,6 +145,33 @@ Most configuration items, namely
types of errors, but :ref:`outlier_detection.enforcing_success_rate<envoy_api_field_cluster.OutlierDetection.enforcing_success_rate>` applies
to externally originated errors only and :ref:`outlier_detection.enforcing_local_origin_success_rate<envoy_api_field_cluster.OutlierDetection.enforcing_local_origin_success_rate>` applies to locally originated errors only.

.. _arch_overview_outlier_detection_failure_percentage:

Failure Percentage
^^^^^^^^^^^^^^^^^^

Failure Percentage based outlier ejection functions similarly to the success rate detecion type, in
that it relies on success rate data from each host in a cluster. However, rather than compare those
values to the mean success rate of the cluster as a whole, they are compared to a flat
user-configured threshold. This threshold is configured via the
:ref:`outlier_detection.failure_percentage_threshold<envoy_api_field_cluster.OutlierDetection.failure_percentage_threshold>`
field.

The other configuration fields for failure percentage based ejection are similar to the fields for
success rate ejection. Failure percentage based ejection also obeys
:ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`;
the enforcement percentages for externally- and locally-originated errors are controlled by
:ref:`outlier_detection.enforcing_failure_percentage<envoy_api_field_cluster.OutlierDetection.enforcing_failure_percentage>`
and
:ref:`outlier_detection.enforcing_failure_percentage_local_origin<envoy_api_field_cluster.OutlierDetection.enforcing_failure_percentage_local_origin>`,
respectively. As with success rate detection, detection will not be performed for a host if its
request volume over the aggregation interval is less than the
:ref:`outlier_detection.failure_percentage_request_volume<envoy_api_field_cluster.OutlierDetection.failure_percentage_request_volume>`
value. Detection also will not be performed for a cluster if the number of hosts with the minimum
required request volume in an interval is less than the
:ref:`outlier_detection.failure_percentage_minimum_hosts<envoy_api_field_cluster.OutlierDetection.failure_percentage_minimum_hosts>`
value.


.. _arch_overview_outlier_detection_logging:

Expand Down
1 change: 1 addition & 0 deletions docs/root/intro/version_history.rst
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,7 @@ Version history
* tracing: added tags for gRPC response status and meesage.
* upstream: added :ref:`an option <envoy_api_field_Cluster.CommonLbConfig.close_connections_on_host_set_change>` that allows draining HTTP, TCP connection pools on cluster membership change.
* upstream: added network filter chains to upstream connections, see :ref:`filters<envoy_api_field_Cluster.filters>`.
* upstream: added new :ref:`failure-percentage based outlier detection<arch_overview_outlier_detection_failure_percentage>` mode.
* upstream: use p2c to select hosts for least-requests load balancers if all host weights are the same, even in cases where weights are not equal to 1.
* zookeeper: parse responses and emit latency stats.

Expand Down
98 changes: 98 additions & 0 deletions source/common/upstream/outlier_detection_impl.cc
Original file line number Diff line number Diff line change
Expand Up @@ -222,12 +222,22 @@ DetectorConfig::DetectorConfig(const envoy::api::v2::cluster::OutlierDetection&
PROTOBUF_GET_WRAPPED_OR_DEFAULT(config, success_rate_request_volume, 100))),
success_rate_stdev_factor_(static_cast<uint64_t>(
PROTOBUF_GET_WRAPPED_OR_DEFAULT(config, success_rate_stdev_factor, 1900))),
failure_percentage_threshold_(static_cast<uint64_t>(
PROTOBUF_GET_WRAPPED_OR_DEFAULT(config, failure_percentage_threshold, 85))),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is an established pattern in this file. But it would be good to replace these magic numbers with named constants. Similar to how it is done for Http2Settings here:

struct Http2Settings {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth doing in this PR, or a followup?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Followup PR is fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thank you

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #8221 for this; will update for the new fields once this PR is merged.

failure_percentage_minimum_hosts_(static_cast<uint64_t>(
PROTOBUF_GET_WRAPPED_OR_DEFAULT(config, failure_percentage_minimum_hosts, 5))),
failure_percentage_request_volume_(static_cast<uint64_t>(
PROTOBUF_GET_WRAPPED_OR_DEFAULT(config, failure_percentage_request_volume, 50))),
enforcing_consecutive_5xx_(static_cast<uint64_t>(
PROTOBUF_GET_WRAPPED_OR_DEFAULT(config, enforcing_consecutive_5xx, 100))),
enforcing_consecutive_gateway_failure_(static_cast<uint64_t>(
PROTOBUF_GET_WRAPPED_OR_DEFAULT(config, enforcing_consecutive_gateway_failure, 0))),
enforcing_success_rate_(static_cast<uint64_t>(
PROTOBUF_GET_WRAPPED_OR_DEFAULT(config, enforcing_success_rate, 100))),
enforcing_failure_percentage_(static_cast<uint64_t>(
PROTOBUF_GET_WRAPPED_OR_DEFAULT(config, enforcing_failure_percentage, 0))),
enforcing_failure_percentage_local_origin_(static_cast<uint64_t>(
PROTOBUF_GET_WRAPPED_OR_DEFAULT(config, enforcing_failure_percentage_local_origin, 0))),
split_external_local_origin_errors_(config.split_external_local_origin_errors()),
consecutive_local_origin_failure_(static_cast<uint64_t>(
PROTOBUF_GET_WRAPPED_OR_DEFAULT(config, consecutive_local_origin_failure, 5))),
Expand Down Expand Up @@ -355,6 +365,13 @@ bool DetectorImpl::enforceEjection(envoy::data::cluster::v2alpha::OutlierEjectio
return runtime_.snapshot().featureEnabled(
"outlier_detection.enforcing_local_origin_success_rate",
config_.enforcingLocalOriginSuccessRate());
case envoy::data::cluster::v2alpha::OutlierEjectionType::FAILURE_PERCENTAGE:
return runtime_.snapshot().featureEnabled("outlier_detection.enforcing_failure_percentage",
config_.enforcingFailurePercentage());
case envoy::data::cluster::v2alpha::OutlierEjectionType::FAILURE_PERCENTAGE_LOCAL_ORIGIN:
return runtime_.snapshot().featureEnabled(
"outlier_detection.enforcing_failure_percentage_local_origin",
config_.enforcingFailurePercentageLocalOrigin());
default:
// Checked by schema.
NOT_REACHED_GCOVR_EXCL_LINE;
Expand Down Expand Up @@ -382,6 +399,12 @@ void DetectorImpl::updateEnforcedEjectionStats(
case envoy::data::cluster::v2alpha::OutlierEjectionType::SUCCESS_RATE_LOCAL_ORIGIN:
stats_.ejections_enforced_local_origin_success_rate_.inc();
break;
case envoy::data::cluster::v2alpha::OutlierEjectionType::FAILURE_PERCENTAGE:
stats_.ejections_enforced_failure_percentage_.inc();
break;
case envoy::data::cluster::v2alpha::OutlierEjectionType::FAILURE_PERCENTAGE_LOCAL_ORIGIN:
stats_.ejections_enforced_local_origin_failure_percentage_.inc();
break;
default:
// Checked by schema.
NOT_REACHED_GCOVR_EXCL_LINE;
Expand All @@ -406,6 +429,12 @@ void DetectorImpl::updateDetectedEjectionStats(
case envoy::data::cluster::v2alpha::OutlierEjectionType::SUCCESS_RATE_LOCAL_ORIGIN:
stats_.ejections_detected_local_origin_success_rate_.inc();
break;
case envoy::data::cluster::v2alpha::OutlierEjectionType::FAILURE_PERCENTAGE:
stats_.ejections_detected_failure_percentage_.inc();
break;
case envoy::data::cluster::v2alpha::OutlierEjectionType::FAILURE_PERCENTAGE_LOCAL_ORIGIN:
stats_.ejections_detected_local_origin_failure_percentage_.inc();
break;
default:
// Checked by schema.
NOT_REACHED_GCOVR_EXCL_LINE;
Expand Down Expand Up @@ -609,6 +638,63 @@ void DetectorImpl::processSuccessRateEjections(
}
}

void DetectorImpl::processFailurePercentageEjections(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we refactor processSuccessRateEjections for shared utilities? It seems like there's a lot of overlap

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a first pass at this - there's still an unfortunate amount of duplication within the function due to the interactions of minimum hosts/request volume, but I was able to drop a few lines. Do you see any other obvious areas to improve?

DetectorHostMonitor::SuccessRateMonitorType monitor_type) {
uint64_t failure_percentage_minimum_hosts =
runtime_.snapshot().getInteger("outlier_detection.failure_percentage_minimum_hosts",
config_.failurePercentageMinimumHosts());
uint64_t failure_percentage_request_volume =
runtime_.snapshot().getInteger("outlier_detection.failure_percentage_request_volume",
config_.failurePercentageRequestVolume());
std::vector<HostSuccessRatePair> valid_failure_percentage_hosts;

// Exit early if there are not enough hosts.
if (host_monitors_.size() < failure_percentage_minimum_hosts) {
return;
}

// reserve upper bound of vector size to avoid reallocation.
valid_failure_percentage_hosts.reserve(host_monitors_.size());

for (const auto& host : host_monitors_) {
if (!host.first->healthFlagGet(Host::HealthFlag::FAILED_OUTLIER_CHECK)) {
absl::optional<double> host_success_rate =
host.second->getSRMonitor(monitor_type)
.successRateAccumulator()
.getSuccessRate(failure_percentage_request_volume);

if (host_success_rate) {
valid_failure_percentage_hosts.emplace_back(
HostSuccessRatePair(host.first, host_success_rate.value()));
}
}
}

if (!valid_failure_percentage_hosts.empty() &&
valid_failure_percentage_hosts.size() >= failure_percentage_minimum_hosts) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we also need to check to see if host_monitors_.size() >= failure_percentage_minimum_hosts?
Basically if success_rate_minimum_hosts or failure_percentage_minimum_hosts is low enough we'll gather data for both, but we only want to act on one.

Ditto for avoiding doing work for success rate if success_rate_minimum_hosts is lower

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is okay since valid_failure_percentage_hosts and valid_success_rate_hosts are subsets of host_monitors_, so valid_failure_percentage_hosts.size() >= failure_percentage_minimum_hosts implies host_monitors_.size() >= failure_percentage_minimum_hosts.

const double failure_percentage_threshold =
runtime_.snapshot().getInteger("outlier_detection.failure_percentage_threshold",
config_.failurePercentageThreshold()) /
100.0;

for (const auto& host_success_rate_pair : valid_failure_percentage_hosts) {
if ((100.0 - host_success_rate_pair.success_rate_) >= failure_percentage_threshold) {
// We should eject.

// The ejection type returned by the SuccessRateMonitor's getEjectionType() will be a
// SUCCESS_RATE type, so we need to figure it out for ourselves.
const envoy::data::cluster::v2alpha::OutlierEjectionType type =
(monitor_type == DetectorHostMonitor::SuccessRateMonitorType::ExternalOrigin)
? envoy::data::cluster::v2alpha::OutlierEjectionType::FAILURE_PERCENTAGE
: envoy::data::cluster::v2alpha::OutlierEjectionType::
FAILURE_PERCENTAGE_LOCAL_ORIGIN;
updateDetectedEjectionStats(type);
ejectHost(host_success_rate_pair.host_, type);
}
}
}
}

void DetectorImpl::onIntervalTimer() {
MonotonicTime now = time_source_.monotonicTime();

Expand All @@ -626,6 +712,9 @@ void DetectorImpl::onIntervalTimer() {
processSuccessRateEjections(DetectorHostMonitor::SuccessRateMonitorType::ExternalOrigin);
processSuccessRateEjections(DetectorHostMonitor::SuccessRateMonitorType::LocalOrigin);

processFailurePercentageEjections(DetectorHostMonitor::SuccessRateMonitorType::ExternalOrigin);
processFailurePercentageEjections(DetectorHostMonitor::SuccessRateMonitorType::LocalOrigin);

armIntervalTimer();
}

Expand Down Expand Up @@ -660,6 +749,15 @@ void EventLoggerImpl::logEject(const HostDescriptionConstSharedPtr& host, Detect
detector.successRateEjectionThreshold(monitor_type));
event.mutable_eject_success_rate_event()->set_host_success_rate(
host->outlierDetector().successRate(monitor_type));
} else if ((type == envoy::data::cluster::v2alpha::OutlierEjectionType::FAILURE_PERCENTAGE) ||
(type == envoy::data::cluster::v2alpha::OutlierEjectionType::
csssuf marked this conversation as resolved.
Show resolved Hide resolved
FAILURE_PERCENTAGE_LOCAL_ORIGIN)) {
const DetectorHostMonitor::SuccessRateMonitorType monitor_type =
(type == envoy::data::cluster::v2alpha::OutlierEjectionType::FAILURE_PERCENTAGE)
? DetectorHostMonitor::SuccessRateMonitorType::ExternalOrigin
: DetectorHostMonitor::SuccessRateMonitorType::LocalOrigin;
event.mutable_eject_failure_percentage_event()->set_host_success_rate(
host->outlierDetector().successRate(monitor_type));
} else {
event.mutable_eject_consecutive_event();
}
Expand Down
Loading