Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Outlier detection support in DestinationRule for grpc proxyless #3000

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
448 changes: 448 additions & 0 deletions kubernetes/customresourcedefinitions.gen.yaml

Large diffs are not rendered by default.

654 changes: 471 additions & 183 deletions networking/v1alpha3/destination_rule.pb.go

Large diffs are not rendered by default.

177 changes: 175 additions & 2 deletions networking/v1alpha3/destination_rule.pb.html

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

67 changes: 66 additions & 1 deletion networking/v1alpha3/destination_rule.proto
Original file line number Diff line number Diff line change
Expand Up @@ -962,6 +962,71 @@ message OutlierDetection {
// disabled by setting it to 0%. The default is 0% as it's not typically
// applicable in k8s environments with few pods per service.
int32 min_health_percent = 5;

// FailurePercent Algorithm for deciding if the host is an outlier or not.
FailurePercentageEjection failure_percentage_ejection = 10;

// SuccessRateEjection Algorithm for deciding if the host is an outlier or not.
SuccessRateEjection success_rate_ejection = 11;
}

// Parameters for the failure percentage algorithm.
// This algorithm ejects individual endpoints whose failure rate is greater than
// some threshold, independently of any other endpoint
// Defaults to values declared in envoy.config.cluster.v3.OutlierDetection.SuccessRateEjection
message FailurePercentageEjection {
// The failure percentage to use when determining failure percentage-based outlier detection. If
// the failure percentage of a given address is greater than or equal to this value, it will be
// ejected.
google.protobuf.UInt32Value threshold = 1;

// The % chance that an address will be actually ejected when an outlier status is detected through
// failure percentage statistics. This setting can be used to disable ejection or to ramp it up
// slowly.
google.protobuf.UInt32Value enforcement_percentage = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this? I am struggling to see the use case

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enforcement_percentage this is like what is the probability for a host to be removed from LB pool when detected as outlier.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know what it is, but I am not clear why we need to offer it or why a user would want to set it. "It exists in the Envoy API" is not a reason why, either -- Envoy's API has different design principals than Istio

Copy link

@anuragagarwal561994 anuragagarwal561994 Nov 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@howardjohn I think I understand the use-case here, lets say that there are 500 servers and 50 of them started showing ejection behaviour. Then the outlier detection algorithm will eject all these 50 servers and distribute its load on rest of the 450 servers, this might be a big traffic shift for other servers to handle.

On the contrary, having this setting to lets say 50 will eliminate 25 and then 12 and so on, such that the traffic is ramped up in corresponding intervals.

Although none of the documentations (both enovy / grpc-xds) clearly mentions in which scenario this is applicable. But from what I see it may help prevent issues with temporary spikes and help smoothen out the curve.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well - the documentation on this PR is also pretty unclear. My concern is that reading either doc - I wouldn't know how to use it, and most likely people will use it wrong or not take advantage if it is actually needed - on an already over-complex system.

I understand there is a need - and envoy and proxyless may support this - but if we go down the path of supporting each and every envoy field we may be better off with just adopting Envoy API ( which we already do in EnvoyFilter - with well known problems ).


// The minimum number of addresses in order to perform failure percentage-based ejection.
// If the total number of addresses is less than this value, failure percentage-based
// ejection will not be performed.
google.protobuf.UInt32Value minimum_hosts = 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this in the context of Istio? How do we expect users to set it?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@howardjohn for this and the remaining fields, I am a little confused when we say in the context of istio, I would like to better understand how do we decide which properties do we need to translate to data plane like envoy / grpc-xds, because istio is itself might not be doing any action here excpet for providing the configurations and if a user would like to configure these things for their setup, how would one ensure that they are made available to them via the destination rule otherwise?

I would say these are these parameters are more or less tuning parameters and may not be left as well left to default values (which is 5), because they would otherwise not work for the cases where the service is having less than 5 hosts. They are better tuned as per the traffic patterns and the use cases of the user.

I would have given better answer if I would have used it on staging / production setup and tried and tested it, but to test it we would as well be needing these features in istio.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally users try it with EnvoyFilter - adding it to a beta API just to test it out and figure out that only 1 or 2 users understand how to use it - but we have to maintain it for a long time is not ideal.

I know John has a different opinion on new CRDs - but given that Ambient will have particular troubles with DestinationRule ( per-node ztunnel won't apply them - and waypoints can't respect the full semantics ) - this sounds like something that would be much better off in a separate CRD, like CircuitBreaking, with the new attachment model.

DR is in particular tricky because it can be used both client side and server side. It is unreasonable to expect clients to know how many servers are running ( and all of them to update when this change). With auto-scaling - I doubt even the producer can know this. Some of those settings would be far better left to Istiod to compute.


// The minimum number of total requests that must be collected in one interval
// to perform failure percentage-based ejection for this address. If the
// volume is lower than this setting, failure percentage-based ejection will not be performed for
// this host.
google.protobuf.UInt32Value request_volume = 4;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if it is a sliding window or not

Copy link
Author

@aditya-32 aditya-32 Nov 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the request for the host should be atleast request_volume count within the interval period in order to be eligible for outlierDetection

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this in the context of Istio? How do we expect users to set it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@howardjohn make sense will check and update regarding the fields priority and usecase!

}

// Parameters for the success rate ejection algorithm.
// This algorithm monitors the request success rate for all endpoints and
// ejects individual endpoints whose success rates are statistical outliers.
// Defaults to values declared in envoy.config.cluster.v3.OutlierDetection.SuccessRateEjection
message SuccessRateEjection {
// This factor is used to determine the ejection threshold for success rate
// outlier ejection. The ejection threshold is the difference between the
// mean success rate, and the product of this factor and the standard
// deviation of the mean success rate: mean - (stdev *
// success_rate_stdev_factor). This factor is divided by a thousand to get a
// double. That is, if the desired factor is 1.9, the runtime value should
// be 1900.
google.protobuf.UInt32Value stdev_factor = 1;

// The % chance that an address will be actually ejected when an outlier status
// is detected through success rate statistics. This setting can be used to
// disable ejection or to ramp it up slowly.
google.protobuf.UInt32Value enforcement_percentage = 2;

// The number of addresses that must have enough request volume to
// detect success rate outliers. If the number of addresses is less than this
// setting, outlier detection via success rate statistics is not performed
// for any addresses.
google.protobuf.UInt32Value minimum_hosts = 3;

// The minimum number of total requests that must be collected in one
// interval to include this address in success rate based outlier detection.
// If the volume is lower than this setting, outlier detection via success
// rate statistics is not performed for that address.
google.protobuf.UInt32Value request_volume = 4;
}

// SSL/TLS related settings for upstream connections. See Envoy's [TLS
Expand All @@ -972,7 +1037,7 @@ message OutlierDetection {
// for connections to upstream database cluster.
//
// {{<tabset category-name="example">}}
// {{<tab name="v1alpha3" category-value="v1alpha3">}}
// {{<tab name="v1alpha3" category-value="v1lpha3">}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is accidental.

// ```yaml
// apiVersion: networking.istio.io/v1alpha3
// kind: DestinationRule
Expand Down
42 changes: 42 additions & 0 deletions networking/v1alpha3/destination_rule_deepcopy.gen.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

22 changes: 22 additions & 0 deletions networking/v1alpha3/destination_rule_json.gen.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading