-
Notifications
You must be signed in to change notification settings - Fork 347
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: Support Circuit Breakers in BackendTrafficPolicy #2284
Conversation
Signed-off-by: Guy Daich <guy.daich@sap.com>
🚀 Thank you for contributing to the Envoy Gateway project! 🚀 Before merging, please ensure to follow the process below:
NOTE: Once your PR is under review, please do not rebase and force push it. Otherwise, it will force your reviewers to review the PR from scratch rather than simply look at your latest changes. What's more, you can help expedite the processing of your PR by
|
🚀 Deployed on https://gateway-pr-2284-preview--eg-preview.netlify.app |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #2284 +/- ##
==========================================
+ Coverage 64.38% 64.41% +0.03%
==========================================
Files 112 112
Lines 15874 15882 +8
==========================================
+ Hits 10220 10230 +10
+ Misses 5005 5004 -1
+ Partials 649 648 -1 ☔ View full report in Codecov by Sentry. |
api/v1alpha1/circuitbreaker_types.go
Outdated
// The maximum number of connections that Envoy will make to the referenced backend (per xRoute). | ||
// Default: 1024 | ||
// | ||
// +optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better go with CEL for all these fields, for example:
// +kubebuilder:validation:Minimum=xxx
// +kubebuilder:validation:Maximum=xxx
// +kubebuilder:default=xxx
// If not set, circuit breakers will be enabled with the highest supported thresholds | ||
// | ||
// +optional | ||
CircuitBreakers *CircuitBreakers `json:"circuitBreakers,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why plural?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the term used by Envoy, the Backend Traffic Policy and other OSS projects like Emissary Ingress. A single thresholds
struct defines multiple breakers (requests, connection, etc. ), so plural form does sound appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we only need one Thresholds
in the CircuitBreakers
structure, the name should be singular.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on it, please fix the CI lint errors first
Signed-off-by: Guy Daich <guy.daich@sap.com>
// | ||
// +kubebuilder:validation:Minimum=0 | ||
// +kubebuilder:validation:Maximum=4294967295 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is 4294967295
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the maximum value of uint32
, but i think this Maximum
validation can be optional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, my bad, think about 2147483647
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can -1
pass if the type is uint32?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, it cannot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's what it means.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @zirain, @shawnh2. Do note that the OpenAPI spec (used by K8s CRDs) doesn't really support unsigned ints: https://swagger.io/specification/. The controller-gen tools actually produce a schema that refers to these fields as int32
in the generated CRD. The actual K8s API server behavior, from my limited check, is to treat these fields as int64
. I think that the actual go type (*uint32
) mostly impacts the unmarshalling done by client go. So, guaranteeing that the value stored is actually safe to cast to uint32 could be useful...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another approach would be to use int64
explicitly in the go types layer and have uint32
as a representation in the IR layer and downwards. The value range validation can occur either using the schema or during the IR translation. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
like Gateway API project, let's use *int32
with valiation min and max?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it make more sense to use *int64
? MaxUInt32
> MaxInt32
, so by using *int32
users would not able to use the full value range provided by Envoy.
api/v1alpha1/circuitbreaker_types.go
Outdated
// +kubebuilder:validation:Maximum=4294967295 | ||
// +kubebuilder:default=3 | ||
// +optional | ||
MaxRetries *uint32 `json:"maxRetries,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MaxRetries
is part of the retry and would be better included in the retry, the corresponding field is maxParallel
#2168
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're talking about two different things here: the MaxConcurrentRetries
of a cluster and the MaxRetries
of an individual request.
MaxConcurrentRetries
belongs to the Circuit Breaker configuration, and MaxRetries
belongs to the Retry configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. However, note that #2168 deals with RetryBudget
which is also a part of Envoy's Circuit Breaker settings.
In Envoy, the separation of route and cluster settings is pretty clear. Multiple routes can point to the same cluster, and so MaxRetries
, RetryBudget
will apply to all the traffic coming from routes that share an upstream cluster. The motivation is to protect the upstream system from a retry storm.
In Envoy Gateway, we have a cluster for each xRoute
. So, it doesn't make much difference if these settings are managed under the Retries
or the CircuitBreakers
section. It is important that the users understand the implications of these settings - overflowing retries will be queued and later dropped.
If in the future Envoy Gateway does support a notion of shared backends (e.g. by translating services to clusters in some situations) and Envoy Gateway will support a SharedBackendTrafficPolicy
, I expect that this policy will include CircuitBreakers
but not Retries
. So, for future reusability, it could be better to have these settings under the circuit breaker types.
Another aspect to consider is that these settings are scoped to a Routing Priority level. As long as only the Default
level is supported, it doesn't really matter where we place these settings. However, if multiple priority levels are supported in the future, the RateLimitStrategy API will need to be extended to support multiple routing priorities, and the translation logic will need to carefully merge that with the other circuit breaker settings.
I'm willing to drop MaxRetries
from this PR for now. We can continue the discussion in #2168 on the best location for these settings in the API and implement it as part of that PR. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Envoy Gateway, we have a cluster for each xRoute. So, it doesn't make much difference if these settings are managed under the Retries or the CircuitBreakers section. It is important that the users understand the implications of these settings - overflowing retries will be queued and later dropped.
IMO, the concurrent max retries setting belongs to Circuit Breaker logic because it enforces back pressure on the clients. Therefore, EG probably should not mix it with the request retries configuration.
Use Istio as an example: Istio puts them into two places: the concurrent max retries setting in the DestinationRule and request retries in the VirtualService.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, we should distinguish between the circuit_breaker
and retryStrategy
functions for users, as they offer distinct features, not limited to the design of Envoy. Even though retry budget
and concurrent max retries
are implemented in the circuit_breaker
in Envoy, for users, these encompass retry functions that provide richer options for retry operations.
Regarding the shared cluster, it's an aspect that requires careful consideration. However, I'm currently uncertain about its usage. It might be an implementation similar to the Istio DestinationRule resource. If that's the case, one could patch the BackendTrafficPolicy to the DestinationRule (DR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, we should distinguish between the
circuit_breaker
andretryStrategy
functions for users, as they offer distinct features, not limited to the design of Envoy. Even thoughretry budget
andconcurrent max retries
are implemented in thecircuit_breaker
in Envoy, for users, these encompass retry functions that provide richer options for retry operations.
I vote -1 on this.
Even though both have "retries" in their name, they serve two different purposes. The concurrent max retries setting is inherently associated with the Circuit Breaker, which fails requests quickly when a lot of retries happen and apply back pressure on downstream. On the other hand, request retries are specifically designed to mitigate transient network issues. Would love more insights from @kflynn and other @envoyproxy/gateway-maintainers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since @arkodg #2284 (comment) and @tmsnan support removing MaxRetries
from this API proposal, I'll go ahead and remove it. If we eventually decide that CircuitBreakers
should contain these settings, we can add them later on.
api/v1alpha1/circuitbreaker_types.go
Outdated
// | ||
// +kubebuilder:validation:MaxItems:=1 | ||
// +optional | ||
Thresholds []Thresholds `json:"thresholds,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in the description of this PR, only one Thresholds
is needed here because Envoy Gateway doesn't support Routing Priorities
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kept this as list for two reasons:
- Future-proofing: if/when EG does support routing priorities, there would be no need to add another list. We can just add another optional
priority
field toThresholds
and cancel the length validation. - Compatibility with the original proposal by @AliceProxy in BTP
We can decide to keep things simpler for the common use case of tweaking default circuit breakers, and, in the future, allow a list for non-default priorities.
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
name: example-policy-with-circuitbreakers
spec:
[...]
circuitBreakers: # optional
maxConnections: uint32 # optional, default=1024
maxPendingRequests: uint32 # optional, default=1024
maxParallelRequests: uint32 # optional, default=1024
maxRetries: uint32 # optional, default=3
additionalThresholds: #optional [Future, not in this PR]
- priority: [HIGH|...]
maxConnections: uint32 # optional, default=1024
maxPendingRequests: uint32 # optional, default=1024
maxParallelRequests: uint32 # optional, default=1024
maxRetries: uint32 # optional, default=3
[...]
If the maintainers support this I don't have any objection . WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works for me. A one-member list just looks weird to me. But would love to hear @AliceProxy 's opinion on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vote to keep it singular for now
circuitBreaker:
maxConnections:
maxPendingRequests:
maxParallelRequests:
I dont see routing priority being added in the upstream Gateway API, and if it does, we can deprecate circuitBreaker
in favor of circuitBreakers
@@ -65,6 +65,12 @@ type BackendTrafficPolicySpec struct { | |||
// | |||
// +optional | |||
TCPKeepalive *TCPKeepalive `json:"tcpKeepalive,omitempty"` | |||
|
|||
// Circuit Breaker settings for the upstream connections and requests. | |||
// If not set, circuit breakers will be enabled with the highest supported thresholds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// If not set, circuit breakers will be enabled with the highest supported thresholds | |
// If not set, circuit breakers will be enabled with default thresholds |
api/v1alpha1/circuitbreaker_types.go
Outdated
} | ||
|
||
type Thresholds struct { | ||
// The maximum number of connections that Envoy will establish to the referenced backend (per xRoute). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// The maximum number of connections that Envoy will establish to the referenced backend (per xRoute). | |
// The maximum number of connections that Envoy will establish to the referenced backend (per xRoute per rule). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or can be rephrased to ... to the referenced backend defined within a xRoute rule
api/v1alpha1/circuitbreaker_types.go
Outdated
// +optional | ||
MaxConnections *uint32 `json:"maxConnections,omitempty"` | ||
|
||
// The maximum number of pending requests that Envoy will queue to the referenced backend (per xRoute). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// The maximum number of pending requests that Envoy will queue to the referenced backend (per xRoute). | |
// The maximum number of pending requests that Envoy will queue to the referenced backend (per xRoute per rule). |
api/v1alpha1/circuitbreaker_types.go
Outdated
// +optional | ||
MaxRequests *uint32 `json:"maxParallelRequests,omitempty"` | ||
|
||
// The maximum number of parallel retries that Envoy will allow to the referenced backend (per xRoute). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vote to rm this for now, raise a issue to track max parallel retries, and once the retry API is complete, we can revisit this field and decide on the right home for this
Signed-off-by: Guy Daich <guy.daich@sap.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
vote to rm this for now, raise a issue to track max parallel retries, and once the retry API is complete, we can revisit this field and decide on the right home for this
Raised an issue to make sure we can track this: #2322
API: Circuit Breakers
Relates to #2125, and based on #1821
Overview
The Backend Traffic Policy proposes support for Envoy's Circuit Breaker configuration. Circuit breakers define distributed limits on the volume of requests and connections from Envoy to hosts in a Cluster, with the intention of applying back-pressure when the upstream fails or degrades.
Envoy Proxy enables Circuit Breakers by default. Circuit Breakers cannot be completely disabled, but threshold values can be set to
MaxUint32
. The default threshold value in Envoy (1024) is not appropriate for high-throughput environments and can cause unexpected failures for users who are not aware of the default configuration.Goals
Design Decisions
BackendTrafficPolicy
resource is not attached toxRoute
/Gateway
or does not specify threshold values, Envoy Gateway can use one of the following defaults:MaxUint32
.Thresholds
struct that translates to theDEFAULT
priority is allowed.BackendRefs
to Envoy clusters. As a result, the same upstream backend can be represented by multiple Envoy clusters.Gateway
andxRoute
resources at this time.API Example