API: Support Circuit Breakers in BackendTrafficPolicy #2284

Merged
merged 4 commits into from
Dec 19, 2023

Contributor

@guydc guydc commented Dec 9, 2023

API: Circuit Breakers

Relates to #2125, and based on #1821

Overview

This PR proposes adding support for Envoy's Circuit Breaker configuration to the Backend Traffic Policy. Circuit breakers define distributed limits on the volume of requests and connections from Envoy to hosts in a Cluster, with the intention of applying back-pressure when the upstream fails or degrades.

Envoy Proxy enables Circuit Breakers by default. Circuit Breakers cannot be completely disabled, but threshold values can be set to MaxUint32. The default threshold value in Envoy (1024) is not appropriate for high-throughput environments and can cause unexpected failures for users who are not aware of the default configuration.

Goals

  • Propose API for Circuit Breakers.
  • Define the default Circuit Breaker settings.
  • Allow users to customize commonly-used Circuit Breaker settings.

Design Decisions

  • Envoy Gateway enables Circuit Breakers by default.
    • If a BackendTrafficPolicy resource is not attached to xRoute/Gateway or does not specify threshold values, Envoy Gateway can use one of the following defaults:
      • Leave Circuit Breakers unset, effectively keeping the Envoy default threshold values.
      • Use opinionated values that are appropriate for typical Gateway scenarios.
      • "Disable" Circuit Breakers by setting the threshold values to MaxUint32.
    • Keeping the default Envoy values ensures greater resilience for most users. Advanced users with higher throughput requirements can adjust the settings.
    • Envoy metrics and access logs provide observability for Circuit Breaker overflow. Unexpected failures related to default circuit breaker overflow can be easily identified by Envoy Gateway users. Envoy Gateway troubleshooting documentation and Grafana dashboards can further assist in this area.
  • Advanced Circuit Breaking options are not supported, but the API definition will allow changes in the future.
    • Routing Priorities are currently not supported by Envoy Gateway. So, only a single Circuit Breaker Thresholds struct that translates to the DEFAULT priority is allowed.
    • Retry Budgets are discussed in #2168 ("api: support retry on in BackendTrafficPolicy") and are not included.
    • Per-Host Thresholds, Maximum Connection Pools and tracking of remaining resources are rarely supported by other projects. These settings are not included.
  • Per-Backend Circuit Breaker settings are not supported.
    • Some users may want and expect Circuit Breakers to apply per Backend, as mentioned here. Per-Backend Circuit Breakers are better synchronized (within the scope of a single Envoy process) and react to failures quickly and efficiently.
    • The Envoy Gateway System Design maps BackendRefs to Envoy clusters. As a result, the same upstream backend can be represented by multiple Envoy clusters.
    • Backend Traffic Policies can only be attached to Gateway and xRoute resources at this time.
    • Support for Per-Backend settings requires further community discussion on Backend translation and Policy Attachment and is considered out of scope for this PR.
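
The default-handling options in the first design decision can be sketched in Go. This is only an illustration under stated assumptions, not Envoy Gateway's translation code: the `Thresholds` type, the `resolve` helper and the `envoyDefault` constant are hypothetical names. The value 1024 is the Envoy default cited above, and `math.MaxUint32` is the value that effectively "disables" a breaker.

```go
package main

import (
	"fmt"
	"math"
)

// envoyDefault is the Envoy default threshold cited in this proposal.
const envoyDefault uint32 = 1024

// Thresholds is a hypothetical stand-in for the user-facing settings.
// A nil pointer means the user did not set the field.
type Thresholds struct {
	MaxConnections      *uint32
	MaxPendingRequests  *uint32
	MaxParallelRequests *uint32
}

// resolve returns the user's value if it is set, otherwise the fallback.
func resolve(v *uint32, fallback uint32) uint32 {
	if v == nil {
		return fallback
	}
	return *v
}

func main() {
	unlimited := uint32(math.MaxUint32) // effectively "disables" the breaker
	t := Thresholds{MaxConnections: &unlimited}

	fmt.Println(resolve(t.MaxConnections, envoyDefault))     // user-set value
	fmt.Println(resolve(t.MaxPendingRequests, envoyDefault)) // falls back to the Envoy default
}
```

Swapping `envoyDefault` for an opinionated value or for `math.MaxUint32` would model the other two options listed above.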

API Example

---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: example-policy-with-circuitBreakers
spec:
 [...]
  circuitBreakers: # optional
    thresholds: # optional
    - maxConnections: uint32 # optional, default=1024
      maxPendingRequests: uint32 # optional, default=1024
      maxParallelRequests: uint32 # optional, default=1024
      maxRetries: uint32 # optional, default=3
[...]    

Signed-off-by: Guy Daich <guy.daich@sap.com>
@guydc guydc requested a review from a team as a code owner December 9, 2023 00:36

codecov bot commented Dec 9, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (f71f372) 64.38% compared to head (17964fe) 64.41%.
Report is 12 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2284      +/-   ##
==========================================
+ Coverage   64.38%   64.41%   +0.03%     
==========================================
  Files         112      112              
  Lines       15874    15882       +8     
==========================================
+ Hits        10220    10230      +10     
+ Misses       5005     5004       -1     
+ Partials      649      648       -1     


// The maximum number of connections that Envoy will make to the referenced backend (per xRoute).
// Default: 1024
//
// +optional

Contributor

better go with CEL for all these fields, for example:

	// +kubebuilder:validation:Minimum=xxx
	// +kubebuilder:validation:Maximum=xxx
	// +kubebuilder:default=xxx

// If not set, circuit breakers will be enabled with the highest supported thresholds
//
// +optional
CircuitBreakers *CircuitBreakers `json:"circuitBreakers,omitempty"`

Why plural?

Contributor Author

It's the term used by Envoy, the Backend Traffic Policy and other OSS projects like Emissary Ingress. A single thresholds struct defines multiple breakers (requests, connections, etc.), so the plural form does sound appropriate.

Member

If we only need one Thresholds in the CircuitBreakers structure, the name should be singular.

Member

@Xunzhuo Xunzhuo left a comment

Thanks for working on it, please fix the CI lint errors first

Signed-off-by: Guy Daich <guy.daich@sap.com>
//
// +kubebuilder:validation:Minimum=0
// +kubebuilder:validation:Maximum=4294967295

Contributor

Why 4294967295?

Contributor

This is the maximum value of uint32, but I think this Maximum validation can be optional.

Contributor

Oh, my bad, I was thinking of 2147483647.

Contributor

can -1 pass if the type is uint32?

Contributor

no, it cannot

Contributor

Yes, that's what it means.

Contributor Author

Hi @zirain, @shawnh2. Do note that the OpenAPI spec (used by K8s CRDs) doesn't really support unsigned ints: https://swagger.io/specification/. The controller-gen tools actually produce a schema that refers to these fields as int32 in the generated CRD. The actual K8s API server behavior, from my limited check, is to treat these fields as int64. I think that the actual Go type (*uint32) mostly impacts the unmarshalling done by client-go. So, guaranteeing that the stored value is actually safe to cast to uint32 could be useful...

Contributor Author

@guydc guydc Dec 13, 2023

Another approach would be to use int64 explicitly in the Go types layer and have uint32 as the representation in the IR layer and downwards. The value range validation can occur either in the schema or during the IR translation. WDYT?
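
A rough sketch of what range validation during IR translation could look like, assuming the API layer carries `int64` and the IR carries `uint32`. The `toUint32` helper and its error wording are hypothetical and not part of this PR.

```go
package main

import (
	"fmt"
	"math"
)

// toUint32 converts an API-layer int64 to the uint32 used in the IR,
// rejecting values that fall outside the uint32 range.
func toUint32(field string, v int64) (uint32, error) {
	if v < 0 || v > math.MaxUint32 {
		return 0, fmt.Errorf("%s: %d is outside [0, %d]", field, v, uint32(math.MaxUint32))
	}
	return uint32(v), nil
}

func main() {
	// Two in-range values followed by two out-of-range values.
	for _, v := range []int64{1024, math.MaxUint32, -1, math.MaxUint32 + 1} {
		got, err := toUint32("maxConnections", v)
		fmt.Println(got, err)
	}
}
```

With schema-level Minimum/Maximum markers in place, a check like this would only act as a second line of defence for objects that bypass CRD validation.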

Contributor

Like the Gateway API project, let's use *int32 with min and max validation?

Contributor Author

Wouldn't it make more sense to use *int64? MaxUint32 > MaxInt32, so by using *int32 users would not be able to use the full value range provided by Envoy.

@tmsnan tmsnan assigned tmsnan and unassigned tmsnan Dec 13, 2023
@tmsnan tmsnan self-requested a review December 13, 2023 03:53
// +kubebuilder:validation:Maximum=4294967295
// +kubebuilder:default=3
// +optional
MaxRetries *uint32 `json:"maxRetries,omitempty"`

Contributor

MaxRetries is part of the retry behavior and would fit better in the retry configuration; the corresponding field there is maxParallel (#2168).

Member

@zhaohuabing zhaohuabing Dec 13, 2023

I think we're talking about two different things here: the MaxConcurrentRetries of a cluster and the MaxRetries of an individual request.

MaxConcurrentRetries belongs to the Circuit Breaker configuration, and MaxRetries belongs to the Retry configuration.
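
To make the distinction concrete, here is a toy Go sketch, not Envoy's implementation. A shared `retryBreaker` caps the retries in flight for a whole cluster, while `maxRetries` bounds the retries of a single request. All names are hypothetical, and the sketch simply fails fast on overflow, which is an assumption about behavior rather than a description of what Envoy does.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// retryBreaker caps the retries in flight across all requests to one
// cluster (the circuit breaker side).
type retryBreaker struct {
	limit  int64
	active int64
}

// acquire reserves a retry slot, or reports overflow if the cap is reached.
func (b *retryBreaker) acquire() bool {
	if atomic.AddInt64(&b.active, 1) > b.limit {
		atomic.AddInt64(&b.active, -1)
		return false
	}
	return true
}

// release frees a slot that acquire handed out.
func (b *retryBreaker) release() { atomic.AddInt64(&b.active, -1) }

var errOverflow = errors.New("retry circuit breaker overflow")

// doWithRetries retries one request at most maxRetries times (the retry
// policy side); each retry must first obtain a slot from the shared breaker.
func doWithRetries(b *retryBreaker, maxRetries int, call func() error) error {
	err := call()
	for i := 0; i < maxRetries && err != nil; i++ {
		if !b.acquire() {
			return errOverflow
		}
		err = call()
		b.release()
	}
	return err
}

func main() {
	b := &retryBreaker{limit: 3}
	attempts := 0
	err := doWithRetries(b, 2, func() error {
		attempts++
		return errors.New("upstream failure")
	})
	fmt.Println(attempts, err)
}
```

The breaker is shared by every caller of the same cluster, whereas `maxRetries` is passed per request, which is why the two settings can reasonably live in different parts of the API.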

Contributor Author

@guydc guydc Dec 13, 2023

I agree. However, note that #2168 deals with RetryBudget which is also a part of Envoy's Circuit Breaker settings.

In Envoy, the separation of route and cluster settings is pretty clear. Multiple routes can point to the same cluster, and so MaxRetries, RetryBudget will apply to all the traffic coming from routes that share an upstream cluster. The motivation is to protect the upstream system from a retry storm.

In Envoy Gateway, we have a cluster for each xRoute. So, it doesn't make much difference if these settings are managed under the Retries or the CircuitBreakers section. It is important that the users understand the implications of these settings - overflowing retries will be queued and later dropped.

If in the future Envoy Gateway does support a notion of shared backends (e.g. by translating services to clusters in some situations) and Envoy Gateway will support a SharedBackendTrafficPolicy, I expect that this policy will include CircuitBreakers but not Retries. So, for future reusability, it could be better to have these settings under the circuit breaker types.

Another aspect to consider is that these settings are scoped to a Routing Priority level. As long as only the Default level is supported, it doesn't really matter where we place these settings. However, if multiple priority levels are supported in the future, the RateLimitStrategy API will need to be extended to support multiple routing priorities, and the translation logic will need to carefully merge that with the other circuit breaker settings.

I'm willing to drop MaxRetries from this PR for now. We can continue the discussion in #2168 on the best location for these settings in the API and implement it as part of that PR. WDYT?

Member

@zhaohuabing zhaohuabing Dec 14, 2023

> In Envoy Gateway, we have a cluster for each xRoute. So, it doesn't make much difference if these settings are managed under the Retries or the CircuitBreakers section. It is important that the users understand the implications of these settings - overflowing retries will be queued and later dropped.

IMO, the concurrent max retries setting belongs to Circuit Breaker logic because it enforces back pressure on the clients. Therefore, EG probably should not mix it with the request retries configuration.

Use Istio as an example: Istio puts them into two places: the concurrent max retries setting in the DestinationRule and request retries in the VirtualService.

Contributor

@tmsnan tmsnan Dec 14, 2023

IMO, we should distinguish between the circuit_breaker and retryStrategy functions for users, as they offer distinct features, not limited to the design of Envoy. Even though retry budget and concurrent max retries are implemented in the circuit_breaker in Envoy, for users, these encompass retry functions that provide richer options for retry operations.

Regarding the shared cluster, it's an aspect that requires careful consideration. However, I'm currently uncertain about its usage. It might be an implementation similar to the Istio DestinationRule resource. If that's the case, one could patch the BackendTrafficPolicy to the DestinationRule (DR).

Member

> IMO, we should distinguish between the circuit_breaker and retryStrategy functions for users, as they offer distinct features, not limited to the design of Envoy. Even though retry budget and concurrent max retries are implemented in the circuit_breaker in Envoy, for users, these encompass retry functions that provide richer options for retry operations.

I vote -1 on this.

Even though both have "retries" in their name, they serve two different purposes. The concurrent max retries setting is inherently associated with the Circuit Breaker, which fails requests quickly when a lot of retries happen and applies back pressure on downstream. On the other hand, request retries are specifically designed to mitigate transient network issues. Would love more insights from @kflynn and other @envoyproxy/gateway-maintainers

Contributor Author

Since @arkodg #2284 (comment) and @tmsnan support removing MaxRetries from this API proposal, I'll go ahead and remove it. If we eventually decide that CircuitBreakers should contain these settings, we can add them later on.

//
// +kubebuilder:validation:MaxItems:=1
// +optional
Thresholds []Thresholds `json:"thresholds,omitempty"`

Member

As mentioned in the description of this PR, only one Thresholds is needed here because Envoy Gateway doesn't support Routing Priorities

Contributor Author

I kept this as a list for two reasons:

  • Future-proofing: if/when EG does support routing priorities, there would be no need to add another list. We can just add another optional priority field to Thresholds and cancel the length validation.
  • Compatibility with the original proposal by @AliceProxy in BTP

We can decide to keep things simpler for the common use case of tweaking default circuit breakers, and, in the future, allow a list for non-default priorities.

---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: example-policy-with-circuitbreakers
spec:
 [...]
  circuitBreakers: # optional
    maxConnections: uint32 # optional, default=1024
    maxPendingRequests: uint32 # optional, default=1024
    maxParallelRequests: uint32 # optional, default=1024
    maxRetries: uint32 # optional, default=3
    additionalThresholds: #optional [Future, not in this PR]
    - priority: [HIGH|...] 
      maxConnections: uint32 # optional, default=1024
      maxPendingRequests: uint32 # optional, default=1024
      maxParallelRequests: uint32 # optional, default=1024
      maxRetries: uint32 # optional, default=3
[...] 

If the maintainers support this, I don't have any objection. WDYT?

Member

@zhaohuabing zhaohuabing Dec 14, 2023

Works for me. A one-member list just looks weird to me. But would love to hear @AliceProxy's opinion on this.

Contributor

I vote to keep it singular for now

circuitBreaker:
  maxConnections: 
  maxPendingRequests: 
  maxParallelRequests: 

I don't see routing priority being added in the upstream Gateway API, and if it is, we can deprecate circuitBreaker in favor of circuitBreakers
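
The singular shape suggested above could map to Go types roughly as follows. This is only a sketch: the JSON keys follow the YAML in the comment, the `*uint32` field types follow the diff fragments in this review, and the `Spec` wrapper and `render` helper are hypothetical additions for the demonstration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CircuitBreaker sketches the singular form: one set of thresholds and no list.
type CircuitBreaker struct {
	MaxConnections      *uint32 `json:"maxConnections,omitempty"`
	MaxPendingRequests  *uint32 `json:"maxPendingRequests,omitempty"`
	MaxParallelRequests *uint32 `json:"maxParallelRequests,omitempty"`
}

// Spec is a hypothetical, trimmed-down policy spec holding only this field.
type Spec struct {
	CircuitBreaker *CircuitBreaker `json:"circuitBreaker,omitempty"`
}

// render marshals the spec to JSON, returning an empty string on failure.
func render(s Spec) string {
	out, err := json.Marshal(s)
	if err != nil {
		return ""
	}
	return string(out)
}

func main() {
	limit := uint32(2048)
	fmt.Println(render(Spec{CircuitBreaker: &CircuitBreaker{MaxConnections: &limit}}))
	fmt.Println(render(Spec{})) // unset: the whole block is omitted
}
```

Because every field is an optional pointer, a user can override a single threshold and leave the others at their defaults.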

@@ -65,6 +65,12 @@ type BackendTrafficPolicySpec struct {
//
// +optional
TCPKeepalive *TCPKeepalive `json:"tcpKeepalive,omitempty"`

// Circuit Breaker settings for the upstream connections and requests.
// If not set, circuit breakers will be enabled with the highest supported thresholds

Contributor

Suggested change
- // If not set, circuit breakers will be enabled with the highest supported thresholds
+ // If not set, circuit breakers will be enabled with default thresholds

}

type Thresholds struct {
// The maximum number of connections that Envoy will establish to the referenced backend (per xRoute).

Contributor

Suggested change
- // The maximum number of connections that Envoy will establish to the referenced backend (per xRoute).
+ // The maximum number of connections that Envoy will establish to the referenced backend (per xRoute per rule).

Contributor

or can be rephrased to ... to the referenced backend defined within a xRoute rule

// +optional
MaxConnections *uint32 `json:"maxConnections,omitempty"`

// The maximum number of pending requests that Envoy will queue to the referenced backend (per xRoute).

Contributor

Suggested change
- // The maximum number of pending requests that Envoy will queue to the referenced backend (per xRoute).
+ // The maximum number of pending requests that Envoy will queue to the referenced backend (per xRoute per rule).

// +optional
MaxRequests *uint32 `json:"maxParallelRequests,omitempty"`

// The maximum number of parallel retries that Envoy will allow to the referenced backend (per xRoute).

Contributor

Vote to remove this for now and raise an issue to track max parallel retries. Once the retry API is complete, we can revisit this field and decide on the right home for it.

Signed-off-by: Guy Daich <guy.daich@sap.com>
Contributor

@arkodg arkodg left a comment

LGTM thanks !

Member

@zhaohuabing zhaohuabing left a comment

LGTM, thanks!

> vote to rm this for now, raise an issue to track max parallel retries, and once the retry API is complete, we can revisit this field and decide on the right home for this

Raised an issue to make sure we can track this: #2322

@arkodg arkodg merged commit 64d7152 into envoyproxy:main Dec 19, 2023
20 checks passed
8 participants