Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: BackendTrafficPolicy #1821

Closed
Alice-Lilith opened this issue Aug 23, 2023 · 19 comments
Closed

proposal: BackendTrafficPolicy #1821

Alice-Lilith opened this issue Aug 23, 2023 · 19 comments
Assignees
Labels
area/api API-related issues area/policy kind/enhancement New feature or request kind/question Further information is requested no stalebot
Milestone

Comments

@Alice-Lilith
Copy link
Member

Alice-Lilith commented Aug 23, 2023

Relates to

#1492
#1821
#1845

What is this?

BackendTrafficPolicy is a proposal for a new policy attachment resource that can be applied to Gateways and xRoute resources. It is meant to supplement a lot of config that Gateway API currently lacks while also providing a way to specify global defaults. It is focused on configuring the behaviour of traffic between EnvoyProxy and the backend service.

What if Gateway API implements a GEP that delivers the same functionality as some of the fields of this resource?

If Gateway improves any of their existing resources to deliver functionality that meets all of the needs of any of the below config,
then we will deprecate the field in this resource and use the Gateway API config instead. Ideally, a lot of this config can be
upstreamed into Gateway API in some form eventually since this resource only exists to solve areas where Gateway API is lacking for the needs of Envoy Gateway.

What about traffic between EnvoyProxy and the downstream clients?

I will be opening up a sister issue proposing an InboundTrafficPolicy (or similar) resource here shortly that will focus on this use case.

What happens when I attach it to a Gateway resource?

If you apply this resource to a Gateway resource, any config that applies to the "listener level" will be configured there.
A lot of the config is more "route level," and so for those fields, they may not do anything if all you have is a Gateway, but they will be used as the default configuration for all children xRoute resources of the Gateway this resource is attached to.

What happens when I attach it to an xRoute resource (HTTPRoute, GRPCRoute, etc.)

If you apply this config to an xRoute resource, it will apply route-level config to this specific route.
If you have an BackendTrafficPolicy attached to both a Gateway and an xRoute resource, then config from the one
attached to the xRoute will win in a conflict. The "merging" in this case, will be primitive. If you have any sort of
config for a top-level field such as circuitBreakers, then the whole field will be replaced rather than comparing all of the subfields
and merging/patching them.

Can you use multiple BackendTrafficPolicy resources at the same time?

You may use multiple BackendTrafficPolicies that target different resources, but you may not attach multiple BackendTrafficPolicies
to the same resource. For example, it is invalid to attach two BackendTrafficPolicies to the same Gateway. Merging config in this scenario would quickly become convoluted and confusing, and there are many edge-case scenarios that make supporting this undesirable.

How will this be developed/implemented?

This issue serves as a proposal for the high-level design and fields of the API, but that does not mean that every field included below will be immediately available. My plan for the implementation is to add and implement one high-level field at a time until the whole resource is implemented.

API outline

---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: example-policy
spec:
  dns: # optional
    type: enum ("logical"/"strict") # optional, default=?
    respectDnsTtl: bool # optional
  protocols: # optional
    enableIPv4: bool # optional, default=true
    enableIPv6: bool # optional, default=false
    # Should we add something here for upgrades (websockets/spdy)?
  circuitBreakers: # optional
  - priority: Enum("default"/"high") # optional, default=default
    maxConnections: int # optional, default=1024
    maxPendingRequests: int # optional, default=1024
    maxParallelRequests: int # optional, default=1024
  keepAliveProbes: # optional
    idleTime: duration # optional, default=7200s
    interval: duration # optional, default=75s
    maxProbes: int # optional, default=9
  retryStrategy: # optional
    type: Enum (grpc/http) # required
    http: # required when type=http
      retryOn: Enum (5xx/gateway-error/disconnect-reset/connect-fail/retriable-4xx/refused-stream/retriable-status-codes) # required
      retriableStatusCodes: []int # required when retryOn=retriable-status-codes
    grpc: # required when type=grpc
      retryOn: Enum (cancelled/deadline-exceeded/internal/resource-exhausted/unavailable)
    numRetries: int # optional default=1
    perRetry: # optional
      timeout: duration # optional
      idleTimeout: duration # optional
      backoff: # optional
        baseInterval: duration # required
        maxInterval: duration # optional
        # we can add rate limited based backoff config here if we want to
    retryLimits: # optional
      type: Enum (budget/static)
      static: # required when type=static
        maxParallel: int # optional, default=3
      retryBudget: # required when type=budget
        activeRequestPercent: int # required, min=1, max=100, default=20
        minConcurrent: int # optional, default=3
  # It may make more sense to pull active health checking out into a new CRD
  # to be used as an extensionRef filter. You could create HTTP health checks in all your services following
  # a standard path and parameters, but for gRPC checks at least, you would need to configure them on a per-service basis 
  healthChecking: # optional
    activeChecks: # optional
      - connectionTimeout: duration # optional
        healthyCheckInterval: duration # optional, default=5s
        unhealthyCheckInterval: duration, # optional, default=10s
        unhealthyThreshold: int # optional, default=3
        healthyThreshold: int # optional, default=1
        logging: # optional
          enabled: bool # optional, default=false
          alwaysLogFailures: bool # optional, default=false
        type: Enum (grpc/http)
        grpc: # required when type=grpc
          upstreamName: string # required
          authority: string # optional
        http: # required when type=http
          expectedStatuses: # required, minItems=1
          - min: int # min=100, max=599
            max: int # min=100, max=599
          path: string # optional
          hostname: string # optional
          addRequestHeaders: # optional
            set: # optional
            - name: string # required
              value: string # required
            add: # optional
            - name: string # required
              value: string # required
          removeRequestHeaders: []string
    # There are many more options for outlier detection in Envoy, but let's start small and expand if needed
    outlierDetection: # optional
      checkInterval: duration # required, default=10s
      consecutive5XX: # optional
        threshold: int # optional, default=5
      failurePercentage: # optional
        threshold: int # optional, min=0, max=100, default=85
        minimumRequests: int # optional, default=50
  # The following timeouts field will provide functionality in-addition to the Gateway API GEP for timeouts
  # and allow configuration of Timeouts on other xRoute resources other than only HTTPRoutes
  # https://gateway-api.sigs.k8s.io/geps/gep-1742/
  timeouts: # optional
    connectTimeout: duration # optional
    idleTimeout: duration # optional
    clusterIdleTimeout: duration # optional
    maxConnectionLifetime: duration # optional
  loadBalancing: # optional
    type: Enum(roundRobin,leastRequest,ringHash,maglev) # required
    sessionAffinity: # optional, applies only to RingHash and MagLev. roundRobin and leastRequest have no config.
      type: Enum(cookie,header,sourceIp) # required (note that there is no additional config for sourceIP)
      cookie: # optional
        name: string # required
        path: string # optional
        ttl: string # optional
      header: # optional
        name: string # required
  observability: # optional
    metrics: # optional
      clusterTag: string # optional
      statsName: string # optional
      statsPrefix: string # optional
    tracing:
      samplePercentage: int # (0-100) 
      addTracingTags: # optional
      - type: Enum(literal,requestHeader) # required
        name: string # required 
        literal: # required when type=literal
          value: string # required
        requestHeader: # required when type=requestHeader
          defaultValue: string # required
  matching: # optional
    ignoreAuthorityPort: bool # optional, default=false
    stripAuthorityPort: bool # optional, default=false
    ignorePathParameters: bool # optional, default=false
  integrations:
    addLinkerdHeaders: bool # optional, default=false
  targetRef: # required
    group: string
    kind: string # must be either Gateway or xRoute object reference
    name: string

Things that I considered but felt might make better xRoute externalRef filters

  • regex for headers/method/query params (something like RegexMatchFilter/RegexHeadersFilter)?
  • ip allow/deny per route
  • lua filters (I'm 50/50 on this one. It might be nice to support a global lua script rather than only allowing it per route)
  • grpc json transcoding

Note that I explicitly chose to use the above targets field instead of the Gateway API standard of targetRef/targetRefs.

Using targetRef limits the policy to being attached to one object and it very quickly becomes very annoying to have to create a new BackendTrafficPolicy for each resource you want to attach it to. A very common use case is creating one BackendTrafficPolicy and attaching it to several Gateways, HTTPRoutes, etc.

Looking at targetRefs, I feel like it also quickly becomes very annoying to have to manually specify each resource
you want the BackendTrafficPolicy to attach to. I can see many users wanting to create an BackendTrafficPolicy that attaches to all their HTTPRoutes but not GRPCRoutes (or some similar variation that enables efficient re-use of a single BackendTrafficPolicy). When you have hundreds or more xRoutes, this becomes entirely unmanageable with targetRefs because defining each one individually is a pain. The targets field I added functions as a compromise between targetRefs and a Kubernetes selector, but it likely needs further refinement. Definitely looking for feedback/input here.

Problem: some of this config does not make sense to configure at a route level or gateway level.

For example, it doesn't make any sense to configure gRPC health checks at the Gateway level since it requires knowing the service name, which will be different for each route. Likewise, it does not make sense to configure observability.metrics.statsPrefix at the route level since that is config that is for an entire listener. I think there are a couple of potential solutions here:

  1. Keep things similar to the way I have outlined them, and explain via comments and documentation what the expected behaviour is
    I don't like this solution but it is one way of handling it

Open Questions

  • problem: some of this config does not make sense to configure at a route level or gateway level.
    For example, it doesn't make any sense to configure gRPC health checks at the Gateway level since it requires knowing the service name, which will be different for each route. Likewise, it does not make sense to configure observability.metrics.statsPrefix at the route level since that is config that is for an entire listener. I think there are a couple of potential solutions here:
    • idea 1: Keep things similar to the way I have outlined them, and explain via comments and documentation what the expected behaviour is
      • I don't like this solution but it is one way of handling it
  • This resource can be attached to a Gateway, but you can configure multiple listeners on a single Gateway. Realistically, this resource makes the most sense targeting listeners instead of forcing this to apply to the entire gateway. There should be a way to specify which listeners it applies to so that you can have different config for each listener.

Edits

  1. cors and bypassAuth configurations moved to the proposal for AuthPolicy resource.
  2. initial design for targets switched to a single targetRef for now, but I do think we must solve for the use-case of applying an BackendTrafficPolicy to multiple resources in a flexible way for the resource to be truly useful.
  3. Renamed UpstreamTrafficPolicy to BackendTrafficPolicy to reduce potential confusion about the traffic that this policy affects.
@Alice-Lilith Alice-Lilith added kind/enhancement New feature or request kind/question Further information is requested area/api API-related issues no stalebot labels Aug 23, 2023
@Alice-Lilith Alice-Lilith self-assigned this Aug 23, 2023
@Alice-Lilith
Copy link
Member Author

CC @envoyproxy/gateway-maintainers, @LanceEa @haq204 @kflynn for feedback and input on the initial design and the open questions I left.

@arkodg
Copy link
Contributor

arkodg commented Aug 23, 2023

relates to #1492

@arkodg
Copy link
Contributor

arkodg commented Aug 23, 2023

thanks for raising this issue @AliceProxy, adding follow up questions about the higher level API (UpstreamTrafficPolicy) that im hoping we can get clarity on (and not the specific fields defined in the example) before proceeding

  1. looks like if Policy A is applied to Gateway and Policy B is applied to xRoute linked to the same Gateway, then Policy B wins i.e. all fields within policy B win. This may not be ideal for some situations where the platform admin creating Policy A may want to enforce some fields (like authentication) or may want to
    default some fields (like loadBalancing). For such cases having an explicit defaults and overrides field as outlined in the Inherited Policy Attachment https://gateway-api.sigs.k8s.io/geps/gep-713/#inherited-policy-attachment-its-all-about-the-defaults-and-overrides does make sense here
  2. Allowing the Policy to attach to many resources using targets makes debugging harder and also computing Status, prefer if we started with a single targetRef and got this right, before supporting multiple. Also it might make sense to wait for upstream guidance on this, here's the open discussion Why does the direct policy attachment only attch one object? kubernetes-sigs/gateway-api#2300
  3. Around Problem: some of this config does not make sense to configure at a route level or gateway level.
    Imo such config should live inside the EnvoyProxy API, we already have fields here that affect the listener such as tracing and access logs, and once upstream adds support for a parametersRef like field at the Gateway level, we should be able to apply this config per Gateway. GatewayTrafficPolicy also makes sense, but we'd have to deal with default/overrides b/w EnvoyProxy at two levels (GatewayClass, Gateway) and GatewayTraficPolicy 😆

@Alice-Lilith
Copy link
Member Author

Alice-Lilith commented Aug 23, 2023

@arkodg

looks like if Policy A is applied to Gateway and Policy B is applied to xRoute linked to the same Gateway, then Policy B wins i.e. all fields within policy B win.

In this scenario, policy B (attached to the xRoute) would win only for fields that they both configure. IE: if policy A configured circuitBreakers and timeouts and policy B configured circuitBreakers then policy B would win and use it's own config for circuitBreakers while still using the default timeouts config from policy A since it did not configure that field.

This may not be ideal for some situations where the platform admin creating Policy A may want to enforce some fields (like authentication) or may want to default some fields (like loadBalancing).

Using defaults and overrides still has this same problem though where if you set policy A to have overrides for circuitBreakers and policy B has overrides for circuitBreakers then policy B is still overrides policy A. My reasoning for not adding overrides/defaults is also that Emissary has a similar system to what I supplied above where you have a Module resource that can set defaults, then each individual Mapping(HTTPRoute analogue) has the ability to configure the same fields. If the Mapping and Module configure the same field, then the Mapping overrides the Module config. With Emissary, the only times I can recall users asking for a similar defaults/overrides system is with Authentication. I'm not super opposed to it, I just didn't think there are enough real use-cases to warrant the complexity increase, but that's just one person's opinion.

Allowing the Policy to attach to many resources using targets makes debugging harder and also computing Status, prefer if we started with a single targetRef and got this right, before supporting multiple.

Yep it definitely would make debugging a bit harder, but I think there are ways to solve for this. For example, we could add a new functionality to egctl to target a policy and have it print out all the resources it is successfully and unsuccessfully attached to. It's a larger symptom of the fact that this is solving problems in resources (Gateway API) that we don't control, so any tradeoff we make here is going to have some non-negligible downsides. I see the pattern of wanting to use one resource to apply common config to many other resources used/requested a lot since like I mentioned, creating tons of these with the same config is going to be a nightmare for anyone actually using this in prod. Plus we would have to consider that having it attach to only one resource and then updating that later to attach to several would be a breaking change. If that change happened before GA I don't think it would be a big deal, but any time after GA I think would cause a lot of friction for users and require a major version bump. I think the complexity here would largely be on us developers of EG, but the end users don't really care about our workflow and mostly just want a user experience that has the least amount of friction.

Around Problem: some of this config does not make sense to configure at a route level or gateway level.
Imo such config should live inside the EnvoyProxy API, we already have fields here that affect the listener such as tracing and access logs, and once upstream adds support for a parametersRef like field at the Gateway level, we should be able to apply this config per Gateway. GatewayTrafficPolicy also makes sense, but we'd have to deal with default/overrides b/w EnvoyProxy at two levels (GatewayClass, Gateway) and GatewayTraficPolicy 😆

Great idea, lets go ahead and do that, then shall we plan to keep the other resource as DownstreamTrafficPolicy unless something else comes up?

I think for points 1 and 2 let's try to get some more folks to weigh in and then go with whatever seems to resonate with more people for each one. Or maybe someone else will have additional ideas.

@arkodg
Copy link
Contributor

arkodg commented Aug 24, 2023

  • yeah the major reason to add overrides is to enforce fields like authentication by a more privileged user, in this case, the platform admin who can configure policies and attach them to the Gateway

  • I still like DownstreamTrafficPolicy since it might help us define fine grained features per Gateway listener such as

    • Proxy Protocol
    • Authentication (TLS)
    • HTTP3
    • Keep Alive

@arkodg
Copy link
Contributor

arkodg commented Aug 25, 2023

@zirain @AliceProxy and I discussed this issue in the community meeting yesterday and managed to build some consensus on some of the above points

  • overrides and defaults fields within the API might be cumbersome for the user, but to achieve the goal of being able to enforce/override some fields by a privileged user, such as the platform admin who configures the Gateway resource, mostly tied to security , it makes sense to separate those fields into a separate Security / APIThatEnforcesFields API that can be applied at Gateway and HTTPRoute
  • maybe easier to start off with supporting a single targetRef before dealing with the ability to attach to multiple resources

This was referenced Aug 29, 2023
@Alice-Lilith Alice-Lilith changed the title proposal: UpstreamTrafficPolicy proposal: BackendTrafficPolicy Sep 14, 2023
@arkodg
Copy link
Contributor

arkodg commented Sep 14, 2023

thanks for updating the API name, was chatting with @ZackButcher who mentioned that UpstreamTrafficPolicy is too technical for the end user, and recommended we use OutboudTrafficPolicy or BackendTrafficPolicy .

@kflynn
Copy link
Contributor

kflynn commented Sep 14, 2023

I was thinking about this too -- I kind of like ClientTrafficPolicy and BackendTrafficPolicy.

@arkodg
Copy link
Contributor

arkodg commented Sep 14, 2023

ClientTrafficPolicy may introduce the confusion of external client vs internal client :)
similarly OutboundTrafficPolicy may introduce confusion of outbound from backend (response/egress) vs outbound from gateway

@kflynn
Copy link
Contributor

kflynn commented Sep 15, 2023

All the choices have some ambiguity (remember that Envoy's use of "upstream" and "downstream" is the reverse of common usage in some other places). Client and Backend are probably distinct enough, but let's see what others think.

@arkodg
Copy link
Contributor

arkodg commented Sep 15, 2023

Oh @kflynn are you recommending replacing DownstreamTrafficPolicy/InboundTrafficPolicy with ClientTrafficPolicy ? Id be a +1 for that

@zhaohuabing
Copy link
Member

zhaohuabing commented Sep 16, 2023

I like Client/Backend because the names speak from the Gateway's viewpoint, and there should be no confusion about it.

@kflynn
Copy link
Contributor

kflynn commented Sep 16, 2023

RIght, @arkodg – use ClientTrafficPolicy instead of DownstreamTrafficPolicy and BackendTrafficPolicy instead of UpstreamTrafficPolicy.

@kflynn
Copy link
Contributor

kflynn commented Sep 16, 2023

I'd also like to note that I am still strongly opposed to having defaults and overrides stanzas.

In real-world usage with Emissary, having the ability to set defaults has repeatedly proven critical. Attaching a TrafficPolicy to a Gateway meets this need, and allows overriding it at the Route level by attaching exactly the same kind of resource to the Route: there's no need to put everything under a defaults stanza, because the user will very easily understand the hierarchical relationship based on where a TrafficPolicy is attached.

(In fact, requiring them to state defaults is harder to understand. It makes a certain about of sense when the TrafficPolicy is attached to the Gateway, but when attached to the Route it implies that you're only setting the default behavior and that there's yet a different way to set the actual behavior. Just ditch it, and let the point of attachment define the hierarchy.)

I have a harder time coming up with real-world use cases for overrides: for the most part, it simply doesn't make sense to say things like "we're going to set a backend timeout of 3s and you cannot change that". Given that we're still quite far from a GA release, I would strongly prefer to not define this kind of override mechanism until we find an actual real-world use case asking for it.

@arkodg
Copy link
Contributor

arkodg commented Sep 16, 2023

I'd also like to note that I am still strongly opposed to having defaults and overrides stanzas.

In real-world usage with Emissary, having the ability to set defaults has repeatedly proven critical. Attaching a TrafficPolicy to a Gateway meets this need, and allows overriding it at the Route level by attaching exactly the same kind of resource to the Route: there's no need to put everything under a defaults stanza, because the user will very easily understand the hierarchical relationship based on where a TrafficPolicy is attached.

(In fact, requiring them to state defaults is harder to understand. It makes a certain about of sense when the TrafficPolicy is attached to the Gateway, but when attached to the Route it implies that you're only setting the default behavior and that there's yet a different way to set the actual behavior. Just ditch it, and let the point of attachment define the hierarchy.)

I have a harder time coming up with real-world use cases for overrides: for the most part, it simply doesn't make sense to say things like "we're going to set a backend timeout of 3s and you cannot change that". Given that we're still quite far from a GA release, I would strongly prefer to not define this kind of override mechanism until we find an actual real-world use case asking for it.

I agree with you here, this decision will make us deviate from the upstream recommendation which was my only hesitation, but there isn't anyone who has voiced a strong need for explicit overrides and defaults so its fine to start off with implicit overrides

@zhaohuabing
Copy link
Member

zhaohuabing commented Sep 18, 2023

IMO, adding the overrides and defaults parts in TrafficPolicy/SecurityPolicy makes them very difficult to understand. I initially understood how they work when I read through the corresponding section of the Gateway API , but I found that I completely forgot about it after a day or two. :-) The main issue is that EG users will need to put in extra effort to figure out the combined effective policy, as it's not as straightforward as it could be.

@arkodg
Copy link
Contributor

arkodg commented Sep 19, 2023

top features I would like in this API

we shouldnt express timeouts or retries in here since this work has already started upstream and is partially ready (timeouts & perTryTimeouts will be available soon)

@kflynn
Copy link
Contributor

kflynn commented Sep 21, 2023

I think we should go ahead and define retries in the BackendTrafficPolicy for now, since we're not sure yet that that can make it into Gateway API 1.0 -- just let's bear in mind that there's a parallel effort there and we may need to adapt to the upstream version when it happens.

@kflynn
Copy link
Contributor

kflynn commented Sep 21, 2023

Outlier detection and fault injection are definitely very interesting – worth noting that outlier detection has some interaction with the load balancing policy as well, but both would be great to have and I think both belong here.

@arkodg arkodg added this to the 0.6.0-rc1 milestone Sep 21, 2023
@arkodg arkodg self-assigned this Oct 23, 2023
arkodg added a commit to arkodg/gateway that referenced this issue Oct 24, 2023
Relates to envoyproxy#1821 & envoyproxy#2006

Signed-off-by: Arko Dasgupta <arko@tetrate.io>
arkodg added a commit to arkodg/gateway that referenced this issue Oct 25, 2023
Relates to envoyproxy#1821 & envoyproxy#2006

Signed-off-by: Arko Dasgupta <arko@tetrate.io>
arkodg added a commit that referenced this issue Oct 25, 2023
* Add rateLimit to BackendTrafficPolicy

Relates to #1821 & #2006

Signed-off-by: Arko Dasgupta <arko@tetrate.io>

* tests

Signed-off-by: Arko Dasgupta <arko@tetrate.io>

* lint

Signed-off-by: Arko Dasgupta <arko@tetrate.io>

* fix build

Signed-off-by: Arko Dasgupta <arko@tetrate.io>

---------

Signed-off-by: Arko Dasgupta <arko@tetrate.io>
@arkodg arkodg closed this as completed Oct 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/api API-related issues area/policy kind/enhancement New feature or request kind/question Further information is requested no stalebot
Projects
None yet
Development

No branches or pull requests

5 participants