design: add rate limiting design doc #3178
Conversation
Force-pushed from f96488d to 4c1eb82.
Codecov Report
```
@@           Coverage Diff            @@
##            main    #3178     +/-  ##
=========================================
- Coverage   75.34%   75.29%   -0.05%
=========================================
  Files          97       97
  Lines        6097     6097
=========================================
- Hits         4594     4591       -3
- Misses       1396     1399       +3
  Partials      107      107
```
It would be neat to have some concrete examples of HTTPProxy resources to get a better idea of what it might look like. For instance, limiting on a key, limiting on a header, limiting on remote IP, etc.
Also, how are the other fields in the spec configured (https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/http/ratelimit/v3/rate_limit.proto)?
Can we rate limit on gRPC services?
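For illustration, here is a hypothetical sketch of the kinds of HTTPProxy examples being asked for - limiting on a generic key, on a request header, and on the remote IP. The `rateLimitPolicy` and descriptor field names are assumptions for discussion, not taken from the design doc:

```yaml
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: ratelimited-example
spec:
  virtualhost:
    fqdn: example.com
    rateLimitPolicy:          # hypothetical field name
      global:
        descriptors:
          # descriptor with a fixed key/value pair sent to the RLS
          - entries:
              - genericKey:
                  value: vhost-example
          # descriptor populated from a request header
          - entries:
              - requestHeader:
                  headerName: X-User-ID
                  descriptorKey: user-id
          # descriptor populated from the client's remote address
          - entries:
              - remoteAddress: {}
  routes:
    - services:
        - name: backend
          port: 80
```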
design/ratelimit-design-new.md (outdated excerpt):
```
  response: 50ms
```
Second, a new block will be added to the Contour config file to identify the RLS extension service and provide some additional parameters:
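A hypothetical sketch of what that config file block might look like; the key names (`rateLimitService`, `extensionService`, `domain`, `failOpen`) are assumptions, not quotes from the doc:

```yaml
# Contour config file (sketch only; field names are illustrative)
rateLimitService:
  extensionService: projectcontour/ratelimit   # namespace/name of the ExtensionService
  domain: contour                               # rate limit domain sent to the RLS
  failOpen: false                               # allow requests through if the RLS is unavailable?
```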
- This assumes a single cluster will only have one RLS? (Maybe good to call out in the goals.) I'm thinking possibly one team may want their own implementation of a global RLS, different from another team's.
- It's unfortunate to need another config place. It seems like the ExtensionService should be where you define the domain, denyOnFailure, etc. Having a `RateLimitExtensionService` and an `AuthorizationExtensionService` would solve this; otherwise we've got to have the extra setup in the config file, or tons of validation in the ExtensionService.

2b. What is the ExtensionService buying us here? I think the original intent was to allow a place to add status conditions, but I'm still unclear as to how that happens.
> - This assumes a single cluster will only have one RLS? (Maybe good to call out in the goals.) I'm thinking possibly one team may want their own implementation of a global RLS, different from another team's.

The challenge is that (AFAICT) you can only have one RLS per HTTP connection manager, and all non-TLS virtual hosts share the same HTTP connection manager. External auth ran into the same constraint, and opted to support external auth only for TLS vhosts, so that you could have a separate auth service per vhost. We could make that same choice for rate limiting - allow global rate limiting only for TLS vhosts - but that seems a little strange to me. Definitely worth some discussion.

> - It's unfortunate to need another config place. It seems like the ExtensionService should be where you define the domain, denyOnFailure, etc. Having a `RateLimitExtensionService` and an `AuthorizationExtensionService` would solve this; otherwise we've got to have the extra setup in the config file, or tons of validation in the ExtensionService.

Yeah - if we could configure an RLS per vhost, then these things would go into a struct as part of HTTPProxy, like AuthorizationServer today. See previous comment, though.

> 2b. What is the ExtensionService buying us here? I think the original intent was to allow a place to add status conditions, but I'm still unclear as to how that happens.

Basically, it gets the programming of the Envoy cluster for the RLS for free. Other than that, there's not currently any infrastructure for doing health checks, etc.
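Purely to illustrate the per-vhost alternative being discussed (this is not what the doc currently proposes), a per-vhost RLS reference might mirror how `AuthorizationServer` is attached to a TLS vhost today; all field names below are hypothetical:

```yaml
# Hypothetical per-vhost wiring, analogous to authorization today
spec:
  virtualhost:
    fqdn: example.com
    tls:
      secretName: example-tls
    rateLimitService:              # hypothetical field
      extensionRef:
        name: ratelimit
        namespace: projectcontour
      domain: contour
      failOpen: false
```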
> The challenge is that (AFAICT) you can only have one RLS per HTTP connection manager

That makes sense; we should just call it out in the doc so that it's clear to readers. I don't think we should add a requirement that only TLS'd vhosts will get this service. For auth it makes sense to keep the security boundaries tight, but for rate limiting I don't think it's a big deal, other than possibly someone could fake the requests to get around rate limiting?

> Yeah - if we could configure an RLS per vhost, then these things would go into a struct as part of HTTPProxy, like AuthorizationServer today. See previous comment, though.

You could move some of the configuration items to the vhost, like domain, etc. That also might be a place to have global rate limiting configuration per vhost.
I'm increasingly thinking that we should bring some per-ExtensionService properties into the ExtensionService, like we already do with timeouts and load balancing policy. However, each type of ExtensionService almost always has extra parameters that belong to the service (in this case, they're currently listed as being in the configuration file; in the tracing case, they would probably need to be part of the vhost configuration currently), and each type of thing has a different set of things that need to be configured.
In this case, it's these two fields here: `domain` and `failOpen`, and when we do tracing (i.e. #399) there will be other service-specific things that will be the same for every reference of that service. If we moved to doing this, the config file would just be for the ExtensionServiceRef rather than for config details, which would allow those config details to be flipped without Contour redeploys. Here I'm thinking of flipping between `failOpen` states - that seems like a change that you would want to be able to back out quickly in the event of a problem.
If we could add some sort of (hah) extension point into ExtensionService that would allow you to supply a set of fields or something, we could move this config into the relevant ExtensionService (where I feel like it kind of belongs, since it's really a property of that service).
It does seem like the second time we use ExtensionService is a good time to think about its general properties. However, I'd be okay with putting this as an open question, to be determined during implementation, because it requires some experimentation.
> If we could add some sort of (hah) extension point into ExtensionService that would allow you to supply a set of fields or something, we could move this config into the relevant ExtensionService (where I feel like it kind of belongs, since it's really a property of that service).

This was my original suggestion when first building out ExtensionService: don't make it generic, but make it specific to the need you are configuring. So you'd have an `ExtensionServiceAuth`, an `ExtensionServiceRateLimit`, an `ExtensionServiceOauth2`, etc. This way you can be clear about the fields that need to be configured and not rely on conditions to say, "actually user, that field isn't meant for this context, so don't set it".
There's a trade-off in complexity here: the overhead of having more CRDs vs. a nicer user experience.
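As a sketch of the type-specific CRD idea above (the kind and fields are hypothetical, not an actual Contour API), a `RateLimitExtensionService` might carry both the generic service wiring and the rate-limit-specific knobs:

```yaml
apiVersion: projectcontour.io/v1alpha1        # illustrative only
kind: RateLimitExtensionService               # hypothetical kind
metadata:
  name: ratelimit
  namespace: projectcontour
spec:
  services:
    - name: ratelimit
      port: 8081
  timeoutPolicy:
    response: 50ms
  # rate-limit-specific fields live on the object itself,
  # so they can be changed without touching the Contour config file
  domain: contour
  failOpen: false
```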
Overall, the design looks great.
I think the big question that's outstanding is about ExtensionService, and what it is as we start using it in more places. I feel like it's a good place for two things:
- Status information about the configured service (that is, that it's Accepted/configured correctly at least, and one day, maybe, if it's Ready).
- Service-specific configuration, like the `failOpen` policy - this allows it to be quickly flipped (at the speed of changing a Kube object rather than the speed of redeploying Contour).
I said this in one of the other comments, but I think it's important enough to repeat - I think that the second use of ExtensionService is the right time to think about what a more general ExtensionService is and does.
design/ratelimit-design-new.md (outdated excerpt):
- The external RLS returns a response indicating whether the client request should be rate-limited or not.
- If the request should be rate limited, a `429 Too Many Requests` response is returned to the client along with the `x-envoy-ratelimited` header.
- If the request should not be rate limited, it is routed to the appropriate upstream cluster and proceeds as normal.
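For illustration, a rate-limited request as described in the excerpt would come back to the client roughly like this (the `true` header value follows Envoy's convention and is an assumption here):

```
HTTP/1.1 429 Too Many Requests
x-envoy-ratelimited: true
```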
I'd like to see some thoughts about what and how we communicate to the different personas, both cluster admin and application owner.
In the case of the application owner, it would be nice to be able to tell if I've hit the rate limit, or at least that the rate limit is configured and working correctly for my service.
In the case of the cluster operator, it seems like it's useful to be able to find the services that are consuming the rate limit, ideally with some information about the passed descriptors.
Basically, where is the status surfaced, and what can we make available? I suspect that getting information out of the RLS is not possible currently, but at least specifying what an HTTPProxy user should expect to see if there is a rate limit seems useful.
In general, I'd like to see a little more info about failure cases. If the RLS fails, how do we know? Does Contour know at all?
It's okay to put those as open questions if we have to, but I feel like an extension design isn't complete without some detail about the UX of consuming it.
We probably will need to see if there's some metrics you can pull from Envoy about this, and include some information about that in the guide page as well.
> In this case, it's these two fields here: `domain` and `failOpen`, and when we do tracing (i.e. #399) there will be other service-specific things that will be the same for every reference of that service.
Without checking how Envoy configures this, I'd suggest that these are properties of the vhost, not properties of the rate limiting service. Just like auth, there is probably some transition where vhosts move from their own rate limiting to the platform one and would twiddle these kinds of knobs.
Fair point, I should probably have checked. I agree that for things that are per-vhost, the config should live there, but for things that are properties of the service itself, they should end up on the ExtensionService object.
> In general, I'd like to see a little more info about failure cases. If the RLS fails, how do we know? Does Contour know at all?

I'm still confused about how we push status info into the ExtensionService when the external service is a black box (this applies to auth as well). Is the operator implementing the external service supposed to write status back to the ExtensionService?
I see two stats for rate limiting that might be useful for understanding whether the rate limit was hit; otherwise you might be able to scrape the metrics for specific HTTP status codes:

| Stat | Type | Description |
| --- | --- | --- |
| `upstream_rq_retry_backoff_exponential` | Counter | Total retries using the exponential backoff strategy |
| `upstream_rq_retry_backoff_ratelimited` | Counter | Total retries using the ratelimited backoff strategy |
> In this case, it's these two fields here: `domain` and `failOpen`, and when we do tracing (i.e. #399) there will be other service-specific things that will be the same for every reference of that service.

> Without checking how Envoy configures this, I'd suggest that these are properties of the vhost, not properties of the rate limiting service. Just like auth, there is probably some transition where vhosts move from their own rate limiting to the platform one and would twiddle these kinds of knobs.
The challenge here is that, if we want to support rate limiting for non-TLS vhosts, then the filter can't be per-vhost, since all non-TLS vhosts share the same HTTP connection manager + filter. That's why it's not at the vhost level.
If we decide not to support non-TLS vhosts, or can come up with some alternate way of getting around the single HCM/filter constraint, then these params could move under the vhost. With external auth, we decided to only support TLS vhosts for this reason, though since it was an auth thing, that made more sense.
I already looked into specifying a per-vhost filter config, but the global rate limiting filter doesn't seem to support that. I'm open to ideas here!
As far as status, metrics are emitted by Envoy with rate limit info:
(Global) https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/rate_limit_filter#statistics
(Local) https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/local_rate_limit_filter#statistics
For global, looking at those should give a picture of whether the service is configured correctly, as well as whether requests are being limited. Do we want Contour to be scraping these and using them to populate info in the API somewhere?
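For reference, the stat names in those docs look roughly like the following (worth re-verifying against the linked pages before relying on them):

```
# global rate limit filter, emitted per upstream cluster
cluster.<route target cluster>.ratelimit.ok
cluster.<route target cluster>.ratelimit.over_limit
cluster.<route target cluster>.ratelimit.error

# local rate limit filter, emitted per stat prefix
<stat_prefix>.http_local_rate_limit.enabled
<stat_prefix>.http_local_rate_limit.rate_limited
```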
Added some details on the stats that Envoy exposes. Given that we don't scrape Envoy metrics to pull into Contour anywhere else, I'm not going to propose doing that now; the user can pull those into a Grafana dashboard from Envoy directly to observe the runtime behavior. We can pursue this separately if we want, though we'd probably want to start with more fundamental metrics around traffic serving, rather than doing it just for rate limiting, IMHO.
Yep, we could add an opaque …
Marking this PR stale since there has been no activity for 14 days. It will be closed if there is no activity for another 90 days.
Force-pushed from 8404f4e to 95f241c.
Updated this based on yesterday's discussion, around not currently supporting an RLS per TLS vhost. Let me know if there's any further feedback on the overall design. I'd like to get local rate limiting in for 1.12, and global for 1.13. I also think there's room for further improvement after the initial MVP work, so happy to consider additional items subsequently if we agree they're priorities (metrics scraping, Lyft RLS operator, etc.).
This is a seriously excellent design document, can't wait to see the implementation. Nice work @skriss!
(one extremely small thing that is non-blocking).
Looks like the HTTP local rate limit filter just merged descriptor support into Envoy. Need to consider if/how this will fit into the Contour design.
Force-pushed from d9beb4d to 0003376.
Looks good!
Updates projectcontour#370. Signed-off-by: Steve Kriss <krisss@vmware.com>
Force-pushed from 0003376 to 909eb7a.
@stevesloka let me know if you've got anything else here.
Signed-off-by: Steve Kriss <krisss@vmware.com>
New arguments to Contour:
For global rate limiting, per Envoy's API, the `RateLimitPolicy` only defines a list of descriptors to send to an external RLS for a given request.
All [descriptor entries defined by Envoy](https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/route/v3/route_components.proto#config-route-v3-ratelimit-action) will be supported except `metadata` and `dynamic_metadata`.
nit: Maybe define what these all would be in the API? For example, `requestHeaders` can also have `skip_if_absent`, which isn't in the examples.
Also, I assume we're not going to support `SourceCluster` & `DestinationCluster`, which might be new since this design doc was drafted.
OK I'll spec out the details for all of them.
ok to merge as-is, can do this work during implementation phase.
Updates #370.