Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enable Contour to process RLS even if it's created after Contour startup #4155

Open
skriss opened this issue Nov 4, 2021 · 6 comments
Open
Labels
good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. Hacktoberfest Denotes an issue ready for any "Hacktoberfest" contributor. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@skriss
Copy link
Member

skriss commented Nov 4, 2021

It would be neat to allow Contour to start and pick up on the extension service if its created later on.

Originally posted by @stevesloka in #4153 (comment)

@skriss skriss added kind/feature Categorizes issue or PR as related to a new feature. lifecycle/needs-triage Indicates that an issue needs to be triaged by a project contributor. labels Nov 4, 2021
@skriss skriss added good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed lifecycle/needs-triage Indicates that an issue needs to be triaged by a project contributor. labels Oct 11, 2022
@pnbrown pnbrown added the Hacktoberfest Denotes an issue ready for any "Hacktoberfest" contributor. label Oct 11, 2022
@vishal-chdhry
Copy link
Member

Hi @skriss! I would love to work on this.
Can you please help me understand what I have to do.

@skriss
Copy link
Member Author

skriss commented Nov 2, 2022

Hey @vishal-chdhry - looking some more at this, I'm actually thinking we should leave this as-is / close this issue. While we could add logic to allow the rate limit service to be resolved as part of DAG builds, it's unclear what should be done if the service does not exist. Right now, Contour will crash if the configured rate limit service ref cannot be resolved at startup. If we moved to resolving that ref at DAG build time, then it would be harder to provide feedback to the user -- we wouldn't want to mark every HTTPProxy as "Invalid" since that would drop traffic, but just putting something in the log is not very visible either. Absent a bigger redesign here, I think the way we have it is probably the best approach for now.

cc @sunjayBhatia, any thoughts on this?

@sunjayBhatia
Copy link
Member

sunjayBhatia commented Nov 3, 2022

The big questions it seems are:

  • What do users see if the RLS configuration changes/becomes invalid?
  • How do we make configuration issues observable so they can be rectified?

Some different scenarios:

The ExtensionService resource becomes invalid (in the fields we validate at startup, which isn't everything) OR The ExtensionService is deleted
To realize these changes in Contour configuration, you need to actually restart Contour with a new ConfigMap to pick up the new ExtensionService version. So on purpose restarts/upgrades or general pod restarts in normal usage can trigger this. If the ExtensionService is invalid, Contour doesn't start and the operator has an action to perform, xDS configuration should stay static so should be no user impact.

The ExtensionService resource changes and is not invalid, just different OR The operator which ExtensionService resource is used for RLS
This could mean however that restarted instances of Contour serve different RLS configuration to different Envoys (if not part of an intentional update) which could be an impact to users.

This is a problem in general I'm realizing with named resources in a static config at startup (ConfigMap or CRD based) that aren't re-reconciled. The first case is especially problematic because there is even less intention here than the operator changing the ConfigMap or CRD to point to a different resource.

The ExtensionService resource becomes invalid in another way
There are quite a few fields that are validated in the ExtensionService processor that aren't at Contour startup. Currently if the ExtensionService for RLS becomes malformed during Contour runtime (or if was honestly invalid from the start), e.g. the backing Service no longer exists, status is set on it to indicate that and it is not added to the configured set of ExtensionService clusters. However, we never re-lookup the ExtensionService after startup. So we continue on when configuring xDS and think there is a gRPC Service for it in the RLS filter config, I believe pointing to a Cluster that doesn't exist.

From here if the RLS is unavailable Envoy will return a 500 response code, which is not bad, but does mean we effectively do drop traffic.

This is actually sort of the same for ExtAuth config, we don't check the Status of the ExtensionService referenced in an HTTPProxy but do check that an ExtensionCluster is in the DAG (so effectively check if the referenced ExtensionService is valid). If not there, we mark the HTTPProxy as Invalid and continue processing which also drops traffic, but has an additional place Status is kept to notify users.

@sunjayBhatia
Copy link
Member

sunjayBhatia commented Nov 3, 2022

So from the above:

A static startup config that references a k8s resource (that could change), is something to think about in general, but again requires a bigger redesign most likely

Given the RLS configuration is invalid, our current situation does have some observability for the operator and also does likely have a user impact. One could argue what we have is correct or in favor of instead just serving traffic as-is w/o trying to rate limit if config is invalid. Maybe that is a toggle option. Aside from that or a larger change, we could do some refinements during HTTPProxy processing to be more intentional about how things work when the RLS ExtensionService changes, but we haven't received any feedback (I can remember) from users that things are not working well so maybe no need to make changes.

@sunjayBhatia
Copy link
Member

I have thoughts on how we might do this better in a more Gateway API-centric world, but that is a separate conversation topic.

@Pawan-Bishnoi
Copy link

this is good to be closed in that case? 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. Hacktoberfest Denotes an issue ready for any "Hacktoberfest" contributor. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.
Projects
None yet
Development

No branches or pull requests

5 participants