-
Notifications
You must be signed in to change notification settings - Fork 472
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GEP: Conditions and Status update #1364
Comments
I was initially going to create a new issue for this, but will start here for now since I know this is meant to be a bit of an umbrella issue. Closely related to your comments about a "Ready" condition on HTTPRoute, I think we really need something to fill the gap in our current conformance tests here. Currently we're limited to an "Accepted" condition with a rather vague definition: gateway-api/apis/v1beta1/shared_types.go Lines 190 to 191 in eba0e7e
Unfortunately we're using that condition as a signal that HTTPRoutes are ready for conformance tests: gateway-api/conformance/utils/kubernetes/helpers.go Lines 167 to 169 in eba0e7e
That usually starts the 30 second timer for MakeRequestAndExpectEventuallyConsistentResponse. Unfortunately some implementations are going to have a longer gap than that between when they've "Accepted" a Route and when they've "Reconciled" it, unless we should be interpreting "Accepted" to also mean "Reconciled". It seems like we need to:
|
I'm glad you brought that one up Rob, as that's been one on my mind as well. Right now I would personally characterize Do you think it's on the table to increase the timeout to something that supports all known implementations, and simply defining it as "configured in dataplane"? |
I think we probably need some change like this for conformance. The amount of time it takes for these changes to propagate is wildly dependent on what's being configured. For example, if we're rolling out changes globally, it may take longer for them to reach certain regions. We already allow these timeout to be configurable, maybe we just need to remove or increase the max value that we consider conformant here:
|
I have a somewhat-related concern, based on past history with different envoy-based implementations. In some implementations, adding a new percentage rule with a new backend pool to an existing HTTPRoute can cause traffic to be dropped with 500s (no healthy upstream). The root cause is that the new route definitions arrive at some envoy pods before the cluster definitions do, and when traffic arrives at that route, there's no corresponding backend programmed. It would be nice to have a set of rules as to when / how to avoid this scenario (dropping traffic on an already-programmed route when making changes). Per-implementation workarounds are going to be less fun in a Gateway world where e.g. Knative wants to be aware of a single abstraction, rather than inventorying all known implementations. I mention this in the context of conditions as it seems like the might be Conditions involved in the solution to "don't drop traffic when adding backends under load". |
Would a I think the question is whether implementations would know that timing duration themselves (i.e. it might differ depending on the number and location of nodes). |
Okay, I've spent a whole lot of time reading through issues, code, and docs, and here are my current suggestions. I've bucketed them into more general things and specific things. General thingsWe should make sure that, as far as possible, Conditions should have similar names and flow for each resource. We should mandate the use of We should cover at least two states with positive-polarity, "check" conditions:
Other Conditions can be used, but must be negative-polarity, "error" conditions that describe specific error states that are not recoverable without human intervention. In general, these should add more information to the boolean "something is broken" information you will get from the positive-polarity summary Conditions. Existing examples here are Where that leads me is that we should try and settle on a way that we talk about how the API works, and have Conditions match that model. For example, we could lean into the current This is just an example, but the more we can use the same language to describe operations, the less conceptual overhead there is for using this API. Specific thingsGatewayAddresses field in GatewaySpec needs an update to be explicit about Conditions. Hostname field in the Listener struct needs an update to specify the Conditions in the hostname match failed case. AllowedRoutes in Listener should also be updated to have details about what happens in the cases of various number of routes being included and invalid:
RouteThe RouteParentStatus Conditions field has a bunch of details about how to handle Conditions - these should be moved to the ParentRefs field of CommonRouteSpec. HTTPRouteHTTPRouteRule Rules stanza should have documentation about how to handle:
(This should address #1112) We should ensure that the existing backendRefs docs document this in a similar format. backendRefs should also update with info about what Conditions to set in each case. The same should be done for the rules that are still in v1alpha2. ReferenceGrantReferenceGrant should have a status subresource added with a Condition that indicates if the resource is syntactically valid (probably I'm keeping track of stuff I've found in this sheet: https://docs.google.com/spreadsheets/d/1E6hrmFqQLN90Ngs5m4IaYI4CORPrj4Ueal1HZyJ6C0w/edit?usp=sharing for now as well. |
I agree that this is a problem, but I would like to sort out the basic flow first - it's a little orthogonal to the main thrust of this issue (which is that what we have at the moment is very inconsistent across resources). However, for Envoy-based implementations, this is all about ordering the cache updates or implementing ADS (which enforces ordering as well). |
Big +1 to this ^
In #1111 I had initially considered whether
#1031 is likely relevant here too for where this behavior should be documented. |
With respect to |
Two notes here:
|
I think that you don't want Conditions on Policy resources. In particular, it's possible (and reasonable/handy) to have a |
One additional state that might or might not be Condition-worthy: when there is at least one resource which is blocked from attachment due to Policy (in comparison with "resource does not exist" reasons). My rationale here is that debugging this kind of failure is likely to be common, and so a little guidance / hand-holding for new users / users in a rush might be helpful. Maybe this is part of Among other things, I'm guessing that the conformance spec might have different things to say about how the Gateway tree of configuration as a whole works in those error conditions. |
One last thought: Given the ability to delegate the configuration tree to different namespaces, you may want to provide different Condition views of the overall configuration tree on the HTTPRoute objects then on the Gateway object, for three reasons:
|
I think any conditions we put on ReferenceGrant will need to be namespaced by |
Good thoughts as always, @evankanderson, much appreciated.
I think we should work to try and make deletion of a ReferenceGrant not induce things to flip their About Condition polarity:
Agreed.
In the absence of some way to tell which polarity Conditions are (if only someone had suggested doing that 😛 ), our options are either:
I don't include "make them all negative" because that sounds even worse than what we have now. The second option has definitely been my plan, on the assumption that we'll try and keep the number of condition Types relatively low as well, but particularly the positive polarity ones. Thanks for the last thought about having HTTPRoute have a different view - I agree that the critical thing is that the HTTPRoute Conditions have the information that the HTTPRoute owner needs, and no more. HTTPRoute owners shouldn't be able to use the status to learn about things in another namespace they haven't asked to use. |
Okay, I think the outstanding item here is for me to write a GEP that details the changes to be made. I'll skip the preliminary "what and why", "merge a preliminary GEP that doesn't discuss the how" stage, since I think we've hashed that out in this issue, and write a full GEP on this one, with as many changes as I can find all in one spot. |
Okay, I've marked this issue as a GEP now, so I'll prepare a PR for GEP-1364 - Status and Conditions Update. |
Should this be closed now that #1383 has merged, or should it stay open until implementation is complete? |
I'm not really sure what we should do. Generally we've been leaving these open until they're implemented and documented, but not until they reach beta/GA. In upstream (k/enhancements), tracking issues are left open for all KEPs until they are either removed or graduate to GA. We may want to move to that approach. I've got a related PR that tries to more clearly document our approach here: #1440. |
If the dilemma is whether to keep it open until it releases, I say close it. It's tracked against v0.6.0 milestone that should be sufficient to indicate in what release it's done. |
@shaneutt At a minimum, I want to leave a GEP tracking bug open until it's met the criteria to be released. If we close out without actually implementing or documenting it, we may accidentally release a feature that's only half of the way there. |
relevant to this GEP/Issue: #1521 |
When #1540 merges, this is done enough that we can consider it implemented. |
We currently have a lot of outstanding issues related to status, mostly around Conditions, whether it be issues with specific Conditions being hard to understand, or general issues around Conditions. It's just because a lot of the status and Conditions design has grown very organically, what we effectively need is a design pass to make sure everything is consistent. There have been a few looks at this previously, but nothing coordinated across the whole API.
This issue covers a review of all the outstanding status issues, figuring out which ones relate, generating a TODO list, and then writing a GEP to cover all the changes.
Based on my initial quick pass, I think it's likely that this will include changes that will break conformance, but I think the longer we leave this, the worse the changes will be.
I propose making this a blocker for v0.6.0 release, so that we can get the pain out of the way now, and move forward with a consistent view of how our status flows work.
Here's my initial notes from doing a quick issue roundup.
Rough initial notes to link in relevant issues
In general, we need to make sure we're reusing the same Condition types across different resources, and that they have the same or similar definitions across resources as well.
The biggest review to date was #1111, and there are still some outstanding items from that.
GatewayClass has
Accepted
. This seems well-defined.Gateway has
Ready
, which has significant issues, see #1156. This probably needs a follow-up to describe the change.I think we definitely need to have a better description of what
Ready
means, and ensure that the definitions are consistent. @evankanderson has raised that it would be really useful for knative's use case forReady
to mean "Ready to pass traffic as soon as it's set", but I don't know the extent to which we can mandate that. We may be stuck with setting a "Ready
means will be ready in <5sec" or something, but that will require implementation feedback.We need to move the Listener condition to
Attached
instead ofDetached
, as described in #1110. Having a small set of standard positive-polarity (that is, they arestatus: true
if everything is good) Conditions is useful, and similar things should have similar names.#1112 also needs to be updated post the closing of #1211.
We also need to note if conditions should be added if they are in a happy state. The latest comments on #1315 is talking about this.
Personally, I prefer that conditions are always added, even if they are
Unknown
, but this one needs discussion.#1362 should probably also be done as part of this.
The text was updated successfully, but these errors were encountered: