GEP: Conditions and Status update #1364

youngnick · 2022-08-26T00:18:30Z

We currently have a lot of outstanding issues related to status, mostly around Conditions, whether it be issues with specific Conditions being hard to understand, or general issues around Conditions. It's just because a lot of the status and Conditions design has grown very organically, what we effectively need is a design pass to make sure everything is consistent. There have been a few looks at this previously, but nothing coordinated across the whole API.

This issue covers a review of all the outstanding status issues, figuring out which ones relate, generating a TODO list, and then writing a GEP to cover all the changes.

Based on my initial quick pass, I think it's likely that this will include changes that will break conformance, but I think the longer we leave this, the worse the changes will be.

I propose making this a blocker for v0.6.0 release, so that we can get the pain out of the way now, and move forward with a consistent view of how our status flows work.

Here's my initial notes from doing a quick issue roundup.

Rough initial notes to link in relevant issues

In general, we need to make sure we're reusing the same Condition types across different resources, and that they have the same or similar definitions across resources as well.

The biggest review to date was #1111, and there are still some outstanding items from that.

GatewayClass has Accepted. This seems well-defined.

Gateway has Ready, which has significant issues, see #1156. This probably needs a follow-up to describe the change.

I think we definitely need to have a better description of what Ready means, and ensure that the definitions are consistent. @evankanderson has raised that it would be really useful for knative's use case for Ready to mean "Ready to pass traffic as soon as it's set", but I don't know the extent to which we can mandate that. We may be stuck with setting a "Ready means will be ready in <5sec" or something, but that will require implementation feedback.

We need to move the Listener condition to Attached instead of Detached, as described in #1110. Having a small set of standard positive-polarity (that is, they are status: true if everything is good) Conditions is useful, and similar things should have similar names.

#1112 also needs to be updated post the closing of #1211.

We also need to note if conditions should be added if they are in a happy state. The latest comments on #1315 is talking about this.
Personally, I prefer that conditions are always added, even if they are Unknown, but this one needs discussion.

#1362 should probably also be done as part of this.

The text was updated successfully, but these errors were encountered:

robscott · 2022-08-26T19:09:50Z

I was initially going to create a new issue for this, but will start here for now since I know this is meant to be a bit of an umbrella issue.

Closely related to your comments about a "Ready" condition on HTTPRoute, I think we really need something to fill the gap in our current conformance tests here. Currently we're limited to an "Accepted" condition with a rather vague definition:

gateway-api/apis/v1beta1/shared_types.go

Lines 190 to 191 in eba0e7e

    
           // This condition indicates whether the route has been accepted or rejected 
        
           // by a Gateway, and why.

Unfortunately we're using that condition as a signal that HTTPRoutes are ready for conformance tests:

gateway-api/conformance/utils/kubernetes/helpers.go

Lines 167 to 169 in eba0e7e

    
           Type:   string(v1alpha2.RouteConditionAccepted), 
        
           Status: metav1.ConditionTrue, 
        
           Reason: string(v1alpha2.RouteReasonAccepted),

That usually starts the 30 second timer for MakeRequestAndExpectEventuallyConsistentResponse. Unfortunately some implementations are going to have a longer gap than that between when they've "Accepted" a Route and when they've "Reconciled" it, unless we should be interpreting "Accepted" to also mean "Reconciled". It seems like we need to:

More clearly define what "Accepted" means for Routes
Add a new Route condition (Reconciled, Ready, or similar) that conformance tests wait for before making HTTP requests

shaneutt · 2022-08-26T19:32:22Z

I'm glad you brought that one up Rob, as that's been one on my mind as well.

Right now I would personally characterize Accepted as also meaning Reconciled, at least colloquially. Defining it as anything else seems difficult because all implementations universally have a sense of "the dataplane has been configured" but everything in between that, and even some of the stuff after that seems more dubious as far as universality.

Do you think it's on the table to increase the timeout to something that supports all known implementations, and simply defining it as "configured in dataplane"?

robscott · 2022-08-26T20:46:43Z

Do you think it's on the table to increase the timeout to something that supports all known implementations, and simply defining it as "configured in dataplane"?

I think we probably need some change like this for conformance. The amount of time it takes for these changes to propagate is wildly dependent on what's being configured. For example, if we're rolling out changes globally, it may take longer for them to reach certain regions. We already allow these timeout to be configurable, maybe we just need to remove or increase the max value that we consider conformant here:

gateway-api/conformance/utils/config/timeout.go

Line 59 in eba0e7e

// Max value for conformant implementation: 30 seconds

evankanderson · 2022-08-28T11:37:34Z

I have a somewhat-related concern, based on past history with different envoy-based implementations.

In some implementations, adding a new percentage rule with a new backend pool to an existing HTTPRoute can cause traffic to be dropped with 500s (no healthy upstream). The root cause is that the new route definitions arrive at some envoy pods before the cluster definitions do, and when traffic arrives at that route, there's no corresponding backend programmed.

It would be nice to have a set of rules as to when / how to avoid this scenario (dropping traffic on an already-programmed route when making changes). Per-implementation workarounds are going to be less fun in a Gateway world where e.g. Knative wants to be aware of a single abstraction, rather than inventorying all known implementations.

I mention this in the context of conditions as it seems like the might be Conditions involved in the solution to "don't drop traffic when adding backends under load".

evankanderson · 2022-08-28T11:55:31Z

Do you think it's on the table to increase the timeout to something that supports all known implementations, and simply defining it as "configured in dataplane"?

I think we probably need some change like this for conformance. The amount of time it takes for these changes to propagate is wildly dependent on what's being configured. For example, if we're rolling out changes globally, it may take longer for them to reach certain regions. We already allow these timeout to be configurable, maybe we just need to remove or increase the max value that we consider conformant here:

gateway-api/conformance/utils/config/timeout.go

Line 59 in eba0e7e

// Max value for conformant implementation: 30 seconds

Would a Ready congestion separate from Accepted/Reconciled allow implementations to choose that 30s value for themselves? (Yes)

I think the question is whether implementations would know that timing duration themselves (i.e. it might differ depending on the number and location of nodes).

youngnick · 2022-08-29T04:03:56Z

Okay, I've spent a whole lot of time reading through issues, code, and docs, and here are my current suggestions. I've bucketed them into more general things and specific things.

General things

We should make sure that, as far as possible, Conditions should have similar names and flow for each resource.

We should mandate the use of observedGeneration for Conditions, and check that it's set correctly in conformance. This gives the conditions a built-in staleness check with respect to spec, but is less helpful when other changes may trigger Condition changes (think here of Ready flapping if the dataplane breaks).

We should cover at least two states with positive-polarity, "check" conditions:

Accepted or Reconciled, or Attached, which mean that the resource has been seen by the controller and fits into a set of related Gateway API resources that are syntactically and semantically valid.
Ready, Programmed, or simlar, which mean that the resources have been scheduled, programmed, or otherwise provisioned onto underlying infrastructure and are either ready now or will be soon (where "soon" is currently TBD, but will be clearly defined and included in conformance). Note that this particular Condition may change state if the state of underlying infrastructure changes, so observedGeneration is less useful.

Other Conditions can be used, but must be negative-polarity, "error" conditions that describe specific error states that are not recoverable without human intervention. In general, these should add more information to the boolean "something is broken" information you will get from the positive-polarity summary Conditions. Existing examples here are ResolvedRefs or Conflicted.

Where that leads me is that we should try and settle on a way that we talk about how the API works, and have Conditions match that model. For example, we could lean into the current Attached name, and say that, rather than being Scheduled, Gateways attach to the GatewayClass implemented by their controller, and thus the Attached Condition is the one that means "this Gateway has been seen by its controller, and is syntactically and semantically valid (along with the stuff that it's linked to)." Then, Routes simliarly attach to the Gateway, and have a Condition that says they are Attached. Policy resources and ReferenceGrant resources may also be able to use Attached to describe the resources they attach to as metadata resources.

This is just an example, but the more we can use the same language to describe operations, the less conceptual overhead there is for using this API.

Specific things

Gateway

Addresses field in GatewaySpec needs an update to be explicit about Conditions.

Hostname field in the Listener struct needs an update to specify the Conditions in the hostname match failed case.

AllowedRoutes in Listener should also be updated to have details about what happens in the cases of various number of routes being included and invalid:

No valid routes attached.
All routes attached are valid.
At least one valid route is attached, and 1 or more invalid routes are attached.
Is a Route with 0 valid rules valid? I'm incined to say yes.
(This should address Route handling of invalid BackendRefs needs clarification #1112 )

Route

The RouteParentStatus Conditions field has a bunch of details about how to handle Conditions - these should be moved to the ParentRefs field of CommonRouteSpec.

HTTPRoute

HTTPRouteRule Rules stanza should have documentation about how to handle:

0 valid Rules
all valid rules
some valid rules, and some invalid.

(This should address #1112)

We should ensure that the existing backendRefs docs document this in a similar format.

backendRefs should also update with info about what Conditions to set in each case.

The same should be done for the rules that are still in v1alpha2.

ReferenceGrant

ReferenceGrant should have a status subresource added with a Condition that indicates if the resource is syntactically valid (probably Accepted). This needs an issue and agreement, and should probably block v0.6.0.

I'm keeping track of stuff I've found in this sheet: https://docs.google.com/spreadsheets/d/1E6hrmFqQLN90Ngs5m4IaYI4CORPrj4Ueal1HZyJ6C0w/edit?usp=sharing for now as well.

youngnick · 2022-08-29T04:06:48Z

In some implementations, adding a new percentage rule with a new backend pool to an existing HTTPRoute can cause traffic to be dropped with 500s (no healthy upstream). The root cause is that the new route definitions arrive at some envoy pods before the cluster definitions do, and when traffic arrives at that route, there's no corresponding backend programmed.

It would be nice to have a set of rules as to when / how to avoid this scenario (dropping traffic on an already-programmed route when making changes). Per-implementation workarounds are going to be less fun in a Gateway world where e.g. Knative wants to be aware of a single abstraction, rather than inventorying all known implementations.

I agree that this is a problem, but I would like to sort out the basic flow first - it's a little orthogonal to the main thrust of this issue (which is that what we have at the moment is very inconsistent across resources).

However, for Envoy-based implementations, this is all about ordering the cache updates or implementing ADS (which enforces ordering as well).

mikemorris · 2022-08-29T16:29:20Z

This is just an example, but the more we can use the same language to describe operations, the less conceptual overhead there is for using this API.

Big +1 to this ^

Accepted or Reconciled, or Attached, which mean that the resource has been seen by the controller and fits into a set of related Gateway API resources that are syntactically and semantically valid.

In #1111 I had initially considered whether Accepted and Attached would mean the same thing (and thus want to remove one of them), but I think the discussion in #1085 (comment) proposes a clear distinction between the two and a use case where they may be set to different values. Agreed about language referring to route/policy "attachment" makes this all a bit unclear though.

This condition [Attached] indicates that, even though the listener is syntactically and semantically valid, the controller is not able to configure it on the underlying Gateway infrastructure.

#1031 is likely relevant here too for where this behavior should be documented.

evankanderson · 2022-08-30T10:30:42Z

With respect to observedGeneration, both Accepted and Ready might flip from True to Unknown or False without a spec change as other parts of a configuration tree change. (Deletion of a ReferenceGrant, for example.)

evankanderson · 2022-08-30T10:35:56Z

Other Conditions can be used, but must be negative-polarity, "error" conditions that describe specific error states that are not recoverable without human intervention. In general, these should add more information to the boolean "something is broken" information you will get from the positive-polarity summary Conditions. Existing examples here are ResolvedRefs or Conflicted.

Two notes here:

ResolvedRefs is positive, while Conflicted is negative. That's probably the high order bit to sort out.
If you mix positive and negative polarities. (Accepted and Ready are positive, others are negative), every UI which wants to display the status of Gateway resources as dots/check marks/etc will need to bake in that rule. "UI" here includes both web UIs as well as CLI tools.

evankanderson · 2022-08-30T10:40:29Z

Policy resources and ReferenceGrant resources may also be able to use Attached to describe the resources they attach to as metadata resources.

I think that you don't want Conditions on Policy resources. In particular, it's possible (and reasonable/handy) to have a ReferenceGrant which grants two different Gateway objects with different controllers access to all Services in a namespace.

evankanderson · 2022-08-30T10:48:50Z

One additional state that might or might not be Condition-worthy: when there is at least one resource which is blocked from attachment due to Policy (in comparison with "resource does not exist" reasons).

My rationale here is that debugging this kind of failure is likely to be common, and so a little guidance / hand-holding for new users / users in a rush might be helpful.

Maybe this is part of Accepted, but we should consider whether "Gateway references a non-existent Secret in a different namespace" is the same Condition / error message as "Gateway references a Secret in a namespace which it is not allowed to use".

Among other things, I'm guessing that the conformance spec might have different things to say about how the Gateway tree of configuration as a whole works in those error conditions.

evankanderson · 2022-08-30T10:55:17Z

One last thought:

Given the ability to delegate the configuration tree to different namespaces, you may want to provide different Condition views of the overall configuration tree on the HTTPRoute objects then on the Gateway object, for three reasons:

limit status updates on state change
avoid confusing/unable to assist failure reports when resources in a different namespace are broken
avoid leaking resource information between namespaces for users limited to a single namespace

youngnick · 2022-08-31T07:14:08Z

I think any conditions we put on ReferenceGrant will need to be namespaced by controllerName, just like most of the other status.

youngnick · 2022-08-31T07:24:30Z

Good thoughts as always, @evankanderson, much appreciated.

With respect to observedGeneration, both Accepted and Ready might flip from True to Unknown or False without a spec change as other parts of a configuration tree change. (Deletion of a ReferenceGrant, for example.)

I think we should work to try and make deletion of a ReferenceGrant not induce things to flip their Attached state - we've been moving in that direction a bit, but I think it's worth checking the rules for this specific case. If we can keep it that only Ready flips without a spec change, I think we're in a better place.

About Condition polarity:

Two notes here:

ResolvedRefs is positive, while Conflicted is negative. That's probably the high order bit to sort out.

Agreed.

If you mix positive and negative polarities. (Accepted and Ready are positive, others are negative), every UI which wants to display the status of Gateway resources as dots/check marks/etc will need to bake in that rule. "UI" here includes both web UIs as well as CLI tools.

In the absence of some way to tell which polarity Conditions are (if only someone had suggested doing that 😛 ), our options are either:

Mandate all positive Conditions
Allow mixed ones, and keep the number of positive ones low, and have common Types, so that UIs can special-case-handle them easily

I don't include "make them all negative" because that sounds even worse than what we have now.

The second option has definitely been my plan, on the assumption that we'll try and keep the number of condition Types relatively low as well, but particularly the positive polarity ones.

Thanks for the last thought about having HTTPRoute have a different view - I agree that the critical thing is that the HTTPRoute Conditions have the information that the HTTPRoute owner needs, and no more. HTTPRoute owners shouldn't be able to use the status to learn about things in another namespace they haven't asked to use.

youngnick · 2022-08-31T07:26:18Z

Okay, I think the outstanding item here is for me to write a GEP that details the changes to be made. I'll skip the preliminary "what and why", "merge a preliminary GEP that doesn't discuss the how" stage, since I think we've hashed that out in this issue, and write a full GEP on this one, with as many changes as I can find all in one spot.

youngnick · 2022-09-01T03:34:15Z

Okay, I've marked this issue as a GEP now, so I'll prepare a PR for GEP-1364 - Status and Conditions Update.

mikemorris · 2022-10-10T19:00:05Z

Should this be closed now that #1383 has merged, or should it stay open until implementation is complete?

robscott · 2022-10-10T19:21:56Z

Should this be closed now that #1383 has merged, or should it stay open until implementation is complete?

I'm not really sure what we should do. Generally we've been leaving these open until they're implemented and documented, but not until they reach beta/GA. In upstream (k/enhancements), tracking issues are left open for all KEPs until they are either removed or graduate to GA. We may want to move to that approach. I've got a related PR that tries to more clearly document our approach here: #1440.

shaneutt · 2022-10-10T19:26:11Z

If the dilemma is whether to keep it open until it releases, I say close it. It's tracked against v0.6.0 milestone that should be sufficient to indicate in what release it's done.

robscott · 2022-10-10T19:34:23Z

@shaneutt At a minimum, I want to leave a GEP tracking bug open until it's met the criteria to be released. If we close out without actually implementing or documenting it, we may accidentally release a feature that's only half of the way there.

sunjayBhatia · 2022-11-09T19:29:33Z

relevant to this GEP/Issue: #1521

youngnick · 2022-11-17T04:44:32Z

When #1540 merges, this is done enough that we can consider it implemented.

youngnick added the kind/feature Categorizes issue or PR as related to a new feature. label Aug 26, 2022

youngnick added this to the v0.6.0 milestone Aug 26, 2022

youngnick self-assigned this Aug 26, 2022

youngnick mentioned this issue Aug 26, 2022

Clarify RefNotPermitted and InvalidCertificateRef reasons in API Spec #1362

Closed

youngnick mentioned this issue Aug 29, 2022

Documenting supported versions #1349

Merged

youngnick changed the title ~~Conditions and Status review~~ GEP: Conditions and Status update Sep 1, 2022

youngnick added the kind/gep PRs related to Gateway Enhancement Proposal(GEP) label Sep 1, 2022

This was referenced Sep 9, 2022

Add GEP-1364 proposal #1383

Merged

Update HTTPRoute and other status definitions and documentation #1077

Closed

Indicate a new Route "Readiness" condition #1156

Closed

This was referenced Oct 10, 2022

add ListenerConditionAccepted, update docs from "Detached" #1446

Merged

add GatewayConditionAccepted and GatewayReasonAccepted #1447

Merged

youngnick mentioned this issue Oct 11, 2022

Fix last round of feedback on GEP-1364 #1451

Merged

youngnick mentioned this issue Nov 29, 2022

Update GEP-1364 to Standard implementation #1560

Merged

k8s-ci-robot closed this as completed in #1560 Dec 7, 2022

dprotaso mentioned this issue Dec 7, 2022

Update conformance test to assert observedGeneration is bumped #1584

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GEP: Conditions and Status update #1364

GEP: Conditions and Status update #1364

youngnick commented Aug 26, 2022

robscott commented Aug 26, 2022

shaneutt commented Aug 26, 2022

robscott commented Aug 26, 2022 •

edited

Loading

evankanderson commented Aug 28, 2022

evankanderson commented Aug 28, 2022

youngnick commented Aug 29, 2022 •

edited

Loading

youngnick commented Aug 29, 2022

mikemorris commented Aug 29, 2022

evankanderson commented Aug 30, 2022

evankanderson commented Aug 30, 2022

evankanderson commented Aug 30, 2022

evankanderson commented Aug 30, 2022

evankanderson commented Aug 30, 2022

youngnick commented Aug 31, 2022

youngnick commented Aug 31, 2022

youngnick commented Aug 31, 2022

youngnick commented Sep 1, 2022

mikemorris commented Oct 10, 2022

robscott commented Oct 10, 2022

shaneutt commented Oct 10, 2022

robscott commented Oct 10, 2022

sunjayBhatia commented Nov 9, 2022

youngnick commented Nov 17, 2022

GEP: Conditions and Status update #1364

GEP: Conditions and Status update #1364

Comments

youngnick commented Aug 26, 2022

Rough initial notes to link in relevant issues

robscott commented Aug 26, 2022

shaneutt commented Aug 26, 2022

robscott commented Aug 26, 2022 • edited Loading

evankanderson commented Aug 28, 2022

evankanderson commented Aug 28, 2022

youngnick commented Aug 29, 2022 • edited Loading

General things

Specific things

Gateway

Route

HTTPRoute

ReferenceGrant

youngnick commented Aug 29, 2022

mikemorris commented Aug 29, 2022

evankanderson commented Aug 30, 2022

evankanderson commented Aug 30, 2022

evankanderson commented Aug 30, 2022

evankanderson commented Aug 30, 2022

evankanderson commented Aug 30, 2022

youngnick commented Aug 31, 2022

youngnick commented Aug 31, 2022

youngnick commented Aug 31, 2022

youngnick commented Sep 1, 2022

mikemorris commented Oct 10, 2022

robscott commented Oct 10, 2022

shaneutt commented Oct 10, 2022

robscott commented Oct 10, 2022

sunjayBhatia commented Nov 9, 2022

youngnick commented Nov 17, 2022

robscott commented Aug 26, 2022 •

edited

Loading

youngnick commented Aug 29, 2022 •

edited

Loading