Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GEP: Conditions and Status update #1364

Closed
youngnick opened this issue Aug 26, 2022 · 23 comments · Fixed by #1560
Closed

GEP: Conditions and Status update #1364

youngnick opened this issue Aug 26, 2022 · 23 comments · Fixed by #1560
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. kind/gep PRs related to Gateway Enhancement Proposal(GEP)
Milestone

Comments

@youngnick
Copy link
Contributor

We currently have a lot of outstanding issues related to status, mostly around Conditions, whether it be issues with specific Conditions being hard to understand, or general issues around Conditions. It's just because a lot of the status and Conditions design has grown very organically, what we effectively need is a design pass to make sure everything is consistent. There have been a few looks at this previously, but nothing coordinated across the whole API.

This issue covers a review of all the outstanding status issues, figuring out which ones relate, generating a TODO list, and then writing a GEP to cover all the changes.

Based on my initial quick pass, I think it's likely that this will include changes that will break conformance, but I think the longer we leave this, the worse the changes will be.

I propose making this a blocker for v0.6.0 release, so that we can get the pain out of the way now, and move forward with a consistent view of how our status flows work.

Here's my initial notes from doing a quick issue roundup.

Rough initial notes to link in relevant issues

In general, we need to make sure we're reusing the same Condition types across different resources, and that they have the same or similar definitions across resources as well.

The biggest review to date was #1111, and there are still some outstanding items from that.

GatewayClass has Accepted. This seems well-defined.

Gateway has Ready, which has significant issues, see #1156. This probably needs a follow-up to describe the change.

I think we definitely need to have a better description of what Ready means, and ensure that the definitions are consistent. @evankanderson has raised that it would be really useful for knative's use case for Ready to mean "Ready to pass traffic as soon as it's set", but I don't know the extent to which we can mandate that. We may be stuck with setting a "Ready means will be ready in <5sec" or something, but that will require implementation feedback.

We need to move the Listener condition to Attached instead of Detached, as described in #1110. Having a small set of standard positive-polarity (that is, they are status: true if everything is good) Conditions is useful, and similar things should have similar names.

#1112 also needs to be updated post the closing of #1211.

We also need to note if conditions should be added if they are in a happy state. The latest comments on #1315 is talking about this.
Personally, I prefer that conditions are always added, even if they are Unknown, but this one needs discussion.

#1362 should probably also be done as part of this.

@youngnick youngnick added the kind/feature Categorizes issue or PR as related to a new feature. label Aug 26, 2022
@youngnick youngnick added this to the v0.6.0 milestone Aug 26, 2022
@youngnick youngnick self-assigned this Aug 26, 2022
@robscott
Copy link
Member

I was initially going to create a new issue for this, but will start here for now since I know this is meant to be a bit of an umbrella issue.

Closely related to your comments about a "Ready" condition on HTTPRoute, I think we really need something to fill the gap in our current conformance tests here. Currently we're limited to an "Accepted" condition with a rather vague definition:

// This condition indicates whether the route has been accepted or rejected
// by a Gateway, and why.

Unfortunately we're using that condition as a signal that HTTPRoutes are ready for conformance tests:

Type: string(v1alpha2.RouteConditionAccepted),
Status: metav1.ConditionTrue,
Reason: string(v1alpha2.RouteReasonAccepted),

That usually starts the 30 second timer for MakeRequestAndExpectEventuallyConsistentResponse. Unfortunately some implementations are going to have a longer gap than that between when they've "Accepted" a Route and when they've "Reconciled" it, unless we should be interpreting "Accepted" to also mean "Reconciled". It seems like we need to:

  1. More clearly define what "Accepted" means for Routes
  2. Add a new Route condition (Reconciled, Ready, or similar) that conformance tests wait for before making HTTP requests

@shaneutt
Copy link
Member

I'm glad you brought that one up Rob, as that's been one on my mind as well.

Right now I would personally characterize Accepted as also meaning Reconciled, at least colloquially. Defining it as anything else seems difficult because all implementations universally have a sense of "the dataplane has been configured" but everything in between that, and even some of the stuff after that seems more dubious as far as universality.

Do you think it's on the table to increase the timeout to something that supports all known implementations, and simply defining it as "configured in dataplane"?

@robscott
Copy link
Member

robscott commented Aug 26, 2022

Do you think it's on the table to increase the timeout to something that supports all known implementations, and simply defining it as "configured in dataplane"?

I think we probably need some change like this for conformance. The amount of time it takes for these changes to propagate is wildly dependent on what's being configured. For example, if we're rolling out changes globally, it may take longer for them to reach certain regions. We already allow these timeout to be configurable, maybe we just need to remove or increase the max value that we consider conformant here:

// Max value for conformant implementation: 30 seconds

@evankanderson
Copy link
Contributor

I have a somewhat-related concern, based on past history with different envoy-based implementations.

In some implementations, adding a new percentage rule with a new backend pool to an existing HTTPRoute can cause traffic to be dropped with 500s (no healthy upstream). The root cause is that the new route definitions arrive at some envoy pods before the cluster definitions do, and when traffic arrives at that route, there's no corresponding backend programmed.

It would be nice to have a set of rules as to when / how to avoid this scenario (dropping traffic on an already-programmed route when making changes). Per-implementation workarounds are going to be less fun in a Gateway world where e.g. Knative wants to be aware of a single abstraction, rather than inventorying all known implementations.

I mention this in the context of conditions as it seems like the might be Conditions involved in the solution to "don't drop traffic when adding backends under load".

@evankanderson
Copy link
Contributor

Do you think it's on the table to increase the timeout to something that supports all known implementations, and simply defining it as "configured in dataplane"?

I think we probably need some change like this for conformance. The amount of time it takes for these changes to propagate is wildly dependent on what's being configured. For example, if we're rolling out changes globally, it may take longer for them to reach certain regions. We already allow these timeout to be configurable, maybe we just need to remove or increase the max value that we consider conformant here:

// Max value for conformant implementation: 30 seconds

Would a Ready congestion separate from Accepted/Reconciled allow implementations to choose that 30s value for themselves? (Yes)

I think the question is whether implementations would know that timing duration themselves (i.e. it might differ depending on the number and location of nodes).

@youngnick
Copy link
Contributor Author

youngnick commented Aug 29, 2022

Okay, I've spent a whole lot of time reading through issues, code, and docs, and here are my current suggestions. I've bucketed them into more general things and specific things.

General things

We should make sure that, as far as possible, Conditions should have similar names and flow for each resource.

We should mandate the use of observedGeneration for Conditions, and check that it's set correctly in conformance. This gives the conditions a built-in staleness check with respect to spec, but is less helpful when other changes may trigger Condition changes (think here of Ready flapping if the dataplane breaks).

We should cover at least two states with positive-polarity, "check" conditions:

  • Accepted or Reconciled, or Attached, which mean that the resource has been seen by the controller and fits into a set of related Gateway API resources that are syntactically and semantically valid.
  • Ready, Programmed, or simlar, which mean that the resources have been scheduled, programmed, or otherwise provisioned onto underlying infrastructure and are either ready now or will be soon (where "soon" is currently TBD, but will be clearly defined and included in conformance). Note that this particular Condition may change state if the state of underlying infrastructure changes, so observedGeneration is less useful.

Other Conditions can be used, but must be negative-polarity, "error" conditions that describe specific error states that are not recoverable without human intervention. In general, these should add more information to the boolean "something is broken" information you will get from the positive-polarity summary Conditions. Existing examples here are ResolvedRefs or Conflicted.

Where that leads me is that we should try and settle on a way that we talk about how the API works, and have Conditions match that model. For example, we could lean into the current Attached name, and say that, rather than being Scheduled, Gateways attach to the GatewayClass implemented by their controller, and thus the Attached Condition is the one that means "this Gateway has been seen by its controller, and is syntactically and semantically valid (along with the stuff that it's linked to)." Then, Routes simliarly attach to the Gateway, and have a Condition that says they are Attached. Policy resources and ReferenceGrant resources may also be able to use Attached to describe the resources they attach to as metadata resources.

This is just an example, but the more we can use the same language to describe operations, the less conceptual overhead there is for using this API.

Specific things

Gateway

Addresses field in GatewaySpec needs an update to be explicit about Conditions.

Hostname field in the Listener struct needs an update to specify the Conditions in the hostname match failed case.

AllowedRoutes in Listener should also be updated to have details about what happens in the cases of various number of routes being included and invalid:

Route

The RouteParentStatus Conditions field has a bunch of details about how to handle Conditions - these should be moved to the ParentRefs field of CommonRouteSpec.

HTTPRoute

HTTPRouteRule Rules stanza should have documentation about how to handle:

  • 0 valid Rules
  • all valid rules
  • some valid rules, and some invalid.

(This should address #1112)

We should ensure that the existing backendRefs docs document this in a similar format.

backendRefs should also update with info about what Conditions to set in each case.

The same should be done for the rules that are still in v1alpha2.

ReferenceGrant

ReferenceGrant should have a status subresource added with a Condition that indicates if the resource is syntactically valid (probably Accepted). This needs an issue and agreement, and should probably block v0.6.0.

I'm keeping track of stuff I've found in this sheet: https://docs.google.com/spreadsheets/d/1E6hrmFqQLN90Ngs5m4IaYI4CORPrj4Ueal1HZyJ6C0w/edit?usp=sharing for now as well.

@youngnick
Copy link
Contributor Author

In some implementations, adding a new percentage rule with a new backend pool to an existing HTTPRoute can cause traffic to be dropped with 500s (no healthy upstream). The root cause is that the new route definitions arrive at some envoy pods before the cluster definitions do, and when traffic arrives at that route, there's no corresponding backend programmed.

It would be nice to have a set of rules as to when / how to avoid this scenario (dropping traffic on an already-programmed route when making changes). Per-implementation workarounds are going to be less fun in a Gateway world where e.g. Knative wants to be aware of a single abstraction, rather than inventorying all known implementations.

I agree that this is a problem, but I would like to sort out the basic flow first - it's a little orthogonal to the main thrust of this issue (which is that what we have at the moment is very inconsistent across resources).

However, for Envoy-based implementations, this is all about ordering the cache updates or implementing ADS (which enforces ordering as well).

@mikemorris
Copy link
Contributor

This is just an example, but the more we can use the same language to describe operations, the less conceptual overhead there is for using this API.

Big +1 to this ^

Accepted or Reconciled, or Attached, which mean that the resource has been seen by the controller and fits into a set of related Gateway API resources that are syntactically and semantically valid.

In #1111 I had initially considered whether Accepted and Attached would mean the same thing (and thus want to remove one of them), but I think the discussion in #1085 (comment) proposes a clear distinction between the two and a use case where they may be set to different values. Agreed about language referring to route/policy "attachment" makes this all a bit unclear though.

This condition [Attached] indicates that, even though the listener is syntactically and semantically valid, the controller is not able to configure it on the underlying Gateway infrastructure.

#1031 is likely relevant here too for where this behavior should be documented.

@evankanderson
Copy link
Contributor

With respect to observedGeneration, both Accepted and Ready might flip from True to Unknown or False without a spec change as other parts of a configuration tree change. (Deletion of a ReferenceGrant, for example.)

@evankanderson
Copy link
Contributor

Other Conditions can be used, but must be negative-polarity, "error" conditions that describe specific error states that are not recoverable without human intervention. In general, these should add more information to the boolean "something is broken" information you will get from the positive-polarity summary Conditions. Existing examples here are ResolvedRefs or Conflicted.

Two notes here:

  1. ResolvedRefs is positive, while Conflicted is negative. That's probably the high order bit to sort out.
  2. If you mix positive and negative polarities. (Accepted and Ready are positive, others are negative), every UI which wants to display the status of Gateway resources as dots/check marks/etc will need to bake in that rule. "UI" here includes both web UIs as well as CLI tools.

@evankanderson
Copy link
Contributor

Policy resources and ReferenceGrant resources may also be able to use Attached to describe the resources they attach to as metadata resources.

I think that you don't want Conditions on Policy resources. In particular, it's possible (and reasonable/handy) to have a ReferenceGrant which grants two different Gateway objects with different controllers access to all Services in a namespace.

@evankanderson
Copy link
Contributor

One additional state that might or might not be Condition-worthy: when there is at least one resource which is blocked from attachment due to Policy (in comparison with "resource does not exist" reasons).

My rationale here is that debugging this kind of failure is likely to be common, and so a little guidance / hand-holding for new users / users in a rush might be helpful.

Maybe this is part of Accepted, but we should consider whether "Gateway references a non-existent Secret in a different namespace" is the same Condition / error message as "Gateway references a Secret in a namespace which it is not allowed to use".

Among other things, I'm guessing that the conformance spec might have different things to say about how the Gateway tree of configuration as a whole works in those error conditions.

@evankanderson
Copy link
Contributor

One last thought:

Given the ability to delegate the configuration tree to different namespaces, you may want to provide different Condition views of the overall configuration tree on the HTTPRoute objects then on the Gateway object, for three reasons:

  1. limit status updates on state change
  2. avoid confusing/unable to assist failure reports when resources in a different namespace are broken
  3. avoid leaking resource information between namespaces for users limited to a single namespace

@youngnick
Copy link
Contributor Author

I think any conditions we put on ReferenceGrant will need to be namespaced by controllerName, just like most of the other status.

@youngnick
Copy link
Contributor Author

Good thoughts as always, @evankanderson, much appreciated.

With respect to observedGeneration, both Accepted and Ready might flip from True to Unknown or False without a spec change as other parts of a configuration tree change. (Deletion of a ReferenceGrant, for example.)

I think we should work to try and make deletion of a ReferenceGrant not induce things to flip their Attached state - we've been moving in that direction a bit, but I think it's worth checking the rules for this specific case. If we can keep it that only Ready flips without a spec change, I think we're in a better place.

About Condition polarity:

Two notes here:

  1. ResolvedRefs is positive, while Conflicted is negative. That's probably the high order bit to sort out.

Agreed.

  1. If you mix positive and negative polarities. (Accepted and Ready are positive, others are negative), every UI which wants to display the status of Gateway resources as dots/check marks/etc will need to bake in that rule. "UI" here includes both web UIs as well as CLI tools.

In the absence of some way to tell which polarity Conditions are (if only someone had suggested doing that 😛 ), our options are either:

  • Mandate all positive Conditions
  • Allow mixed ones, and keep the number of positive ones low, and have common Types, so that UIs can special-case-handle them easily

I don't include "make them all negative" because that sounds even worse than what we have now.

The second option has definitely been my plan, on the assumption that we'll try and keep the number of condition Types relatively low as well, but particularly the positive polarity ones.

Thanks for the last thought about having HTTPRoute have a different view - I agree that the critical thing is that the HTTPRoute Conditions have the information that the HTTPRoute owner needs, and no more. HTTPRoute owners shouldn't be able to use the status to learn about things in another namespace they haven't asked to use.

@youngnick
Copy link
Contributor Author

Okay, I think the outstanding item here is for me to write a GEP that details the changes to be made. I'll skip the preliminary "what and why", "merge a preliminary GEP that doesn't discuss the how" stage, since I think we've hashed that out in this issue, and write a full GEP on this one, with as many changes as I can find all in one spot.

@youngnick youngnick changed the title Conditions and Status review GEP: Conditions and Status update Sep 1, 2022
@youngnick youngnick added the kind/gep PRs related to Gateway Enhancement Proposal(GEP) label Sep 1, 2022
@youngnick
Copy link
Contributor Author

Okay, I've marked this issue as a GEP now, so I'll prepare a PR for GEP-1364 - Status and Conditions Update.

@mikemorris
Copy link
Contributor

Should this be closed now that #1383 has merged, or should it stay open until implementation is complete?

@robscott
Copy link
Member

Should this be closed now that #1383 has merged, or should it stay open until implementation is complete?

I'm not really sure what we should do. Generally we've been leaving these open until they're implemented and documented, but not until they reach beta/GA. In upstream (k/enhancements), tracking issues are left open for all KEPs until they are either removed or graduate to GA. We may want to move to that approach. I've got a related PR that tries to more clearly document our approach here: #1440.

@shaneutt
Copy link
Member

If the dilemma is whether to keep it open until it releases, I say close it. It's tracked against v0.6.0 milestone that should be sufficient to indicate in what release it's done.

@robscott
Copy link
Member

@shaneutt At a minimum, I want to leave a GEP tracking bug open until it's met the criteria to be released. If we close out without actually implementing or documenting it, we may accidentally release a feature that's only half of the way there.

@sunjayBhatia
Copy link
Member

relevant to this GEP/Issue: #1521

@youngnick
Copy link
Contributor Author

When #1540 merges, this is done enough that we can consider it implemented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. kind/gep PRs related to Gateway Enhancement Proposal(GEP)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants