Allow a result that indicates the reconciliation is incomplete and does not trigger the exponential backoff logic #617
Comments
To add a bit of additional context it is currently very difficult to test controller-runtime based reconciliation loops with the current behavior if there are cases where only partial reconciliation is expected due to external dependencies that trigger reconciliation based on watches. Allowing for a case where Requeue is explicitly set to |
To summarize: you want an error condition that says "don't requeue" (i.e. what we've referred to as ignorable errors in #377)? |
/kind feature If so, can we move the discussion over there? |
Hi @DirectXMan12, Yes, but I also wonder now if this is the right way to do this. After speaking more about this with @detiber and @vincepri, I created this example that relies on asserting expected, eventual state of the objects. |
Yeah, we've thus far been recommending writing tests like that (using eventually and/or consistently), since it allows you to update your logic to occur over multiple reconciles, etc, w/o needing to update your tests. P.S. Ginkgo tip: Also, |
Yes, the problem is very similar to #377. The specific use case that is difficult to test for with the current model: We have some resources that we want to wait until they have an OwnerRef from a related resource prior to reconciling. Currently we have no way to test Reconcile in a way that tells us: 1) we haven't mutated the resource 2) We haven't completed a full reconciliation without requeueing. Since the update to the ownerRef would trigger a new reconciliation, requeueing is pretty pointless here. |
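A rough sketch of why this is awkward to test, using local stand-in types rather than controller-runtime's own (Object, Result and reconcile are all hypothetical names): both the "still waiting for an OwnerRef" path and the "fully reconciled" path return the same empty result and a nil error, so a test can only tell them apart by inspecting the object.

```go
package main

import "fmt"

// Object and Result are local stand-ins for illustration only.
type Object struct {
	Name      string
	OwnerRefs []string
	Ready     bool
}

type Result struct {
	Requeue bool
}

// reconcile waits for an owner reference before doing any work.
func reconcile(obj *Object) (Result, error) {
	if len(obj.OwnerRefs) == 0 {
		// Incomplete, but nothing to requeue: the update that adds the
		// OwnerRef will trigger a new reconcile through the watch.
		return Result{}, nil
	}
	obj.Ready = true
	return Result{}, nil
}

func main() {
	waiting := &Object{Name: "a"}
	owned := &Object{Name: "b", OwnerRefs: []string{"cluster-1"}}

	r1, err1 := reconcile(waiting)
	r2, err2 := reconcile(owned)

	// The return values are identical; only the object state differs.
	fmt.Println(r1 == r2, err1 == nil && err2 == nil, waiting.Ready, owned.Ready)
}
```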
Hi @DirectXMan12, Thank you again for your suggestions. I implemented most of them here! |
I came through this and #377 as well. @DirectXMan12 Some thoughts: One approach:
Another approach:
|
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
I'm tempted to lean towards tying this into ignorable errors, because it seems to be another side of the same die, if you will. There are a few options there:
|
In the case of "ignorable errors", it's about translating "not found" to "decided not to do job until all dependencies are available" automatically. In this case, it seems like it's about attaching additional information about why we're requeuing -- the error says "we didn't actually do what you asked", which is useful for diagnostic and testing purposes (as opposed to "we did what you asked, and we're trying again for whatever reason"). Practical outcome wise, they're pretty much the same, but it's nice to be able to see why things occurred. |
I'm not sure I like the terminology "ignorable error", but the behavior would indeed fit our use case. The main thing we'd like to accomplish is a way to say that reconciliation wasn't completed (generally due to waiting on some dependency), but not to force a requeue since we would already have a watch registered for the resource(s) involved. I think both option 1 and 2 would fit our needs well, but maybe some other term instead of 'ignorable'? In our use case we aren't as much ignoring the error as much as trying to avoid unnecessary reconciliations of the resource since we'll get an update from the watch when we should re-reconcile. |
Sure, of course :-) If you have ideas, lmk |
If discussing a property of an Error, maybe something to the effect of 'do not requeue'. I'm not a huge fan of the negative, but it would help for the purposes of defaulting. For a test wrapper maybe |
sure, seems reasonable |
/kind design |
/priority important-longterm |
Returning an error from a Reconciler seems to have limited value. It's hard to know what to do in response that's useful. I think we should consider changing the Reconciler interface so that it does not return an error at all. When a Reconciler returns an error, three things happen (I'm ignoring metrics for the moment):
(2) is hard to get right, as identified by this issue. It's hard to know which errors deserve a retry. Isn't it better to just let the Reconciler make its own decision, and set

(3) I'm not sure what the value is in this. What's it accomplishing?

(1) also seems limited in value. Should all errors be logged the same way? Do all errors deserve to be logged at all? Generally no. We could instead expect the Reconciler to do its own error handling and log a useful message if/when/how appropriate. An optional helper function to generically log errors from within the Reconciler would be just as useful as the log behavior today.

Back to metrics, as it is today with a generic error count, it's not clear what is being measured. I don't think a broad error count, where an error could mean many different things, is actionable or particularly useful. A Reconciler implementation can capture its own metrics that are more meaningful.

Lastly to testing. As already observed, it's hard to define what it means for a Reconciler run to be "incomplete". Many Reconcilers are designed to make incremental progress and re-run many times while converging to desired state. Perhaps what is most useful to communicate in terms of "completeness" is whether the Reconcile logic is blocked from progressing toward desired state. I'm not sure if there's a good generic way to capture that, or if that should be part of the Reconciler interface at all. Maybe that's best handled and tested as an implementation detail behind the Reconciler interface.

In many cases when progress toward desired state is blocked, it's useful to communicate that on the object's Status. An example is when a required Secret is missing or has invalid credentials, because it's a problem that the API user can fix. In these cases, the Status is a natural place for a test to determine success or failure.

In sum, when a Reconciler returns an error, it's hard to know what to do with it. Rather than have an interface that lets a Reconciler pass us an error and something that helps us understand how to handle it, we could just let the Reconciler handle it. |
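A minimal sketch of what that proposal might look like, assuming an interface with no error return and the missing-Secret example from above. Every type here (Request, Result, Status, secretReconciler) is a local, hypothetical stand-in, and the in-memory maps only simulate the API server.

```go
package main

import (
	"fmt"
	"time"
)

type Request struct{ Name string }

type Result struct {
	Requeue      bool
	RequeueAfter time.Duration
}

// Reconciler is the hypothetical error-free interface: the implementation
// decides for itself whether and how to retry.
type Reconciler interface {
	Reconcile(req Request) Result
}

type Status struct{ Condition string }

type secretReconciler struct {
	secrets map[string]string // name -> credential; stands in for the API server
	status  map[string]Status // what would be written to the object's Status
}

var _ Reconciler = (*secretReconciler)(nil)

func (r *secretReconciler) Reconcile(req Request) Result {
	cred, ok := r.secrets[req.Name]
	switch {
	case !ok:
		// Blocked on something the API user can fix: report it on Status
		// and wait for the watch on the Secret instead of requeueing.
		r.status[req.Name] = Status{Condition: "SecretMissing"}
	case cred == "":
		r.status[req.Name] = Status{Condition: "SecretInvalid"}
	default:
		r.status[req.Name] = Status{Condition: "Ready"}
	}
	return Result{}
}

func main() {
	r := &secretReconciler{
		secrets: map[string]string{"good": "token", "empty": ""},
		status:  map[string]Status{},
	}
	for _, name := range []string{"good", "empty", "absent"} {
		r.Reconcile(Request{Name: name})
		fmt.Println(name, r.status[name].Condition)
	}
}
```

A test then asserts on the Status rather than on a returned error.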
I like @mhrivnak's proposal, however if we go down that path, it would be nice if there was a way in the result to distinguish between:
I'd rather not have to attempt to bolt on some type of backoff mechanism on top of the current result struct. |
Agreed. It seems that right now, setting |
I like this idea! As a small side note, I don't think you get backoff while using RequeueAfter even if Requeue is true (in the non-error case). |
That's correct (just double-checked the code) -- Requeue and RequeueAfter are mutually exclusive (/me grumbles about Go's lack of tagged unions): controller-runtime/pkg/internal/controller/controller.go Lines 262 to 275 in e00985b
|
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/lifecycle frozen |
In the reconciler, we considered a pending uninstall operation as an error. It resulted in slower reconciliation because of exponential backoff. To avoid the exponential backoff, we need to return the request with the requeueAfter value set. See: kubernetes-sigs/controller-runtime#617 Signed-off-by: Raghavendra Talur <raghavendra.talur@gmail.com>
- Reduce log noise by logging errors instead of successes
- Use context logger provided by controller-runtime
- Patch status instead of update to avoid "the object has been modified; please apply your changes to the latest version and try again"
- Add finalizer even if object is already under deletion, in case we never got a chance yet
- Don't set RequeueAfter on errors since it is ignored anyway [0]

[0]: kubernetes-sigs/controller-runtime#617

Change-Id: Ic06aa74f9e1465d3f7e32137559231e940c8a74d
After discussing this with @detiber, we realized there's no good solution for the following case:
Instead the current logic is:
- If result.RequeueAfter > 0, then the request is added to the queue for processing after the value specified by result.RequeueAfter
- If result.Requeue is true, then the request is added to the queue with the same exponential backoff logic used when an error is returned

Today there is no way to indicate a reconciliation is incomplete without also having the request requeued by the manager, either via an explicit amount of time or the exponential backoff logic (due to error or Requeue == true).

There should be a way to signal:
Thanks!