
Using patch to set finalizers might lead to having reconciled resources that are never finalized #2828

Closed
pierDipi opened this issue Sep 20, 2023 · 16 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@pierDipi
Member

pierDipi commented Sep 20, 2023

Expected Behavior

Once ReconcileKind is called, it should be guaranteed that FinalizeKind will be called.

Actual Behavior

When a resource is created and very quickly deleted while it is being reconciled, the FinalizeKind method is never called even though the ReconcileKind method was called once.

Here is the sequence of events relevant to the issue:

  • reconciler starts reconciling the resource (the top level Reconcile(key string) method)
  • resource is deleted
  • resource finalizer is successfully set (even on a deleted resource) using PATCH
  • reconciler reconciles the resource (ReconcileKind)
  • FinalizeKind is never called, I think because the finalizer wasn't present at the time of the deletion, so the object is simply gone

Controller logs:

knative-eventing.kafka-controller-575c6db9cb-r5fzn.controller.log:{"level":"info","ts":"2023-09-11T09:24:20.503Z","logger":"kafka-broker-controller.event-broadcaster","caller":"record/event.go:285","msg":"Event(v1.ObjectReference{Kind:\"Consumer\", Namespace:\"ks-erlst-multi-ksvc-0\", Name:\"73d369d4-bf91-46d6-82b2-88477f9b0d30-6cdbg\", UID:\"3278232b-1f99-4286-a295-328fb2ca91a1\", APIVersion:\"internal.kafka.eventing.knative.dev/v1alpha1\", ResourceVersion:\"1082428\", FieldPath:\"\"}): type: 'Normal' reason: 'FinalizerUpdate' Updated \"73d369d4-bf91-46d6-82b2-88477f9b0d30-6cdbg\" finalizers","knative.dev/pod":"kafka-controller-575c6db9cb-r5fzn"}
knative-eventing.kafka-controller-575c6db9cb-r5fzn.controller.log:{"level":"info","ts":"2023-09-11T09:24:20.551Z","logger":"kafka-broker-controller","caller":"controller/controller.go:550","msg":"Reconcile succeeded","knative.dev/pod":"kafka-controller-575c6db9cb-r5fzn","knative.dev/controller":"knative.dev.eventing-kafka-broker.control-plane.pkg.reconciler.consumer.Reconciler","knative.dev/kind":"internal.kafka.eventing.knative.dev.Consumer","knative.dev/traceid":"3f547f0e-d36c-4c4d-bd09-41e371f2ab30","knative.dev/key":"ks-erlst-multi-ksvc-0/73d369d4-bf91-46d6-82b2-88477f9b0d30-w69f2","duration":0.000008909}
knative-eventing.kafka-controller-575c6db9cb-r5fzn.controller.log:{"level":"warn","ts":"2023-09-11T09:24:20.587Z","logger":"kafka-broker-controller","caller":"consumer/reconciler.go:277","msg":"Failed to update resource status","knative.dev/pod":"kafka-controller-575c6db9cb-r5fzn","knative.dev/controller":"knative.dev.eventing-kafka-broker.control-plane.pkg.reconciler.consumer.Reconciler","knative.dev/kind":"internal.kafka.eventing.knative.dev.Consumer","knative.dev/traceid":"27ef104c-5bfa-4ca1-9926-8ac7d58d133e","knative.dev/key":"ks-erlst-multi-ksvc-0/73d369d4-bf91-46d6-82b2-88477f9b0d30-6cdbg","targetMethod":"ReconcileKind","error":"consumers.internal.kafka.eventing.knative.dev \"73d369d4-bf91-46d6-82b2-88477f9b0d30-6cdbg\" not found"}
knative-eventing.kafka-controller-575c6db9cb-r5fzn.controller.log:{"level":"error","ts":"2023-09-11T09:24:20.588Z","logger":"kafka-broker-controller","caller":"controller/controller.go:566","msg":"Reconcile error","knative.dev/pod":"kafka-controller-575c6db9cb-r5fzn","knative.dev/controller":"knative.dev.eventing-kafka-broker.control-plane.pkg.reconciler.consumer.Reconciler","knative.dev/kind":"internal.kafka.eventing.knative.dev.Consumer","knative.dev/traceid":"27ef104c-5bfa-4ca1-9926-8ac7d58d133e","knative.dev/key":"ks-erlst-multi-ksvc-0/73d369d4-bf91-46d6-82b2-88477f9b0d30-6cdbg","duration":0.101347085,"error":"consumers.internal.kafka.eventing.knative.dev \"73d369d4-bf91-46d6-82b2-88477f9b0d30-6cdbg\" not found","stacktrace":"knative.dev/pkg/controller.(*Impl).handleErr\n\t/remote-source/knative-kafka-broker/app/vendor/knative.dev/pkg/controller/controller.go:566\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\t/remote-source/knative-kafka-broker/app/vendor/knative.dev/pkg/controller/controller.go:543\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\t/remote-source/knative-kafka-broker/app/vendor/knative.dev/pkg/controller/controller.go:491"}
knative-eventing.kafka-controller-575c6db9cb-r5fzn.controller.log:{"level":"info","ts":"2023-09-11T09:24:20.588Z","logger":"kafka-broker-controller.event-broadcaster","caller":"record/event.go:285","msg":"Event(v1.ObjectReference{Kind:\"Consumer\", Namespace:\"ks-erlst-multi-ksvc-0\", Name:\"73d369d4-bf91-46d6-82b2-88477f9b0d30-6cdbg\", UID:\"3278232b-1f99-4286-a295-328fb2ca91a1\", APIVersion:\"internal.kafka.eventing.knative.dev/v1alpha1\", ResourceVersion:\"1082428\", FieldPath:\"\"}): type: 'Warning' reason: 'UpdateFailed' Failed to update status for \"73d369d4-bf91-46d6-82b2-88477f9b0d30-6cdbg\": consumers.internal.kafka.eventing.knative.dev \"73d369d4-bf91-46d6-82b2-88477f9b0d30-6cdbg\" not found","knative.dev/pod":"kafka-controller-575c6db9cb-r5fzn"}

The webhook logs show the DELETE action on the resource twice:

ks-erlst-multi-ksvc-0/knative-eventing.kafka-webhook-eventing-5656bc46bd-gvrl9.kafka-webhook-eventing.log:{"level":"info","ts":"2023-09-11T09:24:20.496Z","logger":"kafka-webhook-eventing","caller":"webhook/admission.go:151","msg":"remote admission controller audit annotations=map[string]string(nil)","knative.dev/pod":"kafka-webhook-eventing-5656bc46bd-gvrl9","knative.dev/kind":"internal.kafka.eventing.knative.dev/v1alpha1, Kind=Consumer","knative.dev/namespace":"ks-erlst-multi-ksvc-0","knative.dev/name":"73d369d4-bf91-46d6-82b2-88477f9b0d30-6cdbg","knative.dev/operation":"DELETE","knative.dev/resource":"internal.kafka.eventing.knative.dev/v1alpha1, Resource=consumers","knative.dev/subresource":"","knative.dev/userinfo":"system:serviceaccount:knative-eventing:kafka-controller","admissionreview/uid":"645028f4-c6b5-4091-b95f-98509a4890b6","admissionreview/allowed":true,"admissionreview/result":"nil"}
ks-erlst-multi-ksvc-0/knative-eventing.kafka-webhook-eventing-5656bc46bd-gvrl9.kafka-webhook-eventing.log:{"level":"info","ts":"2023-09-11T09:24:20.526Z","logger":"kafka-webhook-eventing","caller":"webhook/admission.go:151","msg":"remote admission controller audit annotations=map[string]string(nil)","knative.dev/pod":"kafka-webhook-eventing-5656bc46bd-gvrl9","knative.dev/kind":"internal.kafka.eventing.knative.dev/v1alpha1, Kind=Consumer","knative.dev/namespace":"ks-erlst-multi-ksvc-0","knative.dev/name":"73d369d4-bf91-46d6-82b2-88477f9b0d30-6cdbg","knative.dev/operation":"DELETE","knative.dev/resource":"internal.kafka.eventing.knative.dev/v1alpha1, Resource=consumers","knative.dev/subresource":"","knative.dev/userinfo":"system:serviceaccount:knative-eventing:kafka-controller","admissionreview/uid":"81315fdc-1ccc-4b12-968e-468a3397bfe4","admissionreview/allowed":true,"admissionreview/result":"nil"}

Note: the timestamps come from different pods, so clock skew might be present.

Steps to Reproduce the Problem

This is generally hard to reproduce consistently but you would need to:

  1. Create resource
  2. Delete resource

Additional Info

/kind bug

@pierDipi
Member Author

cc @dprotaso

@dprotaso
Member

What events appeared on the informer watch in this instance?

@pierDipi
Member Author

I don't exactly know but every event goes through the same generated reconciler: https://github.com/knative-extensions/eventing-kafka-broker/blob/5c4461d0a616401a983136602d09a95fbff4e514/control-plane/pkg/reconciler/consumer/controller.go#L71C2-L71C81

consumerInformer.Informer().AddEventHandler(controller.HandleAll(impl.Enqueue))

@dprotaso
Member

Can't really recommend anything if we don't know what the API watch is reporting

@pierDipi
Member Author

But even if the informer sees the delete event, we can't rely on that, since it's not guaranteed to be observed if the controllers are down during that time period, right?

@pierDipi
Member Author

pierDipi commented Sep 25, 2023

I think the problem here is that the way we set finalizers with a patch doesn't have optimistic concurrency control, so there is no guarantee that the resource hasn't been deleted between fetching it from the API server and setting the finalizers.

@dprotaso
Member

I'm wondering if the informer even gets an event after the finalizer is added

I think the problem here is that the way we set finalizers with patch doesn't have optimistic concurrency control

That's a good point - we can add the resourceVersion to the patch to ensure optimistic concurrency

@pierDipi
Member Author

Is that possible?

From https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.27/

Patch: Patch will apply a change to a specific field. How the change is merged is defined per field. Lists may either be replaced or merged. Merging lists will not preserve ordering.
Patches will never cause optimistic locking failures, and the last write will win. Patches are recommended when the full state is not read before an update, or when failing on optimistic locking is undesirable. When patching complex types, arrays and maps, how the patch is applied is defined on a per-field basis and may either replace the field's current value, or merge the contents into the current value.

@pierDipi
Member Author

I think we can just emit an update; I can't see a reason to use Patch there. I've tried it on one generated reconciler:

index ffc2b022..9df5a804 100644
--- a/client/injection/apiextensions/reconciler/apiextensions/v1/customresourcedefinition/reconciler.go
+++ b/client/injection/apiextensions/reconciler/apiextensions/v1/customresourcedefinition/reconciler.go
@@ -20,7 +20,6 @@ package customresourcedefinition
 
 import (
        context "context"
-       json "encoding/json"
        fmt "fmt"
 
        zap "go.uber.org/zap"
@@ -364,25 +363,14 @@ func (r *reconcilerImpl) updateFinalizersFiltered(ctx context.Context, resource
                finalizers = existingFinalizers.List()
        }
 
-       mergePatch := map[string]interface{}{
-               "metadata": map[string]interface{}{
-                       "finalizers":      finalizers,
-                       "resourceVersion": existing.ResourceVersion,
-               },
-       }
-
-       patch, err := json.Marshal(mergePatch)
-       if err != nil {
-               return resource, err
-       }
+       existing.Finalizers = finalizers
 
-       patcher := r.Client.ApiextensionsV1().CustomResourceDefinitions()
+       updater := r.Client.ApiextensionsV1().CustomResourceDefinitions()
 
-       resourceName := resource.Name
-       updated, err := patcher.Patch(ctx, resourceName, types.MergePatchType, patch, metav1.PatchOptions{})
+       updated, err := updater.Update(ctx, existing, metav1.UpdateOptions{})
        if err != nil {
                r.Recorder.Eventf(existing, corev1.EventTypeWarning, "FinalizerUpdateFailed",
-                       "Failed to update finalizers for %q: %v", resourceName, err)
+                       "Failed to update finalizers for %q: %v", existing.Name, err)
        } else {
                r.Recorder.Eventf(updated, corev1.EventTypeNormal, "FinalizerUpdate",
                        "Updated %q finalizers", resource.GetName())

@dprotaso
Member

Is that possible?

Yup - we had it in serving but dropped it because we didn't need the guarantees

ref: knative/serving#6211

You can also add a metadata.uid to ensure you're adding the finalizer to the correct object - that could be useful here too

ref: kubernetes/kubernetes#86320

@pierDipi
Member Author

Hmm, we use the same idea here in pkg:

"resourceVersion": existing.ResourceVersion,

but is it really working the way we expect?

@dprotaso
Member

You can test it out - it was working beforehand, hence the update conflicts and Markus wanting to remove it

@pierDipi
Member Author

In that case, I can't really explain why the finalization wasn't done and the object was deleted without being properly finalized

@pierDipi
Member Author

But as you see from the logs, while we were reconciling the object, it was deleted without finalization and that's why the status update failed with "resource not found"

@dprotaso
Member

I think I need more data to know what the right approach is - given what you said then I don't think your solution of doing an update will solve this problem.

I'd reach out to sig-api-machinery on the Kubernetes Slack for input.

@github-actions
Contributor

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 27, 2023

2 participants