Skip to content

Conversation

@Danil-Grigorev
Copy link

@Danil-Grigorev Danil-Grigorev commented Mar 27, 2025

In the existing dynamic watcher setup there is an issue with namespace controller predicate setup. In case when a cluster existed in a namespace before the addon provider controller starts, it does not get reconciled. It does not occur for clusters created in the matching namespace, after controller start.

This change removes label predicate which fixes the problem in testing, as well as addresses an issue occurring during namespace deletion.

Removing namespace with an imported cluster causes deletion to get stuck on not yet removed CAAPF finalizer. This is due to patch method creating an v1.Event resource in the current namespace, which fails when the namespace is marked for deletion, preventing latter finalizer removal.

Signed-off-by: Danil-Grigorev <danil.grigorev@suse.com>
Copy link
Contributor

@yiannistri yiannistri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So when the resource is marked for deletion any updates are forbidden, which I think makes sense. If I understood this correctly, the main fix for the finalizer that gets stuck is to ignore forbidden errors after publishing create/update event.

@Danil-Grigorev
Copy link
Author

@yiannistri it is a combination of 2 things. Event is a resource which is created in the namespace, and while it is being deleted it errors out before finalizer removal. That is just something I noticed accidentally. Finalizers are a pain everywhere, much rather would not rely on them, but some resources which need to be removed are outside of cluster namespace, so resource ownership is not an option.

The actual issue is fixed by dropping labels predicate on the namespace controller and waiting for the controllers to start watching on dynamic events stream. This ensures that every event would get to a consumer controller, while previously if a namespace was queued twice and had no label change, it would not reach the reconcile.

I have a suspicion that it is something which may have also happened on error requeue, so the whole setup had only one chance of succeeding. I would need to confirm it though.

@Danil-Grigorev Danil-Grigorev added this pull request to the merge queue Mar 28, 2025
Merged via the queue into rancher:main with commit 6caa96f Mar 28, 2025
11 checks passed
@Danil-Grigorev Danil-Grigorev deleted the delay-until-first-subscriber branch March 28, 2025 10:25
Danil-Grigorev added a commit to Danil-Grigorev/cluster-api-addon-provider-fleet that referenced this pull request Mar 28, 2025
…#232)

Signed-off-by: Danil-Grigorev <danil.grigorev@suse.com>
Danil-Grigorev added a commit that referenced this pull request Mar 28, 2025
…234)

Signed-off-by: Danil-Grigorev <danil.grigorev@suse.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants