Skip to content

Conversation

@jstudler
Copy link
Collaborator

In some cases, the Prefix and IpAddress reconcilers are stuck because a resource update prior to unlocking failed (e.g. due to concurrent edit in the kube-api). This could be e.g. the following line https://github.com/netbox-community/netbox-operator/blob/main/internal/controller/prefix_controller.go#L182

The next run of the reconciler will have nil for ll in https://github.com/netbox-community/netbox-operator/blob/main/internal/controller/prefix_controller.go#L203 and thus not try to unlock it. Example log output of the netbox-operator:

2024-09-12T11:48:16+02:00       ERROR   Reconciler error        {"controller": "prefix", "controllerGroup": "netbox.dev", "controllerKind": "Prefix", "Prefix": {"name":"example","namespace":"default"}, "namespace": "default", "name": "example", "reconcileID": "1c20d680-fc77-48fb-a949-730a20ebd83a", "error": "Operation cannot be fulfilled on prefixes.netbox.dev \"example\": the object has been modified; please apply your changes to the latest version and try again"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
        /Users/joel/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
        /Users/joel/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:263
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2
        /Users/joel/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:224
2024-09-12T11:48:16+02:00       INFO    prefix reconcile loop started   {"controller": "prefix", "controllerGroup": "netbox.dev", "controllerKind": "Prefix", "Prefix": {"name":"example","namespace":"default"}, "namespace": "default", "name": "example", "reconcileID": "f12a1389-4102-4507-b5b0-21873447dc16"}

After this, the entire reconciler will be blocked because the next object trying to aquire the lock for the same parentPrefix will time out (to the configured 60 seconds) and thus all operations on the same CRs will be blocked for these 60 seconds (each reconcile loop is single threaded).

This code change enhances the unlocking by moving the leaselocker Unlock() call up in the reconciler to the position where it actually makes sense (right after the Prefix or IP was updated in NetBox). We also move up the status update right after unlocking in order not to lose the reference to the object in NetBox.

@jstudler jstudler self-assigned this Sep 12, 2024
@henrybear327 henrybear327 force-pushed the fix/enhance-unlocking branch 2 times, most recently from fbbfbbe to 78049aa Compare September 24, 2024 06:53
@jstudler jstudler requested a review from bruelea September 24, 2024 08:10
Copy link
Collaborator

@bruelea bruelea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!
Small note about the update of the status fields: If the update of the status fields fails the reference to the object in NetBox won't be lost, it will be added the next time the reconcile function runs.

@henrybear327 henrybear327 merged commit 110cbb7 into main Sep 26, 2024
5 checks passed
@henrybear327 henrybear327 deleted the fix/enhance-unlocking branch September 26, 2024 21:57
vaishutin pushed a commit to vaishutin/netbox-operator that referenced this pull request Aug 17, 2025
Enhance unlocking to be more reliable by moving it further up
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants