
Test CI with different controller-runtime versions#1393

Closed
matheuscscp wants to merge 3 commits into main from test-cr-022-e2e

Conversation

@matheuscscp
Member

Summary

Test commit to verify the upgrade-fail-remediate-uninstall e2e test passes with controller-runtime v0.22.4 (before the v0.23.0 upgrade).

This will help verify the theory that the test was passing due to controller-runtime v0.22.4's behavior and fails with v0.23.0.

Test plan

  • Watch CI results for the upgrade-fail-remediate-uninstall test

🤖 Generated with Claude Code

matheuscscp and others added 2 commits January 27, 2026 18:33
This is a test commit to verify the upgrade-fail-remediate-uninstall e2e test
passes with controller-runtime v0.22.4 (before the v0.23.0 upgrade).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Testing if the upgrade-fail-remediate-uninstall e2e test fails with
controller-runtime v0.23.0 due to stricter terminal error handling.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@matheuscscp matheuscscp changed the title Test CI with controller-runtime v0.22.4 Test CI with different controller-runtime versions Jan 27, 2026
@matheuscscp
Member Author

Root Cause Analysis: upgrade-fail-remediate-uninstall e2e Test Failure

Summary

The test failure is caused by a behavioral difference between controller-runtime v0.22.4 and v0.23.0, specifically due to the priority queue being enabled by default in v0.23.0 (PR #3332).

Evidence

  1. First commit (controller-runtime v0.22.4): Test passed ✅
  2. Second commit (controller-runtime v0.23.0): Test failed ❌

Queue Behavior Difference

Standard Workqueue (v0.22.4)

The standard workqueue has a dirty set mechanism in client-go/util/workqueue/queue.go:

// Add (simplified; locking and metrics omitted): if the item is already
// being processed, mark it dirty but don't queue it yet.
func (q *Typed[T]) Add(item T) {
    if q.dirty.has(item) {
        return // already queued or pending
    }
    q.dirty.insert(item)
    if q.processing.has(item) {
        return // don't push to the queue yet - just mark dirty
    }
    q.queue.Push(item)
}

// Done (simplified): re-queue the item if it was re-added while
// it was being processed.
func (q *Typed[T]) Done(item T) {
    q.processing.delete(item)
    if q.dirty.has(item) {
        q.queue.Push(item) // AUTO RE-QUEUE!
    }
}

Priority Queue (v0.23.0)

The priority queue in controller-runtime/pkg/controller/priorityqueue does NOT have this mechanism:

// Done: Just unlock the item, no automatic re-queue
func (w *priorityqueue[T]) Done(item T) {
    w.locked.Delete(item)
    w.notifyReadyItemOrWaiterAdded()
    // NO dirty check, NO automatic re-queue
}

How This Affects the Test

The upgrade-fail-remediate-uninstall test expects this flow:

  1. Initial install succeeds
  2. Upgrade fails → upgradeFailures == 1
  3. Uninstall remediation runs (removes the release)
  4. Fresh install is attempted and fails → installFailures == 1

The problem is in step 3→4. After uninstall remediation, the controller returns a terminal error (due to early stall detection checking upgrade retries). Terminal errors don't trigger automatic requeue.

In v0.22.4: The standard workqueue's dirty mechanism (or similar de-duplication behavior) caused the item to be re-queued despite the terminal error, allowing step 4 to happen.

In v0.23.0: The priority queue's stricter de-duplication prevents this re-queue, so step 4 never happens and the test times out.

CI Log Evidence

v0.22.4 (passing):

00:32:29.210Z - Reconciler error: "terminal error: exceeded maximum retries"
00:32:52.154Z - New reconciliation starts (23s later)
00:32:52.173Z - running 'install' action

v0.23.0 (failing):

13:01:58.411Z - Reconciler error: "terminal error: exceeded maximum retries"
(no subsequent reconciliation - test times out after 2 minutes)

Conclusion

The test was passing in v0.22.4 due to an unintended side effect of the standard workqueue's behavior, not because the underlying logic was correct. The priority queue in v0.23.0 exposes a pre-existing bug in the early stall detection logic after uninstall remediation.

The fix will be addressed in a separate PR.

@matheuscscp matheuscscp deleted the test-cr-022-e2e branch January 27, 2026 19:25
@matheuscscp
Member Author

@sbueringer @alvaroaleman Since we have just discussed this, can you please confirm if the analysis above sounds correct?

@alvaroaleman

It's not correct. Both queues have locking that prevents handing out an item that was already handed out and for which Done() wasn't called yet; it's processing in upstream and locked in the PQ.

The upstream queue handles the situation you are describing by not putting the item into the queue from which it serves Get()s, only doing that once Done() was called. The PQ handles it by skipping locked items in handleReady, which is what gets triggered by notifyReadyItemOrWaiterAdded(); this means that if a locked item was the only one available, it will be handed out right after Done() was called.

@matheuscscp
Member Author

Thanks! Any insights about what could be different between the two queues that was causing this unintended side-effect on a terminal error?

@matheuscscp
Member Author

Or maybe something else entirely unrelated to the queues between v0.22.4 and v0.23 that could explain this apparent second reconciliation that was hiding this bug in helm-controller?

@alvaroaleman

I don't know enough about the helm-controller to answer your question. The terminal error flow is unrelated to the workqueue; all it does is skip the Add call on the workqueue, which means the mechanism works the same regardless of which queue you use.

If you could add a debug wrapper around the queue that prints Add calls I can have another look.

@matheuscscp
Member Author

If you could add a debug wrapper around the queue that prints Add calls I can have another look.

Using v0.22.4, I assume?

@alvaroaleman

Using v0.22.4, I assume?

Either; the sequence of adds shouldn't change. It doesn't only need to print Add but also AddAfter and AddRateLimited, which might be easier with the PQ as it all ends up in AddWithOpts there.

@alvaroaleman

Or maybe something else entirely unrelated to the queues between v0.22.4 and v0.23 that could explain this apparent second reconciliation that was hiding this bug in helm-controller?

You can a/b test that by using 0.23 but disabling the priority queue; it's just a default.
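For reference, opting out of the priority queue is a per-controller option in controller-runtime. A sketch of what that might look like (assuming the `UsePriorityQueue` field on `controller.Options`; `v2.HelmRelease` and `r` stand in for the real reconciled type and reconciler, so check your controller-runtime version for the exact field):

```go
import (
	"k8s.io/utils/ptr"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

// Keep controller-runtime v0.23 but fall back to the standard workqueue,
// isolating the priority queue as the variable under test.
err := ctrl.NewControllerManagedBy(mgr).
	For(&v2.HelmRelease{}).
	WithOptions(controller.Options{
		UsePriorityQueue: ptr.To(false),
	}).
	Complete(r)
```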

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@matheuscscp matheuscscp reopened this Jan 27, 2026
@matheuscscp
Member Author

Yeah, as the last commit shows, the behavior difference is in the PQ... I'm now debugging locally with your suggestion of wrapping the queue, will report if I find anything

@matheuscscp
Member Author

Debug Findings

I added a debug wrapper around the queue to log Add, AddAfter, AddRateLimited, Get, and Done calls. Here's what I found:

Timeline with Priority Queue (v0.23)

20:12:51.605Z - AddAfter scheduled for 30s (should fire at ~20:13:21.6Z)
20:12:51.605Z - Done (first reconcile completes after successful install)
20:12:57.622Z - Add called (upgrade manifest applied by user)
20:12:57.623Z - Get (upgrade reconcile starts)
20:13:01.575Z - Done (terminal error: exceeded maximum retries)
~20:13:21.6Z - AddAfter should fire but NEVER DOES!

Result

  • With priority queue: upgradeFailures: 1, installFailures: null
  • With standard workqueue: upgradeFailures: 1, installFailures: 1

Analysis

The AddAfter scheduled before the upgrade reconcile (at 20:12:51) is being cancelled when the Add is called at 20:12:57. With the standard workqueue, this AddAfter would fire after the terminal error and trigger a new reconcile that attempts (and fails) the install.

This confirms the priority queue handles pending AddAfter timers differently than the standard workqueue when an Add is called for the same item.
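A debug wrapper like the one used here could look roughly as follows (a sketch, assuming controller-runtime's `controller.Options.NewQueue` hook; the wrapper type and log format are illustrative):

```go
import (
	"log"
	"time"

	"k8s.io/client-go/util/workqueue"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// loggingQueue wraps the workqueue and logs every call relevant to
// requeue behavior, delegating to the embedded queue.
type loggingQueue struct {
	workqueue.TypedRateLimitingInterface[reconcile.Request]
}

func (q *loggingQueue) Add(item reconcile.Request) {
	log.Printf("Add(%s)", item)
	q.TypedRateLimitingInterface.Add(item)
}

func (q *loggingQueue) AddAfter(item reconcile.Request, d time.Duration) {
	log.Printf("AddAfter(%s, %s)", item, d)
	q.TypedRateLimitingInterface.AddAfter(item, d)
}

func (q *loggingQueue) AddRateLimited(item reconcile.Request) {
	log.Printf("AddRateLimited(%s)", item)
	q.TypedRateLimitingInterface.AddRateLimited(item)
}

func (q *loggingQueue) Get() (reconcile.Request, bool) {
	item, shutdown := q.TypedRateLimitingInterface.Get()
	log.Printf("Get() -> %s", item)
	return item, shutdown
}

func (q *loggingQueue) Done(item reconcile.Request) {
	log.Printf("Done(%s)", item)
	q.TypedRateLimitingInterface.Done(item)
}
```

Such a wrapper would be installed via `controller.Options{NewQueue: ...}`, returning `&loggingQueue{...}` around whichever underlying queue is under test.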

@matheuscscp
Member Author

Oops, looks like I already found something ( 🤖 😅 )

@alvaroaleman

Yeah, the PQ de-duplicates AddAfter with Add, the upstream wq does not. That this is missing in the upstream wq is a bug: kubernetes/kubernetes#110642

Looking at your timeline, clearly there was a change between the Get() and the Done() that returned a terminal error, and that change means the determination to return a terminal error is no longer correct. That change needs to emit an event and everything will be fine.

@matheuscscp
Member Author

Thanks very much for all the help here!
