
Test CI with different controller-runtime versions#1393

Closed
matheuscscp wants to merge 3 commits into main from test-cr-022-e2e

Conversation

@matheuscscp
Member

Summary

Test commit to verify the upgrade-fail-remediate-uninstall e2e test passes with controller-runtime v0.22.4 (before the v0.23.0 upgrade).

This will help verify the theory that the test was passing due to controller-runtime v0.22.4's behavior and fails with v0.23.0.

Test plan

  • Watch CI results for the upgrade-fail-remediate-uninstall test

🤖 Generated with Claude Code

matheuscscp and others added 2 commits January 27, 2026 18:33
This is a test commit to verify the upgrade-fail-remediate-uninstall e2e test
passes with controller-runtime v0.22.4 (before the v0.23.0 upgrade).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Testing if the upgrade-fail-remediate-uninstall e2e test fails with
controller-runtime v0.23.0 due to stricter terminal error handling.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@matheuscscp matheuscscp changed the title Test CI with controller-runtime v0.22.4 Test CI with different controller-runtime versions Jan 27, 2026
@matheuscscp
Member Author

Root Cause Analysis: upgrade-fail-remediate-uninstall e2e Test Failure

Summary

The test failure is caused by a behavioral difference between controller-runtime v0.22.4 and v0.23.0, specifically due to the priority queue being enabled by default in v0.23.0 (PR #3332).

Evidence

  1. First commit (controller-runtime v0.22.4): Test passed ✅
  2. Second commit (controller-runtime v0.23.0): Test failed ❌

Queue Behavior Difference

Standard Workqueue (v0.22.4)

The standard workqueue has a dirty set mechanism in client-go/util/workqueue/queue.go:

// Add (simplified; locking and metrics omitted): if the item is already
// being processed, mark it dirty but don't queue it yet.
func (q *Typed[T]) Add(item T) {
    if q.dirty.has(item) {
        return // already queued or pending
    }
    q.dirty.insert(item)
    if q.processing.has(item) {
        return // don't push to the queue yet - just mark dirty
    }
    q.queue.Push(item)
}

// Done (simplified): re-queue the item if it was re-added while
// it was being processed.
func (q *Typed[T]) Done(item T) {
    q.processing.delete(item)
    if q.dirty.has(item) {
        q.queue.Push(item) // AUTO RE-QUEUE!
    }
}

Priority Queue (v0.23.0)

The priority queue in controller-runtime/pkg/controller/priorityqueue does NOT have this mechanism:

// Done: Just unlock the item, no automatic re-queue
func (w *priorityqueue[T]) Done(item T) {
    w.locked.Delete(item)
    w.notifyReadyItemOrWaiterAdded()
    // NO dirty check, NO automatic re-queue
}

How This Affects the Test

The upgrade-fail-remediate-uninstall test expects this flow:

  1. Initial install succeeds
  2. Upgrade fails → upgradeFailures == 1
  3. Uninstall remediation runs (removes the release)
  4. Fresh install is attempted and fails → installFailures == 1

The problem is in step 3→4. After uninstall remediation, the controller returns a terminal error (due to early stall detection checking upgrade retries). Terminal errors don't trigger automatic requeue.

In v0.22.4: The standard workqueue's dirty mechanism (or similar de-duplication behavior) caused the item to be re-queued despite the terminal error, allowing step 4 to happen.

In v0.23.0: The priority queue's stricter de-duplication prevents this re-queue, so step 4 never happens and the test times out.

CI Log Evidence

v0.22.4 (passing):

00:32:29.210Z - Reconciler error: "terminal error: exceeded maximum retries"
00:32:52.154Z - New reconciliation starts (23s later)
00:32:52.173Z - running 'install' action

v0.23.0 (failing):

13:01:58.411Z - Reconciler error: "terminal error: exceeded maximum retries"
(no subsequent reconciliation - test times out after 2 minutes)

Conclusion

The test was passing in v0.22.4 due to an unintended side effect of the standard workqueue's behavior, not because the underlying logic was correct. The priority queue in v0.23.0 exposes a pre-existing bug in the early stall detection logic after uninstall remediation.

The fix will be addressed in a separate PR.

@matheuscscp matheuscscp deleted the test-cr-022-e2e branch January 27, 2026 19:25
@matheuscscp
Member Author

@sbueringer @alvaroaleman Since we have just discussed this, can you please confirm if the analysis above sounds correct?

@alvaroaleman

It's not correct. Both queues have locking that prevents handing out an item that was already handed out and for which Done() wasn't called yet; it's processing in upstream and locked in the PQ.

The upstream queue handles the situation you are describing by not putting the item into the queue from which it serves Get()s, only doing that once Done() was called. The PQ handles it by skipping locked items in handleReady, which is what gets triggered by notifyReadyItemOrWaiterAdded(); this means that if a locked item was the only one available, it will be handed out right after Done() was called.

@matheuscscp
Member Author

Thanks! Any insights about what could be different between the two queues that was causing this unintended side-effect on a terminal error?

@matheuscscp
Member Author

Or maybe something else entirely unrelated to the queues between v0.22.4 and v0.23 that could explain this apparent second reconciliation that was hiding this bug in helm-controller?

@alvaroaleman

I don't know enough about the helm-controller to answer your question. The terminal error flow is unrelated to the workqueue; all it does is skip the Add call on the workqueue, which means the mechanism works the same regardless of which queue you use.

If you could add a debug wrapper around the queue that prints Add calls I can have another look.

@matheuscscp
Member Author

If you could add a debug wrapper around the queue that prints Add calls I can have another look.

Using v0.22.4, I assume?

@alvaroaleman

Using v0.22.4, I assume?

Either; the sequence of adds shouldn't change. It doesn't only need to print Add but also AddAfter and AddRateLimited, which might be easier with the PQ as it all ends up in AddWithOpts there.

@alvaroaleman

Or maybe something else entirely unrelated to the queues between v0.22.4 and v0.23 that could explain this apparent second reconciliation that was hiding this bug in helm-controller?

You can a/b test that by using 0.23 but disabling the priority queue; it's just a default.
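For reference, opting out of the priority queue is a per-controller option in controller-runtime. A sketch of what that might look like (assuming the `UsePriorityQueue` field on `controller.Options`; `v2.HelmRelease` and `r` stand in for the real reconciled type and reconciler, so check your controller-runtime version for the exact field):

```go
import (
	"k8s.io/utils/ptr"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

// Keep controller-runtime v0.23 but fall back to the standard workqueue,
// isolating the priority queue as the variable under test.
err := ctrl.NewControllerManagedBy(mgr).
	For(&v2.HelmRelease{}).
	WithOptions(controller.Options{
		UsePriorityQueue: ptr.To(false),
	}).
	Complete(r)
```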

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@matheuscscp matheuscscp reopened this Jan 27, 2026
@matheuscscp
Member Author

Yeah, as the last commit shows, the behavior difference is in the PQ... I'm now debugging locally with your suggestion of wrapping the queue, will report if I find anything

@matheuscscp
Member Author

Debug Findings

I added a debug wrapper around the queue to log Add, AddAfter, AddRateLimited, Get, and Done calls. Here's what I found:

Timeline with Priority Queue (v0.23)

20:12:51.605Z - AddAfter scheduled for 30s (should fire at ~20:13:21.6Z)
20:12:51.605Z - Done (first reconcile completes after successful install)
20:12:57.622Z - Add called (upgrade manifest applied by user)
20:12:57.623Z - Get (upgrade reconcile starts)
20:13:01.575Z - Done (terminal error: exceeded maximum retries)
~20:13:21.6Z - AddAfter should fire but NEVER DOES!

Result

  • With priority queue: upgradeFailures: 1, installFailures: null
  • With standard workqueue: upgradeFailures: 1, installFailures: 1

Analysis

The AddAfter scheduled before the upgrade reconcile (at 20:12:51) is being cancelled when the Add is called at 20:12:57. With the standard workqueue, this AddAfter would fire after the terminal error and trigger a new reconcile that attempts (and fails) the install.

This confirms the priority queue handles pending AddAfter timers differently than the standard workqueue when an Add is called for the same item.
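A debug wrapper like the one used here could look roughly as follows (a sketch, assuming controller-runtime's `controller.Options.NewQueue` hook; the wrapper type and log format are illustrative):

```go
import (
	"log"
	"time"

	"k8s.io/client-go/util/workqueue"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// loggingQueue wraps the workqueue and logs every call relevant to
// requeue behavior, delegating to the embedded queue.
type loggingQueue struct {
	workqueue.TypedRateLimitingInterface[reconcile.Request]
}

func (q *loggingQueue) Add(item reconcile.Request) {
	log.Printf("Add(%s)", item)
	q.TypedRateLimitingInterface.Add(item)
}

func (q *loggingQueue) AddAfter(item reconcile.Request, d time.Duration) {
	log.Printf("AddAfter(%s, %s)", item, d)
	q.TypedRateLimitingInterface.AddAfter(item, d)
}

func (q *loggingQueue) AddRateLimited(item reconcile.Request) {
	log.Printf("AddRateLimited(%s)", item)
	q.TypedRateLimitingInterface.AddRateLimited(item)
}

func (q *loggingQueue) Get() (reconcile.Request, bool) {
	item, shutdown := q.TypedRateLimitingInterface.Get()
	log.Printf("Get() -> %s", item)
	return item, shutdown
}

func (q *loggingQueue) Done(item reconcile.Request) {
	log.Printf("Done(%s)", item)
	q.TypedRateLimitingInterface.Done(item)
}
```

Such a wrapper would be installed via `controller.Options{NewQueue: ...}`, returning `&loggingQueue{...}` around whichever underlying queue is under test.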

@matheuscscp
Member Author

Oops, looks like I already found something ( 🤖 😅 )

@alvaroaleman

Yeah, the PQ de-duplicates AddAfter with Add, the upstream wq does not. That this is missing in the upstream wq is a bug: kubernetes/kubernetes#110642

Looking at your timeline, clearly there was a change between the Get() and the Done() that returned a terminal error, and that change means the determination to return a terminal error is no longer correct. That change needs to emit an event and everything will be fine.

@matheuscscp
Member Author

Thanks very much for all the help here!
