New default queue from `controller-runtime` causes `uninstall` upgrade remediation to leave the release uninstalled

After my investigation with help from Alvaro (`controller-runtime` maintainer) in #1393, we found out that an e2e test was passing due to a bug in the upstream workqueue. This was exposed by switching to the `controller-runtime` priorityqueue (the new default queue since c-r v0.23).

This test, specifically:

```yaml
      - name: Run upgrade fail with uninstall remediation strategy test
        run: |
          test_name=upgrade-fail-remediate-uninstall
```

Which uses the following HelmRelease after a successful install:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
  name: upgrade-fail-remediate-uninstall
spec:
  interval: 30s
  chart:
    spec:
      chart: podinfo
      version: '>=6.0.0 <7.0.0'
      sourceRef:
        kind: HelmRepository
        name: podinfo
      interval: 10m
  upgrade:
    remediation:
      remediateLastFailure: true
      strategy: uninstall
  values:
    resources:
      requests:
        cpu: 100m
        memory: 64Mi
    # Make wait fail. With single replica helm doesn't actually wait, see:
    # https://github.com/helm/helm/issues/5814#issuecomment-567130226
    replicaCount: 2
    faults:
      unready: true
  timeout: 3s
```

This test expects that the release will be installed after the uninstall remediation (this install should also fail, like the upgrade that caused the uninstall remediation).

When reading the [docs](https://fluxcd.io/flux/components/helm/helmreleases/#upgrade-remediation) for upgrade remediation, my understanding is that the configuration above should uninstall the release and leave it like that, as a terminal failure. There's nothing in the docs saying that a reconciliation should be enqueued (e.g. with `Requeue: true` or `RequeueAfter: some-time`) in case an uninstall upgrade remediation takes places, so an install can be attempted later. The current state of the code agrees with this interpretation: a terminal error is returned to controller-runtime, but the workqueue bug causes a reconciliation to be triggered, which causes an install action to take place. However, helm-controller is GA, and the current behavior, even if due to this hidden bug, is what is currently considered GA.

We have two alternatives:

1. Intentionally cause the requeue in the code to maintain this behavior **and document it**
2. Fix the e2e test and remove the expectation of an install after the uninstall remediation

Since helm-controller is GA, I suppose we should go with 1?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New default queue from `controller-runtime` causes `uninstall` upgrade remediation to leave the release uninstalled #1394

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

New default queue from controller-runtime causes uninstall upgrade remediation to leave the release uninstalled #1394

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

New default queue from `controller-runtime` causes `uninstall` upgrade remediation to leave the release uninstalled #1394