Skip to content

New default queue from controller-runtime causes uninstall upgrade remediation to leave the release uninstalled #1394

@matheuscscp

Description

@matheuscscp

After my investigation with help from Alvaro (controller-runtime maintainer) in #1393, we found out that an e2e test was passing due to a bug in the upstream workqueue. This was exposed by switching to the controller-runtime priorityqueue (the new default queue since c-r v0.23).

This test, specifically:

      - name: Run upgrade fail with uninstall remediation strategy test
        run: |
          test_name=upgrade-fail-remediate-uninstall

Which uses the following HelmRelease after a successful install:

apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
  name: upgrade-fail-remediate-uninstall
spec:
  interval: 30s
  chart:
    spec:
      chart: podinfo
      version: '>=6.0.0 <7.0.0'
      sourceRef:
        kind: HelmRepository
        name: podinfo
      interval: 10m
  upgrade:
    remediation:
      remediateLastFailure: true
      strategy: uninstall
  values:
    resources:
      requests:
        cpu: 100m
        memory: 64Mi
    # Make wait fail. With single replica helm doesn't actually wait, see:
    # https://github.com/helm/helm/issues/5814#issuecomment-567130226
    replicaCount: 2
    faults:
      unready: true
  timeout: 3s

This test expects that the release will be installed after the uninstall remediation (this install should also fail, like the upgrade that caused the uninstall remediation).

When reading the docs for upgrade remediation, my understanding is that the configuration above should uninstall the release and leave it like that, as a terminal failure. There's nothing in the docs saying that a reconciliation should be enqueued (e.g. with Requeue: true or RequeueAfter: some-time) in case an uninstall upgrade remediation takes places, so an install can be attempted later. The current state of the code agrees with this interpretation: a terminal error is returned to controller-runtime, but the workqueue bug causes a reconciliation to be triggered, which causes an install action to take place. However, helm-controller is GA, and the current behavior, even if due to this hidden bug, is what is currently considered GA.

We have two alternatives:

  1. Intentionally cause the requeue in the code to maintain this behavior and document it
  2. Fix the e2e test and remove the expectation of an install after the uninstall remediation

Since helm-controller is GA, I suppose we should go with 1?

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions