-
Notifications
You must be signed in to change notification settings - Fork 200
Description
After my investigation with help from Alvaro (controller-runtime maintainer) in #1393, we found out that an e2e test was passing due to a bug in the upstream workqueue. This was exposed by switching to the controller-runtime priorityqueue (the new default queue since c-r v0.23).
This test, specifically:
- name: Run upgrade fail with uninstall remediation strategy test
run: |
test_name=upgrade-fail-remediate-uninstallWhich uses the following HelmRelease after a successful install:
apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
name: upgrade-fail-remediate-uninstall
spec:
interval: 30s
chart:
spec:
chart: podinfo
version: '>=6.0.0 <7.0.0'
sourceRef:
kind: HelmRepository
name: podinfo
interval: 10m
upgrade:
remediation:
remediateLastFailure: true
strategy: uninstall
values:
resources:
requests:
cpu: 100m
memory: 64Mi
# Make wait fail. With single replica helm doesn't actually wait, see:
# https://github.com/helm/helm/issues/5814#issuecomment-567130226
replicaCount: 2
faults:
unready: true
timeout: 3sThis test expects that the release will be installed after the uninstall remediation (this install should also fail, like the upgrade that caused the uninstall remediation).
When reading the docs for upgrade remediation, my understanding is that the configuration above should uninstall the release and leave it like that, as a terminal failure. There's nothing in the docs saying that a reconciliation should be enqueued (e.g. with Requeue: true or RequeueAfter: some-time) in case an uninstall upgrade remediation takes places, so an install can be attempted later. The current state of the code agrees with this interpretation: a terminal error is returned to controller-runtime, but the workqueue bug causes a reconciliation to be triggered, which causes an install action to take place. However, helm-controller is GA, and the current behavior, even if due to this hidden bug, is what is currently considered GA.
We have two alternatives:
- Intentionally cause the requeue in the code to maintain this behavior and document it
- Fix the e2e test and remove the expectation of an install after the uninstall remediation
Since helm-controller is GA, I suppose we should go with 1?