
[WIP] ROX-7352 unlock pending helm actions #13


Closed
wants to merge 4 commits into from

Conversation

@SimonBaeumer (Contributor) commented Jun 16, 2021

Currently, if a Helm install/upgrade/rollback action is interrupted, the Helm release status stays *-pending and thus does not allow Helm to perform any upgrades in the future.

When does a *-pending state occur?

  • A user executes a helm action
  • Another operator executes a helm action
    • unlikely, as an operator should only run a single instance with leader election enabled
  • The operator exited unexpectedly while performing a helm action
  • A user interrupted a helm action

To avoid this behaviour, this PR introduces a check for the *-pending Helm release state. If such a state is found, we assume that an action is currently in progress and set stateNeedsSkip.
The operator compares the LAST_DEPLOYED timestamp against a timeout of 30 seconds. As long as LAST_DEPLOYED is less than 30 seconds ago, the reconciliation is requeued after 5 seconds.
Once the timeout is reached, statedNeedsRollback is set and a rollback is performed to get rid of the *-pending state.
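
For illustration, a minimal sketch of the decision logic described above, assuming the helm.sh/helm/v3 release types; the helper classifyPending and the pendingDecision type are made up for this sketch and merely stand in for the stateNeedsSkip / statedNeedsRollback handling in the actual reconciler (only the two constants and the LAST_DEPLOYED comparison come from the PR):

package reconciler

import (
    "time"

    "helm.sh/helm/v3/pkg/release"
)

const (
    // pendingReleaseTimeout defines the time frame after which a rollback is performed on a pending release.
    pendingReleaseTimeout = 30 * time.Second
    // pendingRetryInterval is the requeue delay used while an action may still be in progress.
    pendingRetryInterval = 5 * time.Second
)

// pendingDecision stands in for the stateNeedsSkip / statedNeedsRollback states of the reconciler.
type pendingDecision int

const (
    decisionProceed  pendingDecision = iota // release is not pending, continue as usual
    decisionRequeue                         // pending and recent: skip this reconciliation and requeue
    decisionRollback                        // pending and stale: roll back to unlock the release
)

// classifyPending compares LAST_DEPLOYED against pendingReleaseTimeout as described above.
func classifyPending(rel *release.Release, now time.Time) (pendingDecision, time.Duration) {
    if rel == nil || rel.Info == nil || !rel.Info.Status.IsPending() {
        return decisionProceed, 0
    }
    if now.Sub(rel.Info.LastDeployed.Time) < pendingReleaseTimeout {
        // An action may still be in progress; do not interfere, requeue instead.
        return decisionRequeue, pendingRetryInterval
    }
    // The pending state outlived the timeout; assume the action was interrupted.
    return decisionRollback, 0
}

On decisionRequeue the reconciler would requeue after the returned interval; on decisionRollback it would set statedNeedsRollback and perform the rollback.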

Testing:

  • Test installation is interrupted
  • Test upgrade is interrupted
  • Test rollback is interrupted
  • Automated tests

@SimonBaeumer changed the title from "ROX-7352 unlock pending helm actions" to "[WIP] ROX-7352 unlock pending helm actions" on Jun 16, 2021
@joelanford (Contributor)

Are there any cases where a release can be pending over multiple reconciliations for a legitimate reason?

Hook execution comes to mind, but it seems like the Helm client library waits and blocks on hook execution, so theoretically, a release left in pending means that the wait was interrupted. Otherwise the helm client library would have left the release in a terminal success or fail state.

I'm generally hesitant to implement this sort of logic without involving the upstream helm community, because I've never seen a documented compatibility guarantee for state transition semantics. If there isn't a guarantee that the semantics won't change, it's possible that assumptions we make could break in the future.

Related upstream issues:

@porridge (Collaborator) left a comment

I was not able to find any helm documentation on what purpose pending actually serves.
Therefore I personally would not add an auto-healing feature for a problem I do not fully understand.

@@ -28,7 +28,9 @@ const (
TypeInitialized = "Initialized"
TypeDeployed = "Deployed"
TypeReleaseFailed = "ReleaseFailed"
TypeRollbackFailed = "TypeRollbackFailed"
Collaborator

Suggested change
TypeRollbackFailed = "TypeRollbackFailed"
TypeRollbackFailed = "RollbackFailed"

None of the other types starts with Type...

const uninstallFinalizer = "uninstall-helm-release"
const (
uninstallFinalizer = "uninstall-helm-release"
// pendingReleaseTimeout defines the time frame after which a rollback is performed on a given release
Collaborator

Suggested change
// pendingReleaseTimeout defines the time frame after which a rollback is performed on a given release
// pendingReleaseTimeout defines the time frame after which a rollback is performed on a pending release

uninstallFinalizer = "uninstall-helm-release"
// pendingReleaseTimeout defines the time frame after which a rollback is performed on a given release
pendingReleaseTimeout = time.Second * 30
// pendingRetryInterval defines the time waiting until the current reconciliation finished to avoid spamming helm actions
Collaborator

Can you explain what you mean by "spamming helm actions"?

Comment on lines +571 to +576
case statedNeedsRollback:
{
if err := r.doRollback(actionClient, &u, obj); err != nil {
return ctrl.Result{}, err
}
}
Collaborator

Do we need the block?

Suggested change
case statedNeedsRollback:
{
if err := r.doRollback(actionClient, &u, obj); err != nil {
return ctrl.Result{}, err
}
}
case statedNeedsRollback:
if err := r.doRollback(actionClient, &u, obj); err != nil {
return ctrl.Result{}, err
}

//TODO(do-not-merge): as pending releases are now rolled back it is necessary to ensure that an in-progress helm release is not cancelled
// and this code is only executed to ensure aborted releases are rolled back.
if currentRelease.Info.Status.IsPending() {
t := time.Now().Sub(currentRelease.Info.LastDeployed.Time)
Collaborator

We should probably be prepared for the case when LastDeployed is not set.
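
One possible guard, sketched as an addition to the classifyPending example shown under the PR description (same package and imports); pendingAge and its fallback behaviour are assumptions, not part of the PR:

// pendingAge tolerates an unset LastDeployed: if no deploy timestamp was recorded,
// report the release as already past the timeout so a stuck release can still be
// unlocked. (Falling back to requeueing instead would be just as defensible.)
func pendingAge(rel *release.Release, now time.Time) time.Duration {
    last := rel.Info.LastDeployed.Time
    if last.IsZero() {
        return pendingReleaseTimeout
    }
    return now.Sub(last)
}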

@@ -679,7 +713,7 @@ func (r *Reconciler) getReleaseState(client helmclient.ActionInterface, obj meta
if err != nil {
return currentRelease, stateError, err
}
if specRelease.Manifest != currentRelease.Manifest ||
if specRelease.Manifest != fmt.Sprintf("%d%s", rand.Int31(), currentRelease.Manifest) ||
Collaborator

Temporary hack, I assume?

@misberner (Contributor)

> I was not able to find any helm documentation on what purpose pending actually serves.
> Therefore I personally would not add an auto-healing feature for a problem I do not fully understand.

@SimonBaeumer so the symptom (w/o this PR) is that whenever the operator crashes while it executes a Helm action, the operand will get stuck in a bad state and never heal itself? This, honestly, seems pretty bad - the operator here introduces a problem of the very type it promises to solve. A better understanding of the mechanics of this Helm feature might indeed be required, but if my description of the symptom is correct, it's a no-brainer for me that we need to do something about it.

@misberner (Contributor)

> Are there any cases where a release can be pending over multiple reconciliations for a legitimate reason?

Not sure I understand. Reconciliations happen through upgrades, and upgrades will fail if the release is pending. It cannot remain in a pending state past a successful reconciliation, so I'm not sure what "pending over multiple reconciliations" would mean.

> Hook execution comes to mind, but it seems like the Helm client library waits and blocks on hook execution, so theoretically, a release left in pending means that the wait was interrupted. Otherwise the helm client library would have left the release in a terminal success or fail state.

Yes, as the PR description says, this addresses the case when an upgrade attempt got interrupted (e.g., operator crashed).

> I'm generally hesitant to implement this sort of logic without involving the upstream helm community, because I've never seen a documented compatibility guarantee for state transition semantics. If there isn't a guarantee that the semantics won't change, it's possible that assumptions we make could break in the future.

I don't think I agree here. If I run helm upgrade and my machine then crashes, I need to do a helm rollback. I think this is pretty much the official guidance on what to do in such cases (obviously making sure that the effects of any pending hooks are reverted/neutralized as well). And I think an operator should mimic what a skilled human user would do to operate the operand application.

Since operators have leader election, an operator can be certain that if it encounters a Helm release that it manages in a pending state, a rollback is necessary. It is certainly debatable whether it is a good idea to always automatically attempt a rollback (as said above, a skilled human user might first weigh additional considerations, mostly around hook executions and the possibility of someone else legitimately having triggered an upgrade that they didn't know about), but it's imho pretty clear that there are cases when it can be easily determined that a rollback is the right course of action, and it should be possible through options to instruct the operator to take that path.

@joelanford (Contributor)

> Not sure I understand. Reconciliations happen through upgrades, and upgrades will fail if the release is pending. It cannot remain in a pending state past a successful reconciliation, so I'm not sure what "pending over multiple reconciliations" would mean.

I worded this poorly. What I mean is: if one uses just the Helm CLI, are there any scenarios in which the Helm CLI will exit successfully with the release left in a pending state, such that one would be expected to run the upgrade command again? I'm fairly sure the answer is no, but not positive.

> Yes, as the PR description says, this addresses the case when an upgrade attempt got interrupted (e.g., operator crashed).

I'm not disputing that. I'm just wondering if this PR is making an invalid assumption (i.e. that anytime a release is in pending, it must have been because of an interruption). Based on the upstream issues I linked, that seems to be the case right now. But will that always be the case (in Helm v3, at least)?

> If I run helm upgrade and my machine then crashes, I need to do a helm rollback.

With the knowledge that you had a failed upgrade, I agree that a rollback is the correct thing to do (and that's what Helm operator does today).

But from the perspective of the Helm operator finding a release in the pending state, it doesn't know anything about what led to that or whether a rollback is appropriate. The operator may not even know if it is the process that made the original (failed) upgrade attempt. For example, maybe an existing CLI-created pending release was inherited by the operator when someone created a CR for it.

All I'm saying is that we should go confirm our assumptions (and the stability of our assumptions relative to Helm's major releases) before we automatically rollback every release we find that's stuck in the pending state.

@misberner (Contributor)

> I worded this poorly. What I mean is: if one uses just the Helm CLI, are there any scenarios in which the Helm CLI will exit successfully with the release left in a pending state, such that one would be expected to run the upgrade command again? I'm fairly sure the answer is no, but not positive.

Ah, gotcha. I think helm/helm#7322 answers this pretty definitively: the pending state was introduced as a "poor person's lock" to prevent the concurrent execution of multiple Helm commands.

This is no guarantee that nobody will ever add manual transaction commands (helm start-upgrade ; ... helm commit-upgrade), but I am not sure if getting such a guarantee is feasible. However, given that helm-operator chooses its helm version, I do not see such future changes as a major concern.

> I'm not disputing that. I'm just wondering if this PR is making an invalid assumption (i.e. that anytime a release is in pending, it must have been because of an interruption). Based on the upstream issues I linked, that seems to be the case right now. But will that always be the case (in Helm v3, at least)?

See above. If I look at the attempts to handle SIGTERM properly, there certainly isn't a super straightforward way to clean this up. However, again, any such change would IMHO break the human user's expectations as well, so I am not sure it is something that can meaningfully be guarded against. That sort of clear specification of the mechanics does not seem to exist for Helm; but given that the operator always needs to fix a specific Helm client version, I am not sure that's too relevant.

> With the knowledge that you had a failed upgrade, I agree that a rollback is the correct thing to do (and that's what Helm operator does today).

Thanks for the reference. However, this only works if the controller observed the upgrade failure immediately, in the same process that attempts the rollback. Isn't this already violating the mandate that a controller should be stateless? A controller that crashes after receiving the upgrade error loses the state that allows a non-crashing controller to proceed.

> But from the perspective of the Helm operator finding a release in the pending state, it doesn't know anything about what led to that or whether a rollback is appropriate.

I think it is a reasonable assumption that a controller should assume exclusive ownership of all Helm releases it manages. The secret storing the Helm release info is operand state just like all the other Kubernetes manifests, where any user-caused changes are simply wiped clean on reconciliation.

> The operator may not even know if it is the process that made the original (failed) upgrade attempt.

If it crashes, it isn't the same process - and that is exactly the issue. The owner reference (and yes, I think it's important to check for that, not disagreeing here) should, however, guarantee that it is the only one making changes.

> For example, maybe an existing CLI-created pending release was inherited by the operator when someone created a CR for it.

Right. Regardless of pending states, the helm-operator should not attempt an upgrade on a pre-existing Helm release that does not indicate operator ownership - i.e., an upgrade may only be performed on a release that carries the reference, and a reference may never be added after the fact, but only in the process of an initial install (which must fail if the release secret is pre-existing). If this is already implemented this way, then your failure scenario cannot occur (assuming we have the owner reference check). If it is not, this seems to be an orthogonal issue.
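
To make the rule concrete, here is a hypothetical sketch of such an ownership check (the function name, signature and error text are illustrative, not helm-operator code):

package reconciler

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
)

// checkReleaseOwnership enforces the constraint described above: a pre-existing
// release secret may only be acted upon if it already carries a controller
// reference to the CR; references are never added after the fact.
func checkReleaseOwnership(releaseSecret *corev1.Secret, crUID types.UID) error {
    if releaseSecret == nil {
        // No pre-existing release: only an initial install may proceed, and it
        // sets the controller reference as part of creating the secret.
        return nil
    }
    owner := metav1.GetControllerOf(releaseSecret)
    if owner == nil || owner.UID != crUID {
        return fmt.Errorf("release secret %s/%s exists but is not controlled by this CR; refusing to adopt it",
            releaseSecret.Namespace, releaseSecret.Name)
    }
    return nil
}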

@misberner (Contributor)

@joelanford to be clear, I am perfectly fine with filing an issue with Helm to seek clarification. But I would first like us to agree on whether the operator may assume that it has exclusive ownership of the Helm release secret (not physically enforced, but in the sense that it may disregard modifications done by others) - because that is something that the Helm people cannot answer for us anyway.

@joelanford (Contributor)

> an upgrade may only be performed on a release that carries the reference, and a reference may never be added after the fact, but only in the process of an initial install (which must fail if the release secret is pre-existing). If this is already implemented this way, then your failure scenario cannot occur (assuming we have the owner reference check). If it is not, this seems to be an orthogonal issue.

👍

I don't remember whether it works this way right now. If not, I agree that it should.

> But I would first like us to agree on whether the operator may assume that it has exclusive ownership of the Helm release secret.

Yep, agreed.

@misberner (Contributor)

@joelanford I filed an issue with Helm. Please chime in if you see the need to add something.

@misberner (Contributor)

@joelanford re the other issue, I looked at the code and it looks like there is no straightforward way to enforce the ownership constraint on the release secret. The Secrets storage driver in Helm is not implemented in a way that would allow the conditional modifications or deletions needed to provide strong guarantees about the object state being manipulated. The only possibility I see would be to create an implementation of corev1.SecretInterface that realizes, e.g., a deletion by doing

  • get the resource by name
  • verify that the controller ref exists (error out if it doesn't)
  • perform a delete with the UID and ResourceVersion fields in opts.precondition set accordingly (and retry on conflicts)

But this seems fairly brittle, tbh.
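
For what it's worth, a rough sketch of the Delete path of such a wrapper, assuming a standard client-go typed SecretInterface (type and field names are made up; only the get / verify controller ref / precondition-delete-with-retry flow comes from the list above):

package secretguard

import (
    "context"
    "fmt"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    clientcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
)

// ownedSecrets wraps a typed SecretInterface so that deletions only succeed on
// secrets controlled by the expected owner, pinned via delete preconditions.
type ownedSecrets struct {
    clientcorev1.SecretInterface
    ownerUID types.UID // UID of the CR that must control the release secret
}

// Delete implements the "get, verify controller ref, delete with preconditions" flow.
func (s *ownedSecrets) Delete(ctx context.Context, name string, opts metav1.DeleteOptions) error {
    for {
        secret, err := s.SecretInterface.Get(ctx, name, metav1.GetOptions{})
        if err != nil {
            return err
        }
        owner := metav1.GetControllerOf(secret)
        if owner == nil || owner.UID != s.ownerUID {
            return fmt.Errorf("refusing to delete secret %q: not controlled by the expected owner", name)
        }
        // Pin the exact object version we inspected so that a concurrent change
        // makes the delete fail with a conflict instead of silently succeeding.
        opts.Preconditions = &metav1.Preconditions{
            UID:             &secret.UID,
            ResourceVersion: &secret.ResourceVersion,
        }
        err = s.SecretInterface.Delete(ctx, name, opts)
        if apierrors.IsConflict(err) {
            continue // the secret changed between Get and Delete; re-check ownership
        }
        return err
    }
}

As noted above, this is fairly brittle, and other operations (e.g. updates) would need similar treatment.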

@misberner closed this Aug 6, 2021
@porridge (Collaborator) commented Nov 9, 2021

@misberner this was superseded by #16, correct?

@misberner (Contributor)

@porridge yes, correct
