
[WIP] ROX-7352 unlock pending helm actions #13


Closed
wants to merge 4 commits into from

Conversation

@SimonBaeumer (Contributor) commented Jun 16, 2021

Currently, if a Helm install/upgrade/rollback action is interrupted, the Helm release status stays *-pending and thus does not allow Helm to perform any upgrades in the future.

When does a *-pending state occur?

  • A user executes a helm action
  • Another operator executes a helm action
    • unlikely, as an operator should only run a single instance with leader election enabled
  • The operator exited unexpectedly while performing a helm action
  • A user interrupted a helm action

To avoid this behaviour, this PR introduces a check for the *-pending Helm release state. If such a state is found, we assume that an action is currently in progress and set stateNeedsSkip.
The operator compares the LAST_DEPLOYED timestamp against a timeout of 30 seconds. As long as LAST_DEPLOYED is less than 30 seconds ago, the reconciliation is requeued after 5 seconds.
Once the timeout is reached, statedNeedsRollback is set and a rollback is performed to get rid of the *-pending state.
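
For illustration, a minimal sketch of the decision logic described above, assuming the helm.sh/helm/v3 release types; the helper classifyPending and the pendingDecision type are made up for this sketch and merely stand in for the stateNeedsSkip / statedNeedsRollback handling in the actual reconciler (only the two constants and the LAST_DEPLOYED comparison come from the PR):

package reconciler

import (
    "time"

    "helm.sh/helm/v3/pkg/release"
)

const (
    // pendingReleaseTimeout defines the time frame after which a rollback is performed on a pending release.
    pendingReleaseTimeout = 30 * time.Second
    // pendingRetryInterval is the requeue delay used while an action may still be in progress.
    pendingRetryInterval = 5 * time.Second
)

// pendingDecision stands in for the stateNeedsSkip / statedNeedsRollback states of the reconciler.
type pendingDecision int

const (
    decisionProceed  pendingDecision = iota // release is not pending, continue as usual
    decisionRequeue                         // pending and recent: skip this reconciliation and requeue
    decisionRollback                        // pending and stale: roll back to unlock the release
)

// classifyPending compares LAST_DEPLOYED against pendingReleaseTimeout as described above.
func classifyPending(rel *release.Release, now time.Time) (pendingDecision, time.Duration) {
    if rel == nil || rel.Info == nil || !rel.Info.Status.IsPending() {
        return decisionProceed, 0
    }
    if now.Sub(rel.Info.LastDeployed.Time) < pendingReleaseTimeout {
        // An action may still be in progress; do not interfere, requeue instead.
        return decisionRequeue, pendingRetryInterval
    }
    // The pending state outlived the timeout; assume the action was interrupted.
    return decisionRollback, 0
}

On decisionRequeue the reconciler would requeue after the returned interval; on decisionRollback it would set statedNeedsRollback and perform the rollback.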

Testing:

  • Test installation is interrupted
  • Test upgrade is interrupted
  • Test rollback is interrupted
  • Automated tests

@SimonBaeumer changed the title from "ROX-7352 unlock pending helm actions" to "[WIP] ROX-7352 unlock pending helm actions" on Jun 16, 2021
@joelanford (Contributor)

Are there any cases where a release can be pending over multiple reconciliations for a legitimate reason?

Hook execution comes to mind, but it seems like the Helm client library waits and blocks on hook execution, so theoretically, a release left in pending means that the wait was interrupted. Otherwise the helm client library would have left the release in a terminal success or fail state.

I'm generally hesitant to implement this sort of logic without involving the upstream helm community, because I've never seen a documented compatibility guarantee for state transition semantics. If there isn't a guarantee that the semantics won't change, it's possible that assumptions we make could break in the future.

Related upstream issues:

@porridge (Collaborator) left a comment

I was not able to find any helm documentation on what purpose pending actually serves.
Therefore I personally would not add an auto-healing feature for a problem I do not fully understand.

@@ -28,7 +28,9 @@ const (
TypeInitialized = "Initialized"
TypeDeployed = "Deployed"
TypeReleaseFailed = "ReleaseFailed"
TypeRollbackFailed = "TypeRollbackFailed"
Collaborator

Suggested change
TypeRollbackFailed = "TypeRollbackFailed"
TypeRollbackFailed = "RollbackFailed"

None of the other types starts with Type...

const uninstallFinalizer = "uninstall-helm-release"
const (
uninstallFinalizer = "uninstall-helm-release"
// pendingReleaseTimeout defines the time frame after which a rollback is performed on a given release
Collaborator

Suggested change
// pendingReleaseTimeout defines the time frame after which a rollback is performed on a given release
// pendingReleaseTimeout defines the time frame after which a rollback is performed on a pending release

uninstallFinalizer = "uninstall-helm-release"
// pendingReleaseTimeout defines the time frame after which a rollback is performed on a given release
pendingReleaseTimeout = time.Second * 30
// pendingRetryInterval defines the time waiting until the current reconciliation finished to avoid spamming helm actions
Collaborator

Can you explain what you mean by "spamming helm actions"?

Comment on lines +571 to +576
case statedNeedsRollback:
{
if err := r.doRollback(actionClient, &u, obj); err != nil {
return ctrl.Result{}, err
}
}
Collaborator

Do we need the block?

Suggested change
case statedNeedsRollback:
{
if err := r.doRollback(actionClient, &u, obj); err != nil {
return ctrl.Result{}, err
}
}
case statedNeedsRollback:
if err := r.doRollback(actionClient, &u, obj); err != nil {
return ctrl.Result{}, err
}

//TODO(do-not-merge): as pending releases are now rolled back it is necessary to ensure that an in-progress helm release is not cancelled
// and this code is only executed to ensure aborted releases are rolled back.
if currentRelease.Info.Status.IsPending() {
t := time.Now().Sub(currentRelease.Info.LastDeployed.Time)
Collaborator

We should probably be prepared for the case when LastDeployed is not set.
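
One possible guard, sketched as an addition to the classifyPending example shown under the PR description (same package and imports); pendingAge and its fallback behaviour are assumptions, not part of the PR:

// pendingAge tolerates an unset LastDeployed: if no deploy timestamp was recorded,
// report the release as already past the timeout so a stuck release can still be
// unlocked. (Falling back to requeueing instead would be just as defensible.)
func pendingAge(rel *release.Release, now time.Time) time.Duration {
    last := rel.Info.LastDeployed.Time
    if last.IsZero() {
        return pendingReleaseTimeout
    }
    return now.Sub(last)
}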

@@ -679,7 +713,7 @@ func (r *Reconciler) getReleaseState(client helmclient.ActionInterface, obj meta
if err != nil {
return currentRelease, stateError, err
}
if specRelease.Manifest != currentRelease.Manifest ||
if specRelease.Manifest != fmt.Sprintf("%d%s", rand.Int31(), currentRelease.Manifest) ||
Collaborator

Temporary hack, I assume?

@misberner (Contributor)

> I was not able to find any helm documentation on what purpose pending actually serves.
> Therefore I personally would not add an auto-healing feature for a problem I do not fully understand.

@SimonBaeumer so the symptom (w/o this PR) is that whenever the operator crashes while it executes a Helm action, the operand will get stuck in a bad state and never heal itself? This, honestly, seems pretty bad - the operator here introduces a problem of the very type it promises to solve. A better understanding of the mechanics of this Helm feature might indeed be required, but if my description of the symptom is correct, it's a no-brainer for me that we need to do something about it.

@misberner (Contributor)

> Are there any cases where a release can be pending over multiple reconciliations for a legitimate reason?

Not sure I understand. Reconciliations happen through upgrades, and upgrades will fail if the release is pending. It cannot remain in a pending state past a successful reconciliation, so I'm not sure what "pending over multiple reconciliations" would mean.

> Hook execution comes to mind, but it seems like the Helm client library waits and blocks on hook execution, so theoretically, a release left in pending means that the wait was interrupted. Otherwise the helm client library would have left the release in a terminal success or fail state.

Yes, as the PR description says, this addresses the case when an upgrade attempt got interrupted (e.g., operator crashed).

> I'm generally hesitant to implement this sort of logic without involving the upstream helm community, because I've never seen a documented compatibility guarantee for state transition semantics. If there isn't a guarantee that the semantics won't change, it's possible that assumptions we make could break in the future.

I don't think I agree here. If I run helm upgrade and my machine then crashes, I need to do a helm rollback. I think this is pretty much the official guidance on what to do in such cases (obviously making sure that the effects of any pending hooks are reverted/neutralized as well). And I think an operator should mimic what a skilled human user would do to operate the operand application.

Since operators have leader election, an operator can be certain that if it encounters a Helm release that it manages in a pending state, a rollback is necessary. It is certainly debatable whether it is a good idea to always automatically attempt a rollback (as said above, a skilled human user might first weigh additional considerations, mostly around hook executions and the possibility of someone else legitimately having triggered an upgrade that they didn't know about), but it's imho pretty clear that there are cases when it can be easily determined that a rollback is the right course of action, and it should be possible through options to instruct the operator to take that path.

@joelanford (Contributor)

> Not sure I understand. Reconciliations happen through upgrades, and upgrades will fail if the release is pending. It cannot remain in a pending state past a successful reconciliation, so I'm not sure what "pending over multiple reconciliations" would mean.

I worded this poorly. What I mean is: if one uses just the Helm CLI, are there any scenarios in which the Helm CLI will exit successfully with the release left in a pending state, such that one would be expected to run the upgrade command again? I'm fairly sure the answer is no, but not positive.

> Yes, as the PR description says, this addresses the case when an upgrade attempt got interrupted (e.g., operator crashed).

I'm not disputing that. I'm just wondering if this PR is making an invalid assumption (i.e. that anytime a release is in pending, it must have been because of an interruption). Based on the upstream issues I linked, that seems to be the case right now. But will that always be the case (in Helm v3, at least)?

> If I run helm upgrade and my machine then crashes, I need to do a helm rollback.

With the knowledge that you had a failed upgrade, I agree that a rollback is the correct thing to do (and that's what Helm operator does today).

But from the perspective of the Helm operator finding a release in the pending state, it doesn't know anything about what led to that or whether a rollback is appropriate. The operator may not even know if it is the process that made the original (failed) upgrade attempt. For example, maybe an existing CLI-created pending release was inherited by the operator when someone created a CR for it.

All I'm saying is that we should go confirm our assumptions (and the stability of our assumptions relative to Helm's major releases) before we automatically rollback every release we find that's stuck in the pending state.

@misberner (Contributor)

> I worded this poorly. What I mean is: if one uses just the Helm CLI, are there any scenarios in which the Helm CLI will exit successfully with the release left in a pending state, such that one would be expected to run the upgrade command again? I'm fairly sure the answer is no, but not positive.

Ah, gotcha. I think helm/helm#7322 answers this pretty definitively: the pending state was introduced as a "poor person's lock" to prevent the concurrent execution of multiple Helm commands.

This is no guarantee that nobody will ever add manual transaction commands (helm start-upgrade ; ... helm commit-upgrade), but I am not sure if getting such a guarantee is feasible. However, given that helm-operator chooses its helm version, I do not see such future changes as a major concern.

> I'm not disputing that. I'm just wondering if this PR is making an invalid assumption (i.e. that anytime a release is in pending, it must have been because of an interruption). Based on the upstream issues I linked, that seems to be the case right now. But will that always be the case (in Helm v3, at least)?

See above. If I look at the attempts to handle SIGTERM properly, there certainly isn't a super straightforward way to clean this up. However, again, any such change would IMHO break the human user's expectations as well, so I am not sure it is something that can meaningfully be guarded against. That sort of clear specification of the mechanics does not seem to exist for Helm; but given that the operator always needs to fix a specific Helm client version, I am not sure that's too relevant.

> With the knowledge that you had a failed upgrade, I agree that a rollback is the correct thing to do (and that's what Helm operator does today).

Thanks for the reference. However, this only works if the controller observed the upgrade failure immediately, in the same process that attempts the rollback. Isn't this already violating the mandate that a controller should be stateless? A controller that crashes after receiving the upgrade error loses the state that allows a non-crashing controller to proceed.

> But from the perspective of the Helm operator finding a release in the pending state, it doesn't know anything about what led to that or whether a rollback is appropriate.

I think it is a reasonable assumption that a controller should assume exclusive ownership of all Helm releases it manages. The secret storing the Helm release info is operand state just like all the other Kubernetes manifests, where any user-caused changes are simply wiped clean on reconciliation.

> The operator may not even know if it is the process that made the original (failed) upgrade attempt.

If it crashes, it isn't the same process - and that is exactly the issue. The owner reference (and yes, I think it's important to check for that, not disagreeing here) should, however, guarantee that it is the only one making changes.

> For example, maybe an existing CLI-created pending release was inherited by the operator when someone created a CR for it.

Right. Regardless of pending states, the helm-operator should not attempt an upgrade on a pre-existing Helm release that does not indicate operator ownership - i.e., an upgrade may only be performed on a release that carries the reference, and a reference may never be added after the fact, but only in the process of an initial install (which must fail if the release secret is pre-existing). If this is already implemented this way, then your failure scenario cannot occur (assuming we have the owner reference check). If it is not, this seems to be an orthogonal issue.
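
To make the rule concrete, here is a hypothetical sketch of such an ownership check (the function name, signature and error text are illustrative, not helm-operator code):

package reconciler

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
)

// checkReleaseOwnership enforces the constraint described above: a pre-existing
// release secret may only be acted upon if it already carries a controller
// reference to the CR; references are never added after the fact.
func checkReleaseOwnership(releaseSecret *corev1.Secret, crUID types.UID) error {
    if releaseSecret == nil {
        // No pre-existing release: only an initial install may proceed, and it
        // sets the controller reference as part of creating the secret.
        return nil
    }
    owner := metav1.GetControllerOf(releaseSecret)
    if owner == nil || owner.UID != crUID {
        return fmt.Errorf("release secret %s/%s exists but is not controlled by this CR; refusing to adopt it",
            releaseSecret.Namespace, releaseSecret.Name)
    }
    return nil
}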

@misberner (Contributor)

@joelanford to be clear, I am perfectly fine with filing an issue with Helm to seek clarification. But I would first like us to agree on whether the operator may assume that it has exclusive ownership of the Helm release secret (not physically enforced, but in the sense that it may disregard modifications done by others) - because that is something that the Helm people cannot answer for us anyway.

@joelanford (Contributor)

> an upgrade may only be performed on a release that carries the reference, and a reference may never be added after the fact, but only in the process of an initial install (which must fail if the release secret is pre-existing). If this is already implemented this way, then your failure scenario cannot occur (assuming we have the owner reference check). If it is not, this seems to be an orthogonal issue.

👍

I don't remember whether it works this way right now. If not, I agree that it should.

> But I would first like us to agree on whether the operator may assume that it has exclusive ownership of the Helm release secret.

Yep, agreed.

@misberner (Contributor)

@joelanford I filed an issue with Helm. Please chime in if you see the need to add something.

@misberner (Contributor)

@joelanford re the other issue, I looked at the code and it looks like there is no straightforward way to enforce the ownership constraint on the release secret. The Secrets storage driver in Helm is not implemented in a way that would allow the conditional modifications or deletions needed to provide strong guarantees about the object state being manipulated. The only possibility I see would be to create an implementation of corev1.SecretInterface that realizes, e.g., a deletion by doing

  • get the resource by name
  • verify that the controller ref exists (error out if it doesn't)
  • perform a delete with the UID and ResourceVersion fields in opts.precondition set accordingly (and retry on conflicts)

But this seems fairly brittle, tbh.
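
For what it's worth, a rough sketch of the Delete path of such a wrapper, assuming a standard client-go typed SecretInterface (type and field names are made up; only the get / verify controller ref / precondition-delete-with-retry flow comes from the list above):

package secretguard

import (
    "context"
    "fmt"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    clientcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
)

// ownedSecrets wraps a typed SecretInterface so that deletions only succeed on
// secrets controlled by the expected owner, pinned via delete preconditions.
type ownedSecrets struct {
    clientcorev1.SecretInterface
    ownerUID types.UID // UID of the CR that must control the release secret
}

// Delete implements the "get, verify controller ref, delete with preconditions" flow.
func (s *ownedSecrets) Delete(ctx context.Context, name string, opts metav1.DeleteOptions) error {
    for {
        secret, err := s.SecretInterface.Get(ctx, name, metav1.GetOptions{})
        if err != nil {
            return err
        }
        owner := metav1.GetControllerOf(secret)
        if owner == nil || owner.UID != s.ownerUID {
            return fmt.Errorf("refusing to delete secret %q: not controlled by the expected owner", name)
        }
        // Pin the exact object version we inspected so that a concurrent change
        // makes the delete fail with a conflict instead of silently succeeding.
        opts.Preconditions = &metav1.Preconditions{
            UID:             &secret.UID,
            ResourceVersion: &secret.ResourceVersion,
        }
        err = s.SecretInterface.Delete(ctx, name, opts)
        if apierrors.IsConflict(err) {
            continue // the secret changed between Get and Delete; re-check ownership
        }
        return err
    }
}

As noted above, this is fairly brittle, and other operations (e.g. updates) would need similar treatment.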

@misberner closed this Aug 6, 2021
@porridge (Collaborator) commented Nov 9, 2021

@misberner this was superseded by #16, correct?

@misberner (Contributor)

@porridge yes, correct
