-
Notifications
You must be signed in to change notification settings - Fork 2
[WIP] ROX-7352 unlock pending helm actions #13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Are there any cases where a release can be pending over multiple reconciliations for a legitimate reason? Hook execution comes to mind, but it seems like the Helm client library waits and blocks on hook execution, so theoretically, a release left in pending means that the wait was interrupted. Otherwise the helm client library would have left the release in a terminal success or fail state. I'm generally hesitant to implement this sort of logic without involving the upstream helm community, because I've never seen a documented compatibility guarantee for state transition semantics. If there isn't a guarantee that the semantics won't change, its possible assumptions we make could break in the future. Related upstream issues: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was not able to find any helm documentation on what purpose pending
actually serves.
Therefore I personally would not add an auto-healing feature for problem I do not fully understand.
@@ -28,7 +28,9 @@ const ( | |||
TypeInitialized = "Initialized" | |||
TypeDeployed = "Deployed" | |||
TypeReleaseFailed = "ReleaseFailed" | |||
TypeRollbackFailed = "TypeRollbackFailed" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TypeRollbackFailed = "TypeRollbackFailed" | |
TypeRollbackFailed = "RollbackFailed" |
None of the other types starts with Type
...
const uninstallFinalizer = "uninstall-helm-release" | ||
const ( | ||
uninstallFinalizer = "uninstall-helm-release" | ||
// pendingReleaseTimeout defines the time frame after which a rollback is performed on a given release |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// pendingReleaseTimeout defines the time frame after which a rollback is performed on a given release | |
// pendingReleaseTimeout defines the time frame after which a rollback is performed on a pending release |
uninstallFinalizer = "uninstall-helm-release" | ||
// pendingReleaseTimeout defines the time frame after which a rollback is performed on a given release | ||
pendingReleaseTimeout = time.Second * 30 | ||
// pendingRetryInterval defines the time waiting until the current reconciliation finished to avoid spamming helm actions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain what you mean by "spamming helm actions"?
case statedNeedsRollback: | ||
{ | ||
if err := r.doRollback(actionClient, &u, obj); err != nil { | ||
return ctrl.Result{}, err | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the block?
case statedNeedsRollback: | |
{ | |
if err := r.doRollback(actionClient, &u, obj); err != nil { | |
return ctrl.Result{}, err | |
} | |
} | |
case statedNeedsRollback: | |
if err := r.doRollback(actionClient, &u, obj); err != nil { | |
return ctrl.Result{}, err | |
} |
//TODO(do-not-merge): as pending releases are now rolled back it is necessary to ensure that an in-progress helm release is not cancelled | ||
// and this code is only executed to ensure aborted releases are rolled back. | ||
if currentRelease.Info.Status.IsPending() { | ||
t := time.Now().Sub(currentRelease.Info.LastDeployed.Time) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably be prepared to the case when LastDeployed
is not set..
@@ -679,7 +713,7 @@ func (r *Reconciler) getReleaseState(client helmclient.ActionInterface, obj meta | |||
if err != nil { | |||
return currentRelease, stateError, err | |||
} | |||
if specRelease.Manifest != currentRelease.Manifest || | |||
if specRelease.Manifest != fmt.Sprintf("%d%s", rand.Int31(), currentRelease.Manifest) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Temporary hack, I assume?
@SimonBaeumer so the symptom (w/o this PR) is that whenever the operator crashes while it executes a Helm action, the operand will get stuck in a bad state and never heal itself? This, honestly, seems pretty bad - the operator here introduces a problem of the type that it promises to solve. A better understanding what the mechanics of this Helm feature are might indeed be required, but if my description of the symptom is correct, it's a no-brainer for me that we need to do something about it. |
Not sure I understand. Reconciliations happen through upgrades, upgrades will fail if it is pending. It cannot remain in a pending state past a successful reconciliation, so not sure what "pending over multiple reconciliations" would mean.
Yes, as the PR description says, this addresses the case when an upgrade attempt got interrupted (e.g., operator crashed).
I don't think I agree here. If I run Since operators have leader election, an operator can be certain that if it encounters a Helm release that it manages in a pending state, a rollback is necessary. It is certainly debatable whether it is a good idea to always automatically attempt a rollback (as said above, a skilled human user might first weigh additional considerations, mostly around hook executions and the possibility of someone else legitimately having triggered an upgrade that they didn't know about), but it's imho pretty clear that there are cases when it can be easily determined that a rollback is the right course of action, and it should be possible through options to instruct the operator to take that path. |
I worded this poorly. What I mean is, if one uses just the Helm CLI, are there any scenarios that the Helm CLI will exit successfully and the release is left in a pending state, such that it would be expected to run the upgrade command again? I'm fairly sure the answer is no, but not positive.
I'm not disputing that. I'm just wondering if this PR is making an invalid assumption (i.e. that anytime a release is in pending, it must have been because of an interruption). Based on the upstream issues I linked, that seems to be the case right now. But will that always be the case (in Helm v3, at least)?
With the knowledge that you had a failed upgrade, I agree that a rollback is the correct thing to do (and that's what Helm operator does today). But from the perspective of the Helm operator finding a release in the pending state, it doesn't know anything about what led to that or whether a rollback is appropriate. The operator may not even know if it is the process that made the original (failed) upgrade attempt. For example, maybe an existing CLI-created pending release was inherited by the operator when someone created a CR for it. All I'm saying is that we should go confirm our assumptions (and the stability of our assumptions relative to Helm's major releases) before we automatically rollback every release we find that's stuck in the pending state. |
Ah, gotcha. I think this answers it pretty definitely: helm/helm#7322 the pending state was introduced as a "poor person's lock" to prevent the concurrent execution of multiple Helm commands. This is no guarantee that nobody will ever add manual transaction commands (
See above. If I look at the attempts to handle
Thanks for the reference. However, this only works if the controller observed the upgrade failure immediately in the same process attempting the rollback. Isn't this already violating the mandate that a controller should be stateless? A controller that crashes after receiving the upgrade error will lose state that allows a non-crashing controller to proceed.
I think it is a reasonable assumption that a controller should assume exclusive ownership of all Helm releases it manages. The secret storing Helm release info is operand state just as all the other Kubernetes manifests, where any user-caused changes are simply wiped clean on reconciliation.
In case it crashes, it isn't the same process - and exactly that's the issue. The owner reference (and yes, I think it's important to check for that, not disagreeing here) however should guarantee that it should be the only one making changes.
Right. Regardless of pending states, the |
@joelanford to be clear, I am perfectly fine to file an issue with Helm to seek clarification. But I would first like us to agree on whether the operator may assume that it has exclusive ownership of the Helm release secret (not physically enforced, but in the sense that it may disregard modifications done by others) - because that is something that the Helm people cannot answer for us anyway. |
👍 I don't remember whether it works this way right now. If not, I agree that it should.
Yep, agreed. |
@joelanford filed an issue with Helm. Please chime in if you see the need to add something. |
@joelanford re the other issue, I looked at the code and it looks like there is no straightforward way to enforce the ownership constraint on the release secret. The
But this seems fairly brittle, tbh. |
@misberner this was superseded by #16, correct? |
@porridge yes, correct |
Currently if a Helm install/upgrade/rollback action is interrupted the Helm release status is
*-pending
and thus does not allow Helm to perform any upgrades in the future.When does a
*-pending
state occur:To avoid this behaviour this PR introduces a check on
*-pending
Helm release state. If such state is found we assume that an action is currently in progress and setstateNeedsSkip
.The operator checks the
LAST_DEPLOYED
timestamp to check against a timeout of30 seconds
. As long as theLAST_DEPLOYED < 30sec ago
the reconciliation is requeued after 5 seconds.After the timeout is reached a
statedNeedsRollback
is set and a rollback is performed to get rid of the*-pending
state.Testing: