-
Notifications
You must be signed in to change notification settings - Fork 3.5k
fix: correct retry logic #13734
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: correct retry logic #13734
Conversation
|
This might supersede #12105 |
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
abe5b8f to
57e7638
Compare
Signed-off-by: isubasinghe <isitha@pipekit.io>
| func TestNestedDAG(t *testing.T) { | ||
| require := require.New(t) | ||
| wf := wfv1.MustUnmarshalWorkflow(nestedDAG) | ||
|
|
||
| newWf, podsToDelete, err := FormulateRetryWorkflow(context.Background(), wf, true, "id=dag-nested-zxlc2-744943701", []string{}) | ||
| require.NoError(err) | ||
| _ = newWf | ||
| _ = podsToDelete | ||
|
|
||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should actually test for individual node status as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree.
| ` | ||
|
|
||
| func TestStepsRetryWorkflow(t *testing.T) { | ||
| assert := assert.New(t) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you do this for assert and require here and elsewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I was waiting for an opinion on #13734 (comment). I plan to use these when I fill in the tests.
I was unsure if it was worth the effort of these tests. I wrote these to test edge conditions and make sure they worked as expected. They will prevent regressions though, so I will add them.
| wfClient := argofake.NewSimpleClientset().ArgoprojV1alpha1().Workflows("my-ns") | ||
| createdTime := metav1.Time{Time: time.Now().Add(-1 * time.Second).UTC()} | ||
| finishedTime := metav1.Time{Time: createdTime.Add(time.Second * 2)} | ||
| t.Run("Steps", func(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dropping this entire test feels wrong, why can't it be checked still?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not really testable easily, we cannot establish a dag for these status fields. The root node is not present.
This will error out complaining that the root node couldn't be found.
| func TestNestedDAG(t *testing.T) { | ||
| require := require.New(t) | ||
| wf := wfv1.MustUnmarshalWorkflow(nestedDAG) | ||
|
|
||
| newWf, podsToDelete, err := FormulateRetryWorkflow(context.Background(), wf, true, "id=dag-nested-zxlc2-744943701", []string{}) | ||
| require.NoError(err) | ||
| _ = newWf | ||
| _ = podsToDelete | ||
|
|
||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree.
workflow/util/util.go
Outdated
|
|
||
| func isGroupNode(node wfv1.NodeStatus) bool { | ||
| return node.Type == wfv1.NodeTypeDAG || node.Type == wfv1.NodeTypeTaskGroup || node.Type == wfv1.NodeTypeStepGroup || node.Type == wfv1.NodeTypeSteps | ||
| func consumeTill(n *node, should tillFn, resetFunc resetFn) (*node, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Rename to resetUntil
reset instead of consume, as you call the resetFunc. Or call that consumeFunc
till has more meanings, until is much clearer if English isn't your first language.
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
workflow/util/util.go
Outdated
| default: | ||
| // Do not allow retry of workflows with pods in Running/Pending phase | ||
| return nil, nil, errors.InternalErrorf("Workflow cannot be retried with node %s in %s phase", node.Name, node.Phase) | ||
| deleteNodes, err := getNodeIDsToResetNoChildren(restartSuccessful, nodeFieldSelector, wf.Status.Nodes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would I have to set the --restart-successful in order to identify the nodes that need to be deleted, even if I only wish to retry a specific failed node?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No this wouldn't be the case
Joibel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
workflow/util/util.go
Outdated
| addToDelete(curr.n.ID, true) | ||
| children := getChildren(curr) | ||
| for childID := range children { | ||
| addToDelete(childID, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I follow and agree now. Thanks.
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
|
Could this PR be cherry-picked into the 3.6.5 branch? Some customers have inquired about it, but I haven't found any prior discussions related to the 3.6.x patches. Thanks @Joibel |
|
/cherry-pick release-3.6 |
Signed-off-by: isubasinghe <isitha@pipekit.io> (cherry picked from commit 497f338)
Signed-off-by: isubasinghe <isitha@pipekit.io> (cherry picked from commit 497f338) Signed-off-by: Alan Clucas <alan@clucas.org>
Signed-off-by: isubasinghe <isitha@pipekit.io> (cherry picked from commit 497f338) Signed-off-by: Alan Clucas <alan@clucas.org>
Signed-off-by: isubasinghe <isitha@pipekit.io> (cherry picked from commit 497f338) Signed-off-by: Alan Clucas <alan@clucas.org>
Signed-off-by: isubasinghe <isitha@pipekit.io> Signed-off-by: Alan Clucas <alan@clucas.org> Co-authored-by: Isitha Subasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io> (cherry picked from commit 497f338) Signed-off-by: Alan Clucas <alan@clucas.org>
…abilities fixes (Cr 28355) (#358) * fix: bump deps for k8schain to fix ecr-login (argoproj#14008) (release-3.6 cherry-pick) (argoproj#14174) * fix(ci): python sdk release process (release-3.6) (argoproj#14183) Signed-off-by: Alan Clucas <alan@clucas.org> * docs: clarify qps/burst on controller (cherry-pick argoproj#14190) (argoproj#14192) Signed-off-by: Tim Collins <tim@thecollins.team> Co-authored-by: Tim Collins <45351296+tico24@users.noreply.github.com> * fix(api/jsonschema): use unchanging JSON Schema version (cherry-pick argoproj#14092) (argoproj#14256) Signed-off-by: Roger Peppe <rogpeppe@gmail.com> Co-authored-by: Roger Peppe <rogpeppe@gmail.com> * fix(api/jsonschema): use working `$id` (cherry-pick argoproj#14257) (argoproj#14258) Signed-off-by: Roger Peppe <rogpeppe@gmail.com> Co-authored-by: Roger Peppe <rogpeppe@gmail.com> * docs: autogenerate tested k8s versions and centralize config (argoproj#14176) (release-3.6) (argoproj#14262) Signed-off-by: Mason Malone <651224+MasonM@users.noreply.github.com> Signed-off-by: Alan Clucas <alan@clucas.org> Co-authored-by: Mason Malone <651224+MasonM@users.noreply.github.com> * chore(deps): bump minio-go to newer version (argoproj#14185) (release-3.6) (argoproj#14261) Co-authored-by: Vaibhav Kaushik <vaibhavkaushik@salesforce.com> * fix: split pod controller from workflow controller (argoproj#14129) (release-3.6) (argoproj#14263) * chore(deps): fix snyk (argoproj#14264) (release-3.6) (argoproj#14268) * chore: revert to correct k8s versions Accidental bump from argoproj#14176 cherry-pick Signed-off-by: Alan Clucas <alan@clucas.org> * chore(deps): bump github.com/go-jose/go-jose/v3 from 3.0.3 to 3.0.4 in the go_modules group (cherry-pick argoproj#14231) (argoproj#14269) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * fix: wait for workflow informer to sync before pod informer (cherry-pick argoproj#14248) (argoproj#14266) Signed-off-by: Rohan K <rohankmr414@gmail.com> Co-authored-by: Rohan K <rohankmr414@gmail.com> * fix(cli): remove red from log colour selection. Fixes argoproj#6740 (cherry-pick argoproj#14215) (argoproj#14278) Signed-off-by: Prabakaran Kumaresshan <4676330+nixphix@users.noreply.github.com> Co-authored-by: Prabakaran Kumaresshan <4676330+nixphix@users.noreply.github.com> * fix: correct semaphore configmap keys for multiple semaphores (argoproj#14184) (release-3.6) (argoproj#14281) * fix: don't print help for non-validation errors. Fixes argoproj#14234 (cherry-pick argoproj#14249) (argoproj#14283) Signed-off-by: Koichi Shimada <jumpe1programming@gmail.com> Signed-off-by: Mason Malone <651224+MasonM@users.noreply.github.com> Co-authored-by: koichi <51446844+jumpe1@users.noreply.github.com> Co-authored-by: Mason Malone <651224+MasonM@users.noreply.github.com> * docs: fix kubernetes versions (release-3.6) (argoproj#14273) Signed-off-by: Alan Clucas <alan@clucas.org> * fix(workflow/sync): use RWMutex to prevent concurrent map access (cherry-pick argoproj#14321) (argoproj#14322) Signed-off-by: Ryan Currah <ryan@currah.ca> Co-authored-by: Ryan Currah <ryan@currah.ca> * chore(lint): update golangci-lint to 2.1.1 (argoproj#14390) (cherry-pick release-3.6) (argoproj#14417) * chore: bump golang 1.23->1.24 (argoproj#14385) (cherry-pick release-3.6) (argoproj#14418) * fix: gracefully handle invalid CronWorkflows and simplify logic. (cherry-pick argoproj#14197) (argoproj#14419) Signed-off-by: Mason Malone <651224+MasonM@users.noreply.github.com> * fix: prevent dfs sorter infinite recursion on cycle. Fixes argoproj#13395 (cherry-pick argoproj#14391) (argoproj#14420) Signed-off-by: Adrien Delannoy <a.delannoyfr@gmail.com> Co-authored-by: Adrien Delannoy <a.delannoyfr@gmail.com> * chore(deps): bump github.com/expr-lang/expr from 1.16.9 to 1.17.0 (argoproj#14307) (release-3.6) (argoproj#14421) Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * chore(deps)!: update k8s and argo-events (release-3.6) (argoproj#14424) Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: william.vanhevelingen <william.vanhevelingen@acquia.com> Signed-off-by: Mason Malone <651224+MasonM@users.noreply.github.com> Signed-off-by: William Van Hevelingen <william.vanhevelingen@acquia.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: William Van Hevelingen <William.VanHevelingen@acquia.com> Co-authored-by: Mason Malone <651224+MasonM@users.noreply.github.com> * fix: correct retry logic (argoproj#13734) (release-3.6) (argoproj#14428) Signed-off-by: isubasinghe <isitha@pipekit.io> Signed-off-by: Alan Clucas <alan@clucas.org> Co-authored-by: Isitha Subasinghe <isitha@pipekit.io> * fix: manual retries exit handler cleanup. Fixes argoproj#14180 (argoproj#14181) (release-3.6) (argoproj#14429) Signed-off-by: isubasinghe <isitha@pipekit.io> Signed-off-by: Alan Clucas <alan@clucas.org> Co-authored-by: Isitha Subasinghe <isitha@pipekit.io> * fix: correct manual retry logic. Fixes argoproj#14124 (argoproj#14328) (release-3.6) (argoproj#14430) Signed-off-by: oninowang <oninowang@tencent.com> Signed-off-by: Alan Clucas <alan@clucas.org> Co-authored-by: jswxstw <jswxstw@gmail.com> * fix: disable ALPN in argo-server as a workaround (argoproj#14433) Signed-off-by: Alan Clucas <alan@clucas.org> * result of codegen Signed-off-by: Kim <kim.aharfi@codefresh.io> * fix:lint Signed-off-by: Kim <kim.aharfi@codefresh.io> --------- Signed-off-by: Alan Clucas <alan@clucas.org> Signed-off-by: Tim Collins <tim@thecollins.team> Signed-off-by: Roger Peppe <rogpeppe@gmail.com> Signed-off-by: Mason Malone <651224+MasonM@users.noreply.github.com> Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Rohan K <rohankmr414@gmail.com> Signed-off-by: Prabakaran Kumaresshan <4676330+nixphix@users.noreply.github.com> Signed-off-by: Koichi Shimada <jumpe1programming@gmail.com> Signed-off-by: Ryan Currah <ryan@currah.ca> Signed-off-by: Adrien Delannoy <a.delannoyfr@gmail.com> Signed-off-by: william.vanhevelingen <william.vanhevelingen@acquia.com> Signed-off-by: William Van Hevelingen <william.vanhevelingen@acquia.com> Signed-off-by: isubasinghe <isitha@pipekit.io> Signed-off-by: oninowang <oninowang@tencent.com> Signed-off-by: Kim <kim.aharfi@codefresh.io> Co-authored-by: Alan Clucas <alan@clucas.org> Co-authored-by: gcp-cherry-pick-bot[bot] <98988430+gcp-cherry-pick-bot[bot]@users.noreply.github.com> Co-authored-by: Tim Collins <45351296+tico24@users.noreply.github.com> Co-authored-by: Roger Peppe <rogpeppe@gmail.com> Co-authored-by: Mason Malone <651224+MasonM@users.noreply.github.com> Co-authored-by: Vaibhav Kaushik <vaibhavkaushik@salesforce.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Rohan K <rohankmr414@gmail.com> Co-authored-by: Prabakaran Kumaresshan <4676330+nixphix@users.noreply.github.com> Co-authored-by: koichi <51446844+jumpe1@users.noreply.github.com> Co-authored-by: Ryan Currah <ryan@currah.ca> Co-authored-by: Adrien Delannoy <a.delannoyfr@gmail.com> Co-authored-by: William Van Hevelingen <William.VanHevelingen@acquia.com> Co-authored-by: Isitha Subasinghe <isitha@pipekit.io> Co-authored-by: jswxstw <jswxstw@gmail.com>
Fixes #12543 and other retry related issues.
Motivation
The current retry logic is too simplistic, it relies only on resetting nodes by traversing boundary nodes.
This approach does not work, it needs to be a combination of manual traversal and boundary node traversal.
In addition to this the current logic does not take care of various edge cases, for example container sets.
It was clear some rethinking needed to be done such that:
a) edge cases were taken care of.
b) retry logic was simpler to understand and hopefully fix if any further bug were to arise.
Modifications
This is a complete rewrite of the
FormulateRetryWorkflowfunction.Verification
Extensive testing performed manually to verify that behaviour is as expected.