Skip to content

Conversation

@isubasinghe
Copy link
Member

@isubasinghe isubasinghe commented Oct 10, 2024

Fixes #12543 and other retry related issues.

Motivation

The current retry logic is too simplistic, it relies only on resetting nodes by traversing boundary nodes.
This approach does not work, it needs to be a combination of manual traversal and boundary node traversal.

In addition to this the current logic does not take care of various edge cases, for example container sets.

It was clear some rethinking needed to be done such that:
a) edge cases were taken care of.
b) retry logic was simpler to understand and hopefully fix if any further bug were to arise.

Modifications

This is a complete rewrite of the FormulateRetryWorkflow function.

Verification

Extensive testing performed manually to verify that behaviour is as expected.

@Joibel
Copy link
Member

Joibel commented Oct 17, 2024

This might supersede #12105

Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
@isubasinghe isubasinghe marked this pull request as ready for review October 21, 2024 07:12
Comment on lines 3798 to 3807
func TestNestedDAG(t *testing.T) {
require := require.New(t)
wf := wfv1.MustUnmarshalWorkflow(nestedDAG)

newWf, podsToDelete, err := FormulateRetryWorkflow(context.Background(), wf, true, "id=dag-nested-zxlc2-744943701", []string{})
require.NoError(err)
_ = newWf
_ = podsToDelete

}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should actually test for individual node status as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

`

func TestStepsRetryWorkflow(t *testing.T) {
assert := assert.New(t)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you do this for assert and require here and elsewhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I was waiting for an opinion on #13734 (comment). I plan to use these when I fill in the tests.

I was unsure if it was worth the effort of these tests. I wrote these to test edge conditions and make sure they worked as expected. They will prevent regressions though, so I will add them.

wfClient := argofake.NewSimpleClientset().ArgoprojV1alpha1().Workflows("my-ns")
createdTime := metav1.Time{Time: time.Now().Add(-1 * time.Second).UTC()}
finishedTime := metav1.Time{Time: createdTime.Add(time.Second * 2)}
t.Run("Steps", func(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropping this entire test feels wrong, why can't it be checked still?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not really testable easily, we cannot establish a dag for these status fields. The root node is not present.
This will error out complaining that the root node couldn't be found.

Comment on lines 3798 to 3807
func TestNestedDAG(t *testing.T) {
require := require.New(t)
wf := wfv1.MustUnmarshalWorkflow(nestedDAG)

newWf, podsToDelete, err := FormulateRetryWorkflow(context.Background(), wf, true, "id=dag-nested-zxlc2-744943701", []string{})
require.NoError(err)
_ = newWf
_ = podsToDelete

}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.


func isGroupNode(node wfv1.NodeStatus) bool {
return node.Type == wfv1.NodeTypeDAG || node.Type == wfv1.NodeTypeTaskGroup || node.Type == wfv1.NodeTypeStepGroup || node.Type == wfv1.NodeTypeSteps
func consumeTill(n *node, should tillFn, resetFunc resetFn) (*node, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Rename to resetUntil

reset instead of consume, as you call the resetFunc. Or call that consumeFunc

till has more meanings, until is much clearer if English isn't your first language.

Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
default:
// Do not allow retry of workflows with pods in Running/Pending phase
return nil, nil, errors.InternalErrorf("Workflow cannot be retried with node %s in %s phase", node.Name, node.Phase)
deleteNodes, err := getNodeIDsToResetNoChildren(restartSuccessful, nodeFieldSelector, wf.Status.Nodes)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would I have to set the --restart-successful in order to identify the nodes that need to be deleted, even if I only wish to retry a specific failed node?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this wouldn't be the case

Copy link
Member

@Joibel Joibel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

addToDelete(curr.n.ID, true)
children := getChildren(curr)
for childID := range children {
addToDelete(childID, true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I follow and agree now. Thanks.

@isubasinghe isubasinghe merged commit 497f338 into argoproj:main Nov 21, 2024
@Joibel Joibel deleted the fix-retry-logic branch November 29, 2024 10:17
Joibel pushed a commit to Joibel/argo-workflows that referenced this pull request Dec 5, 2024
Signed-off-by: isubasinghe <isitha@pipekit.io>
Joibel pushed a commit to Joibel/argo-workflows that referenced this pull request Dec 6, 2024
Signed-off-by: isubasinghe <isitha@pipekit.io>
Joibel pushed a commit to Joibel/argo-workflows that referenced this pull request Dec 6, 2024
Signed-off-by: isubasinghe <isitha@pipekit.io>
@shuangkun
Copy link
Member

Could this PR be cherry-picked into the 3.6.5 branch? Some customers have inquired about it, but I haven't found any prior discussions related to the 3.6.x patches. Thanks @Joibel

@Joibel
Copy link
Member

Joibel commented Mar 6, 2025

I've created a patch releases discussion #14259.

I haven't backported this yet, it feels like #14124 should be addressed before this is backported otherwise we're knowingly introducing a regression.

@Joibel
Copy link
Member

Joibel commented Apr 23, 2025

/cherry-pick release-3.6

Joibel pushed a commit that referenced this pull request Apr 28, 2025
Signed-off-by: isubasinghe <isitha@pipekit.io>
(cherry picked from commit 497f338)
Joibel pushed a commit that referenced this pull request Apr 28, 2025
Signed-off-by: isubasinghe <isitha@pipekit.io>
(cherry picked from commit 497f338)
Signed-off-by: Alan Clucas <alan@clucas.org>
Joibel pushed a commit that referenced this pull request Apr 28, 2025
Signed-off-by: isubasinghe <isitha@pipekit.io>
(cherry picked from commit 497f338)
Signed-off-by: Alan Clucas <alan@clucas.org>
Joibel pushed a commit that referenced this pull request Apr 28, 2025
Signed-off-by: isubasinghe <isitha@pipekit.io>
(cherry picked from commit 497f338)
Signed-off-by: Alan Clucas <alan@clucas.org>
Joibel added a commit that referenced this pull request Apr 28, 2025
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: Alan Clucas <alan@clucas.org>
Co-authored-by: Isitha Subasinghe <isitha@pipekit.io>
Joibel pushed a commit that referenced this pull request Apr 28, 2025
Signed-off-by: isubasinghe <isitha@pipekit.io>
(cherry picked from commit 497f338)
Signed-off-by: Alan Clucas <alan@clucas.org>
kim-codefresh added a commit to codefresh-io/argo-workflows that referenced this pull request May 20, 2025
…abilities fixes (Cr 28355) (#358)

* fix: bump deps for k8schain to fix ecr-login (argoproj#14008) (release-3.6 cherry-pick) (argoproj#14174)

* fix(ci): python sdk release process (release-3.6) (argoproj#14183)

Signed-off-by: Alan Clucas <alan@clucas.org>

* docs: clarify qps/burst on controller (cherry-pick argoproj#14190) (argoproj#14192)

Signed-off-by: Tim Collins <tim@thecollins.team>
Co-authored-by: Tim Collins <45351296+tico24@users.noreply.github.com>

* fix(api/jsonschema): use unchanging JSON Schema version (cherry-pick argoproj#14092) (argoproj#14256)

Signed-off-by: Roger Peppe <rogpeppe@gmail.com>
Co-authored-by: Roger Peppe <rogpeppe@gmail.com>

* fix(api/jsonschema): use working `$id` (cherry-pick argoproj#14257) (argoproj#14258)

Signed-off-by: Roger Peppe <rogpeppe@gmail.com>
Co-authored-by: Roger Peppe <rogpeppe@gmail.com>

* docs: autogenerate tested k8s versions and centralize config (argoproj#14176) (release-3.6) (argoproj#14262)

Signed-off-by: Mason Malone <651224+MasonM@users.noreply.github.com>
Signed-off-by: Alan Clucas <alan@clucas.org>
Co-authored-by: Mason Malone <651224+MasonM@users.noreply.github.com>

* chore(deps): bump minio-go to newer version (argoproj#14185) (release-3.6) (argoproj#14261)

Co-authored-by: Vaibhav Kaushik <vaibhavkaushik@salesforce.com>

* fix: split pod controller from workflow controller (argoproj#14129) (release-3.6) (argoproj#14263)

* chore(deps): fix snyk (argoproj#14264) (release-3.6) (argoproj#14268)

* chore: revert to correct k8s versions

Accidental bump from argoproj#14176 cherry-pick

Signed-off-by: Alan Clucas <alan@clucas.org>

* chore(deps): bump github.com/go-jose/go-jose/v3 from 3.0.3 to 3.0.4 in the go_modules group (cherry-pick argoproj#14231) (argoproj#14269)

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* fix: wait for workflow informer to sync before pod informer (cherry-pick argoproj#14248) (argoproj#14266)

Signed-off-by: Rohan K <rohankmr414@gmail.com>
Co-authored-by: Rohan K <rohankmr414@gmail.com>

* fix(cli): remove red from log colour selection. Fixes argoproj#6740 (cherry-pick argoproj#14215) (argoproj#14278)

Signed-off-by: Prabakaran Kumaresshan <4676330+nixphix@users.noreply.github.com>
Co-authored-by: Prabakaran Kumaresshan <4676330+nixphix@users.noreply.github.com>

* fix: correct semaphore configmap keys for multiple semaphores (argoproj#14184) (release-3.6) (argoproj#14281)

* fix: don't print help for non-validation errors. Fixes argoproj#14234 (cherry-pick argoproj#14249) (argoproj#14283)

Signed-off-by: Koichi Shimada <jumpe1programming@gmail.com>
Signed-off-by: Mason Malone <651224+MasonM@users.noreply.github.com>
Co-authored-by: koichi <51446844+jumpe1@users.noreply.github.com>
Co-authored-by: Mason Malone <651224+MasonM@users.noreply.github.com>

* docs: fix kubernetes versions (release-3.6) (argoproj#14273)

Signed-off-by: Alan Clucas <alan@clucas.org>

* fix(workflow/sync): use RWMutex to prevent concurrent map access (cherry-pick argoproj#14321) (argoproj#14322)

Signed-off-by: Ryan Currah <ryan@currah.ca>
Co-authored-by: Ryan Currah <ryan@currah.ca>

* chore(lint): update golangci-lint to 2.1.1 (argoproj#14390) (cherry-pick release-3.6) (argoproj#14417)

* chore: bump golang 1.23->1.24 (argoproj#14385) (cherry-pick release-3.6) (argoproj#14418)

* fix: gracefully handle invalid CronWorkflows and simplify logic.  (cherry-pick argoproj#14197) (argoproj#14419)

Signed-off-by: Mason Malone <651224+MasonM@users.noreply.github.com>

* fix: prevent dfs sorter infinite recursion on cycle. Fixes argoproj#13395 (cherry-pick argoproj#14391) (argoproj#14420)

Signed-off-by: Adrien Delannoy <a.delannoyfr@gmail.com>
Co-authored-by: Adrien Delannoy <a.delannoyfr@gmail.com>

* chore(deps): bump github.com/expr-lang/expr from 1.16.9 to 1.17.0 (argoproj#14307) (release-3.6) (argoproj#14421)

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps)!: update k8s and argo-events (release-3.6) (argoproj#14424)

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: william.vanhevelingen <william.vanhevelingen@acquia.com>
Signed-off-by: Mason Malone <651224+MasonM@users.noreply.github.com>
Signed-off-by: William Van Hevelingen <william.vanhevelingen@acquia.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: William Van Hevelingen <William.VanHevelingen@acquia.com>
Co-authored-by: Mason Malone <651224+MasonM@users.noreply.github.com>

* fix: correct retry logic (argoproj#13734) (release-3.6) (argoproj#14428)

Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: Alan Clucas <alan@clucas.org>
Co-authored-by: Isitha Subasinghe <isitha@pipekit.io>

* fix: manual retries exit handler cleanup. Fixes argoproj#14180 (argoproj#14181) (release-3.6) (argoproj#14429)

Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: Alan Clucas <alan@clucas.org>
Co-authored-by: Isitha Subasinghe <isitha@pipekit.io>

* fix: correct manual retry logic. Fixes argoproj#14124 (argoproj#14328) (release-3.6) (argoproj#14430)

Signed-off-by: oninowang <oninowang@tencent.com>
Signed-off-by: Alan Clucas <alan@clucas.org>
Co-authored-by: jswxstw <jswxstw@gmail.com>

* fix: disable ALPN in argo-server as a workaround (argoproj#14433)

Signed-off-by: Alan Clucas <alan@clucas.org>

* result of codegen

Signed-off-by: Kim <kim.aharfi@codefresh.io>

* fix:lint

Signed-off-by: Kim <kim.aharfi@codefresh.io>

---------

Signed-off-by: Alan Clucas <alan@clucas.org>
Signed-off-by: Tim Collins <tim@thecollins.team>
Signed-off-by: Roger Peppe <rogpeppe@gmail.com>
Signed-off-by: Mason Malone <651224+MasonM@users.noreply.github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Rohan K <rohankmr414@gmail.com>
Signed-off-by: Prabakaran Kumaresshan <4676330+nixphix@users.noreply.github.com>
Signed-off-by: Koichi Shimada <jumpe1programming@gmail.com>
Signed-off-by: Ryan Currah <ryan@currah.ca>
Signed-off-by: Adrien Delannoy <a.delannoyfr@gmail.com>
Signed-off-by: william.vanhevelingen <william.vanhevelingen@acquia.com>
Signed-off-by: William Van Hevelingen <william.vanhevelingen@acquia.com>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: oninowang <oninowang@tencent.com>
Signed-off-by: Kim <kim.aharfi@codefresh.io>
Co-authored-by: Alan Clucas <alan@clucas.org>
Co-authored-by: gcp-cherry-pick-bot[bot] <98988430+gcp-cherry-pick-bot[bot]@users.noreply.github.com>
Co-authored-by: Tim Collins <45351296+tico24@users.noreply.github.com>
Co-authored-by: Roger Peppe <rogpeppe@gmail.com>
Co-authored-by: Mason Malone <651224+MasonM@users.noreply.github.com>
Co-authored-by: Vaibhav Kaushik <vaibhavkaushik@salesforce.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Rohan K <rohankmr414@gmail.com>
Co-authored-by: Prabakaran Kumaresshan <4676330+nixphix@users.noreply.github.com>
Co-authored-by: koichi <51446844+jumpe1@users.noreply.github.com>
Co-authored-by: Ryan Currah <ryan@currah.ca>
Co-authored-by: Adrien Delannoy <a.delannoyfr@gmail.com>
Co-authored-by: William Van Hevelingen <William.VanHevelingen@acquia.com>
Co-authored-by: Isitha Subasinghe <isitha@pipekit.io>
Co-authored-by: jswxstw <jswxstw@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Retrying specific failed node does not work

5 participants