Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: correct retry logic #13734

Open
wants to merge 10 commits into
base: main
Choose a base branch
from
Open

Conversation

isubasinghe
Copy link
Member

@isubasinghe isubasinghe commented Oct 10, 2024

Fixes #12543 and other retry related issues.

Motivation

The current retry logic is too simplistic, it relies only on resetting nodes by traversing boundary nodes.
This approach does not work, it needs to be a combination of manual traversal and boundary node traversal.

In addition to this the current logic does not take care of various edge cases, for example container sets.

It was clear some rethinking needed to be done such that:
a) edge cases were taken care of.
b) retry logic was simpler to understand and hopefully fix if any further bug were to arise.

Modifications

This is a complete rewrite of the FormulateRetryWorkflow function.

Verification

Extensive testing performed manually to verify that behaviour is as expected.

@agilgur5 agilgur5 added the area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries label Oct 10, 2024
@Joibel
Copy link
Member

Joibel commented Oct 17, 2024

This might supersede #12105

Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
@isubasinghe isubasinghe marked this pull request as ready for review October 21, 2024 07:12
Comment on lines +3798 to +3807
func TestNestedDAG(t *testing.T) {
require := require.New(t)
wf := wfv1.MustUnmarshalWorkflow(nestedDAG)

newWf, podsToDelete, err := FormulateRetryWorkflow(context.Background(), wf, true, "id=dag-nested-zxlc2-744943701", []string{})
require.NoError(err)
_ = newWf
_ = podsToDelete

}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should actually test for individual node status as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

}

assert.Equal(1, numNilParent)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Blank line

`

func TestStepsRetryWorkflow(t *testing.T) {
assert := assert.New(t)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you do this for assert and require here and elsewhere?

@@ -1026,51 +1025,6 @@ func TestRetryExitHandler(t *testing.T) {
func TestFormulateRetryWorkflow(t *testing.T) {
ctx := context.Background()
wfClient := argofake.NewSimpleClientset().ArgoprojV1alpha1().Workflows("my-ns")
createdTime := metav1.Time{Time: time.Now().Add(-1 * time.Second).UTC()}
finishedTime := metav1.Time{Time: createdTime.Add(time.Second * 2)}
t.Run("Steps", func(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropping this entire test feels wrong, why can't it be checked still?

Comment on lines +3798 to +3807
func TestNestedDAG(t *testing.T) {
require := require.New(t)
wf := wfv1.MustUnmarshalWorkflow(nestedDAG)

newWf, podsToDelete, err := FormulateRetryWorkflow(context.Background(), wf, true, "id=dag-nested-zxlc2-744943701", []string{})
require.NoError(err)
_ = newWf
_ = podsToDelete

}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

}

func isGroupNode(node wfv1.NodeStatus) bool {
return node.Type == wfv1.NodeTypeDAG || node.Type == wfv1.NodeTypeTaskGroup || node.Type == wfv1.NodeTypeStepGroup || node.Type == wfv1.NodeTypeSteps
func consumeTill(n *node, should tillFn, resetFunc resetFn) (*node, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Rename to resetUntil

reset instead of consume, as you call the resetFunc. Or call that consumeFunc

till has more meanings, until is much clearer if English isn't your first language.

nodesToReset[nodeID] = true
}

addToDelete := func(nodeID string, addToNode bool) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addToNode is always true, so simplify

return nodesToReset, nodesToDelete, nil
}

addToReset := func(nodeID string, addToNode bool) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addToNode is always false, so simplify

return false
}

// dagSortedNodes sorts the nodes based on the order they were created, omits onExitNode
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically not in creation order as children are all appended batched without regard to order.

If task-1 finishes later than task-2 then task-1's children may get created later than task-2's but that won't be reflected here.

nodeIDsToReset, err := getNodeIDsToReset(restartSuccessful, nodeFieldSelector, wf.Status.Nodes)
if err != nil {
return nil, nil, err
for k := range toReset {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Swap k for nodeName. Short variable names for array indices are OK, but here it's a map key and more confusing if not named for what it is.

return nil, nil, errors.InternalErrorf("Workflow cannot be retried with node %s in %s phase", node.Name, node.Phase)
}
if n.Name == wf.Name && !shouldRetryFailedType(n.Type) {
newWf.Status.Nodes.Set(id, resetNode(*n.DeepCopy()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this DeepCopy, it seems to me that resetNode will cause a copy?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Retrying specific failed node does not work
3 participants