-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: correct retry logic #13734
base: main
Are you sure you want to change the base?
fix: correct retry logic #13734
Conversation
This might supersede #12105 |
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
Signed-off-by: isubasinghe <isitha@pipekit.io>
abe5b8f
to
57e7638
Compare
Signed-off-by: isubasinghe <isitha@pipekit.io>
func TestNestedDAG(t *testing.T) { | ||
require := require.New(t) | ||
wf := wfv1.MustUnmarshalWorkflow(nestedDAG) | ||
|
||
newWf, podsToDelete, err := FormulateRetryWorkflow(context.Background(), wf, true, "id=dag-nested-zxlc2-744943701", []string{}) | ||
require.NoError(err) | ||
_ = newWf | ||
_ = podsToDelete | ||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should actually test for individual node status as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree.
} | ||
|
||
assert.Equal(1, numNilParent) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Blank line
` | ||
|
||
func TestStepsRetryWorkflow(t *testing.T) { | ||
assert := assert.New(t) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you do this for assert and require here and elsewhere?
@@ -1026,51 +1025,6 @@ func TestRetryExitHandler(t *testing.T) { | |||
func TestFormulateRetryWorkflow(t *testing.T) { | |||
ctx := context.Background() | |||
wfClient := argofake.NewSimpleClientset().ArgoprojV1alpha1().Workflows("my-ns") | |||
createdTime := metav1.Time{Time: time.Now().Add(-1 * time.Second).UTC()} | |||
finishedTime := metav1.Time{Time: createdTime.Add(time.Second * 2)} | |||
t.Run("Steps", func(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dropping this entire test feels wrong, why can't it be checked still?
func TestNestedDAG(t *testing.T) { | ||
require := require.New(t) | ||
wf := wfv1.MustUnmarshalWorkflow(nestedDAG) | ||
|
||
newWf, podsToDelete, err := FormulateRetryWorkflow(context.Background(), wf, true, "id=dag-nested-zxlc2-744943701", []string{}) | ||
require.NoError(err) | ||
_ = newWf | ||
_ = podsToDelete | ||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree.
} | ||
|
||
func isGroupNode(node wfv1.NodeStatus) bool { | ||
return node.Type == wfv1.NodeTypeDAG || node.Type == wfv1.NodeTypeTaskGroup || node.Type == wfv1.NodeTypeStepGroup || node.Type == wfv1.NodeTypeSteps | ||
func consumeTill(n *node, should tillFn, resetFunc resetFn) (*node, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Rename to resetUntil
reset
instead of consume, as you call the resetFunc
. Or call that consumeFunc
till
has more meanings, until
is much clearer if English isn't your first language.
nodesToReset[nodeID] = true | ||
} | ||
|
||
addToDelete := func(nodeID string, addToNode bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addToNode
is always true
, so simplify
return nodesToReset, nodesToDelete, nil | ||
} | ||
|
||
addToReset := func(nodeID string, addToNode bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addToNode
is always false
, so simplify
return false | ||
} | ||
|
||
// dagSortedNodes sorts the nodes based on the order they were created, omits onExitNode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically not in creation order as children are all appended batched without regard to order.
If task-1
finishes later than task-2
then task-1's children may get created later than task-2's but that won't be reflected here.
nodeIDsToReset, err := getNodeIDsToReset(restartSuccessful, nodeFieldSelector, wf.Status.Nodes) | ||
if err != nil { | ||
return nil, nil, err | ||
for k := range toReset { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Swap k
for nodeName
. Short variable names for array indices are OK, but here it's a map key and more confusing if not named for what it is.
return nil, nil, errors.InternalErrorf("Workflow cannot be retried with node %s in %s phase", node.Name, node.Phase) | ||
} | ||
if n.Name == wf.Name && !shouldRetryFailedType(n.Type) { | ||
newWf.Status.Nodes.Set(id, resetNode(*n.DeepCopy())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this DeepCopy
, it seems to me that resetNode
will cause a copy?
Fixes #12543 and other retry related issues.
Motivation
The current retry logic is too simplistic, it relies only on resetting nodes by traversing boundary nodes.
This approach does not work, it needs to be a combination of manual traversal and boundary node traversal.
In addition to this the current logic does not take care of various edge cases, for example container sets.
It was clear some rethinking needed to be done such that:
a) edge cases were taken care of.
b) retry logic was simpler to understand and hopefully fix if any further bug were to arise.
Modifications
This is a complete rewrite of the
FormulateRetryWorkflow
function.Verification
Extensive testing performed manually to verify that behaviour is as expected.