-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: fix fastFail flag bug. Fixes #10312 #11992
base: main
Are you sure you want to change the base?
Conversation
67dab7c
to
f053dea
Compare
594dc66
to
4aa9011
Compare
78c6197
to
9b1766f
Compare
@agilgur5 please review this PR |
@@ -10,6 +10,7 @@ const ( | |||
WorkflowSucceeded WorkflowPhase = "Succeeded" | |||
WorkflowFailed WorkflowPhase = "Failed" // it maybe that the workflow was terminated | |||
WorkflowError WorkflowPhase = "Error" | |||
WorkflowCanceled WorkflowPhase = "Canceled" // it is an intermediate state when enable failFast. Workflow phase will be changed from Canceled to Succeeded/Failed/Error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if we need a new phase for this since it's a one-off use case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if we need a new phase for this since it's a one-off use case
This is indeed a controversial place, this status just is used to record that this workflow has failed fast. If this status is unreasonable, I can also add hasFailedFast
field in WorkflowStatus
struct. Do you have any suggestions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@terrytangyuan I'd love for this to become a permanent phase as requested in #11849. This change would allow for that, as a side effect. If we could leave as an actual phase, in another PR we could update the UI and API to allow filtering of this phase. 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Introducing a new phase would be a much bigger change with potential side effects since we would need to go through the codebase to make sure we handle the phases correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood. Just a brief look on my own has shown it is a substantial change. Still might take a stab at it. 😃 will see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my personal understanding of the source code logic, adding such a phase can be seen from two perspectives. If it is only an intermediate state, it will not actually have an impact. If it is the final state, it requires comprehensive code testing.
So what should we do next about this pr? How about add hasFailedFast
field in WorkflowStatus
struct? I need a field in status to indicate that the current workflow needs to fail fast
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I did mention it would require a sizeable refactor in #11849 (comment). It should almost certainly be its own change.
RE: failFast
, I don't think it would be an appropriate place for a Workflow-level Cancelled
status either. failFast
could potentially use a Node-level Cancelled
status (as opposed to Skipped
) as mentioned in #10312 (comment).
So what should we do next about this pr? How about add
hasFailedFast
field inWorkflowStatus
struct? I need a field in status to indicate that the current workflow needs to fail fast
Are you unable to infer the status without such a field? If failFast
is true
or unset, and a node in a branch of the DAG has failed, then we can infer that the rest of the Workflow should fail fast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you unable to infer the status without such a field? If
failFast
istrue
or unset, and a node in a branch of the DAG has failed, then we can infer that the rest of the Workflow should fail fast.
At this time the logic of DAG is a loop: ExecuteDAG
->AssessDAGPhase
->End/Continue running. A BFS traversal and the result calculation of the target tasks has already been done in AssessDAGPhase
. My design is to save the results of the last traversal to a field in the status, so that the next loop will directly obtain the fastFail judgment result, without the need for another iteration and calculation of the target tasks.
I want to know which design you prefer so that I can make further modifications
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@agilgur5 I have modified logic, please review the logic code again
Just I would like to make sure we understand the functionality of flags
Here is the purpose/functionality of // This flag is for DAG logic. The DAG logic has a built-in "fail fast" feature to stop scheduling new steps,
// as soon as it detects that one of the DAG nodes is failed. Then it waits until all DAG nodes are completed
// before failing the DAG itself.
// The FailFast flag default is true, if set to false, it will allow a DAG to run all branches of the DAG to
// completion (either success or failure), regardless of the failed outcomes of branches in the DAG.
// More info and example about this feature at https://github.com/argoproj/argo-workflows/issues/1442
FailFast *bool `json:"failFast,omitempty" protobuf:"varint,3,opt,name=failFast"` // ContinueOn makes argo to proceed with the following task even if this task fails.
// Errors and Failed states can be specified
ContinueOn *ContinueOn `json:"continueOn,omitempty" protobuf:"bytes,10,opt,name=continueOn"` failFast: false if DAG fails before executing BDE, it is a bag and it should be fixed where we are evaluating. Are HOOKs also not working in |
The previous logic was |
@sarabala1979 the issue and comments state that the I was out sick for a bit per my GH status, still catching up. Thanks to other folks reviewing in the interim! Also left a comment in-line above |
8210a1e
to
6fa3e4c
Compare
1fdb3f6
to
92e22ea
Compare
Signed-off-by: Goober <chenhao86899@gmail.com>
Signed-off-by: Goober <chenhao86899@gmail.com>
Fixes #10312
Modifications
v1:
dag.go
: funcassessDAGPhase
funcevaluateDependsLogic
WorkflowCanceled
phasev2:
dag.go
: funcassessDAGPhase
funcexecuteDAGTask
dag.go
: funccheckDAGFastFailed
Verification