Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: fix fastFail flag bug. Fixes #10312 #11992

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

JasonChen86899
Copy link

@JasonChen86899 JasonChen86899 commented Oct 13, 2023

Fixes #10312

Modifications

v1:

  1. Modify dag.go: funcassessDAGPhase funcevaluateDependsLogic
  2. Modify 'workflow_phase.go': add WorkflowCanceled phase

v2:

  1. Modify dag.go: funcassessDAGPhase funcexecuteDAGTask
  2. Add func in dag.go: funccheckDAGFastFailed

Verification

  1. test fail fast flag and it takes effect
image image

@JasonChen86899
Copy link
Author

@agilgur5

@JasonChen86899
Copy link
Author

JasonChen86899 commented Oct 20, 2023

@agilgur5 please review this PR

@@ -10,6 +10,7 @@ const (
WorkflowSucceeded WorkflowPhase = "Succeeded"
WorkflowFailed WorkflowPhase = "Failed" // it maybe that the workflow was terminated
WorkflowError WorkflowPhase = "Error"
WorkflowCanceled WorkflowPhase = "Canceled" // it is an intermediate state when enable failFast. Workflow phase will be changed from Canceled to Succeeded/Failed/Error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if we need a new phase for this since it's a one-off use case

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if we need a new phase for this since it's a one-off use case

This is indeed a controversial place, this status just is used to record that this workflow has failed fast. If this status is unreasonable, I can also add hasFailedFast field in WorkflowStatus struct. Do you have any suggestions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@terrytangyuan I'd love for this to become a permanent phase as requested in #11849. This change would allow for that, as a side effect. If we could leave as an actual phase, in another PR we could update the UI and API to allow filtering of this phase. 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Introducing a new phase would be a much bigger change with potential side effects since we would need to go through the codebase to make sure we handle the phases correctly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. Just a brief look on my own has shown it is a substantial change. Still might take a stab at it. 😃 will see.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my personal understanding of the source code logic, adding such a phase can be seen from two perspectives. If it is only an intermediate state, it will not actually have an impact. If it is the final state, it requires comprehensive code testing.

So what should we do next about this pr? How about add hasFailedFast field in WorkflowStatus struct? I need a field in status to indicate that the current workflow needs to fail fast

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I did mention it would require a sizeable refactor in #11849 (comment). It should almost certainly be its own change.

RE: failFast, I don't think it would be an appropriate place for a Workflow-level Cancelled status either. failFast could potentially use a Node-level Cancelled status (as opposed to Skipped) as mentioned in #10312 (comment).

So what should we do next about this pr? How about add hasFailedFast field in WorkflowStatus struct? I need a field in status to indicate that the current workflow needs to fail fast

Are you unable to infer the status without such a field? If failFast is true or unset, and a node in a branch of the DAG has failed, then we can infer that the rest of the Workflow should fail fast.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you unable to infer the status without such a field? If failFast is true or unset, and a node in a branch of the DAG has failed, then we can infer that the rest of the Workflow should fail fast.

At this time the logic of DAG is a loop: ExecuteDAG ->AssessDAGPhase ->End/Continue running. A BFS traversal and the result calculation of the target tasks has already been done in AssessDAGPhase. My design is to save the results of the last traversal to a field in the status, so that the next loop will directly obtain the fastFail judgment result, without the need for another iteration and calculation of the target tasks.

I want to know which design you prefer so that I can make further modifications

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agilgur5 I have modified logic, please review the logic code again

@sarabala1979
Copy link
Member

sarabala1979 commented Oct 23, 2023

Just I would like to make sure we understand the functionality of flags

When I was using FailFast, I found that this parameter was invalid. Regardless of setting True/False, step C failed, and BDE would execute it.'

FailFast means fail the DAG soon if any of the DAG nodes fail irrespective of the branch. The original issue can be addressed using continueOn if they want to skip failed and continue the rest of the branches.

Here is the purpose/functionality of fastfail and continueon

	// This flag is for DAG logic. The DAG logic has a built-in "fail fast" feature to stop scheduling new steps,
	// as soon as it detects that one of the DAG nodes is failed. Then it waits until all DAG nodes are completed
	// before failing the DAG itself.
	// The FailFast flag default is true,  if set to false, it will allow a DAG to run all branches of the DAG to
	// completion (either success or failure), regardless of the failed outcomes of branches in the DAG.
	// More info and example about this feature at https://github.com/argoproj/argo-workflows/issues/1442
	FailFast *bool `json:"failFast,omitempty" protobuf:"varint,3,opt,name=failFast"`
	// ContinueOn makes argo to proceed with the following task even if this task fails.
	// Errors and Failed states can be specified
	ContinueOn *ContinueOn `json:"continueOn,omitempty" protobuf:"bytes,10,opt,name=continueOn"`

failFast: false

if DAG fails before executing BDE, it is a bag and it should be fixed where we are evaluating.

Are HOOKs also not working in failFast scenario?

@JasonChen86899
Copy link
Author

Are HOOKs also not working in failFast scenario?

The previous logic was failFast not working but ensure that Hooks would be executed. In this PR modification, I also ensured that Hooks would be executed, but the Hooks of the Skipped node would not be executed. I think this should be logical and all test cases passed.

@agilgur5
Copy link
Member

agilgur5 commented Oct 25, 2023

FailFast means fail the DAG soon if any of the DAG nodes fail irrespective of the branch. The original issue can be addressed using continueOn if they want to skip failed and continue the rest of the branches.

@sarabala1979 the issue and comments state that the failFast flag has no effect currently, regardless of what it is set to. They set it as a test, not as a use-case. Users wanted the rest of the branches to fail fast but they don't, despite documentation of the feature; that is a bug

@agilgur5

I was out sick for a bit per my GH status, still catching up. Thanks to other folks reviewing in the interim! Also left a comment in-line above

Signed-off-by: Goober <chenhao86899@gmail.com>
Signed-off-by: Goober <chenhao86899@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

failFast field doesn't work
5 participants