Proposal: failureStrategy for DAGs and expandable templates #5398

Open
simster7 opened this issue Mar 15, 2021 · 1 comment
Labels
area/controller Controller issues, panics area/spec Changes to the workflow specification. area/templates/dag type/feature Feature request

Comments

@simster7
Member

simster7 commented Mar 15, 2021

Summary

Support more advanced strategies for deciding when a DAG or Steps template should fail.

Use Cases

Currently a DAG has failFast, and templates recently gained failFast support as well, for use in conjunction with parallelism, with{Items,Param}, etc. However, this could be extended further.

Something like:

failureStrategy:
  when: "{{numberFailed}} > 2 || {{numberSkipped}} > 0"
  terminateRunningPods: true     # or `false` to allow them to complete
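
To make the proposed `when` condition concrete, here is a minimal Python sketch of how such an expression might be evaluated. It is not Argo's implementation: the helper names (`substitute`, `evaluate`, `should_fail`) are hypothetical, the counters `numberFailed` and `numberSkipped` are taken from the snippet above, and the sketch assumes the expression only contains integer comparisons joined by `&&` and `||`.

```python
import operator
import re

# Comparison operators supported by this sketch (an assumption, not a spec).
_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
        "<=": operator.le, "==": operator.eq, "!=": operator.ne}

_TAG = re.compile(r"\{\{\s*(\w+)\s*\}\}")
_CMP = re.compile(r"^\s*(-?\d+)\s*(>=|<=|==|!=|>|<)\s*(-?\d+)\s*$")


def substitute(expr: str, counts: dict) -> str:
    """Replace each {{name}} tag with its integer value from counts."""
    return _TAG.sub(lambda m: str(counts[m.group(1)]), expr)


def evaluate(expr: str) -> bool:
    """Evaluate integer comparisons joined by && and || (&& binds tighter)."""
    def compare(term: str) -> bool:
        m = _CMP.match(term)
        if m is None:
            raise ValueError(f"unsupported term: {term!r}")
        left, op, right = m.groups()
        return _OPS[op](int(left), int(right))

    # An expression is true if any ||-clause has all of its &&-terms true.
    return any(
        all(compare(term) for term in clause.split("&&"))
        for clause in expr.split("||")
    )


def should_fail(when: str, counts: dict) -> bool:
    """Decide whether the hypothetical failureStrategy would trigger."""
    return evaluate(substitute(when, counts))


WHEN = "{{numberFailed}} > 2 || {{numberSkipped}} > 0"
print(should_fail(WHEN, {"numberFailed": 1, "numberSkipped": 0}))  # False
print(should_fail(WHEN, {"numberFailed": 3, "numberSkipped": 0}))  # True
print(should_fail(WHEN, {"numberFailed": 0, "numberSkipped": 1}))  # True
```

With this reading, one or two failures are tolerated, while a third failure or any skipped node fails the template. What `terminateRunningPods` would do at that point is left out of the sketch.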

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@simster7 simster7 added the type/feature Feature request label Mar 15, 2021
@simster7 simster7 changed the title from "failureStrategy" to "Proposal: failureStrategy for DAGs and expandable templates" Mar 15, 2021
@jli

jli commented Aug 19, 2022

My ideal failFast behavior is: the workflow should retry tasks that have retryStrategy set, but once all retries are exhausted, the workflow should stop scheduling other tasks and fail.

I'm on Argo 3.1, and I can't get this behavior. Is it possible, or is it expected that this doesn't work?

I tried the following permutations:

  • by default without any explicit configuration, my workflows don't fail fast: if a task fails, Argo tries to run the rest of the workflow instead of failing sooner.
  • when I add failFast: true to all my dag templates*, I do get failFast behavior, but then failed tasks don't retry, and the first task failure results in the workflow failing. This is failing too fast.
  • when I add failFast: true to only 1 of the dag templates, retries work but then I don't get failFast behavior. I tried adding failFast to each dag template separately, and all had the same behavior.

*all my dag templates are these 4:

  1. the outermost dag. this is the entrypoint, and just calls exit-handler-1
  2. exit-handler-1: just contains subgraph-2
  3. subgraph-2: contains my actual workflow tasks
  4. for-loop-4: i have a for-loop for 5-fold cross validation
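
The behavior asked for at the top of this comment (retry a failing task until its retries are exhausted, then stop scheduling and fail the workflow) can be sketched in Python as follows. This is a toy model, not Argo's controller logic: `Task`, `run_workflow`, and `flaky` are hypothetical names, `retry_limit` is only loosely modeled on a retry limit, and tasks run sequentially rather than as a real DAG.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    name: str
    run: Callable[[], bool]  # returns True on success
    retry_limit: int = 0     # extra attempts allowed after the first failure


def run_workflow(tasks: list) -> tuple:
    """Run tasks in order; retry each up to retry_limit, then fail fast."""
    attempts = {}
    for task in tasks:
        succeeded = False
        for _ in range(1 + task.retry_limit):
            attempts[task.name] = attempts.get(task.name, 0) + 1
            if task.run():
                succeeded = True
                break
        if not succeeded:
            # Retries exhausted: stop scheduling the remaining tasks.
            return "Failed", attempts
    return "Succeeded", attempts


def flaky(failures: int) -> Callable[[], bool]:
    """Build a task body that fails `failures` times, then succeeds."""
    remaining = [failures]

    def body() -> bool:
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return body


tasks = [
    Task("train", flaky(2), retry_limit=2),          # succeeds on 3rd attempt
    Task("evaluate", lambda: False, retry_limit=1),  # fails both attempts
    Task("publish", lambda: True),                   # never scheduled
]
status, attempts = run_workflow(tasks)
print(status, attempts)  # Failed {'train': 3, 'evaluate': 2}
```

Here `train` recovers through retries, `evaluate` exhausts its retries and fails the workflow, and `publish` is never started, which is the middle ground between never failing fast and failing on the first error.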

@agilgur5 agilgur5 added area/controller Controller issues, panics area/spec Changes to the workflow specification. and removed area/templates/steps labels Oct 16, 2024
4 participants