Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Automatically suspend CronWorkflow for repeated errors in child workflows #5659

Open
terrytangyuan opened this issue Apr 12, 2021 · 8 comments

Comments

@terrytangyuan
Copy link
Member

terrytangyuan commented Apr 12, 2021

Summary

What change needs making?

Automatically suspend a CronWorkflow for repeated errors in its workflows.

Use Cases

When would you use this?

When there are repeated errors observed in the child workflows, future child workflows would likely also fail. In this case, the controller could automatically suspend CronWorkflow from submitting new child workflows to avoid waste of computational resources.

Implementation =-wise, we could:

  1. Introduce a field suspendFailedJobsLimit bool (probably need a better name) in CronWorkflow to specify the maximum number of consecutive failed workflows before we suspend the corresponding CronWorkflow.
  2. Record statuses of child workflows in CronController.
  3. Suspend the corresponding CronWorkflow when the threshold is reached.

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@terrytangyuan terrytangyuan added the type/feature Feature request label Apr 12, 2021
@terrytangyuan
Copy link
Member Author

@alexec I can work on this if you think this is a reasonable feature to have. We currently have a hacky workaround and it would be good to implement the logic in the controller itself.

@simster7
Copy link
Member

I think this is a good idea

@terrytangyuan
Copy link
Member Author

I created a quick draft PR in #5662 and left some comments so that we can discuss pros and cons of different designs in several places.

@terrytangyuan
Copy link
Member Author

Addressed in #12305

@agilgur5 agilgur5 changed the title [Proposal] Automatically suspend a CronWorkflow for repeated errors in its child workflows Proposal: Automatically suspend CronWorkflow for repeated errors in child workflows Sep 29, 2024
@agilgur5
Copy link
Member

agilgur5 commented Sep 29, 2024

Addressed in #12305

Automatically suspend

@terrytangyuan per #12696 (comment), #12305's stopStrategy completely stops a CronWorkflow, it cannot be resumed after. The term suspend in Argo refers to a temporary pause that can be manually resumed again.

So it seems like this is not entirely addressed in #12305, a separate suspendStrategy would be needed

@Joibel
Copy link
Member

Joibel commented Sep 30, 2024

This feels like just a difference in the words used, not a difference in function.

You could patch the phase away from Stopped and remove the label and it would be resumed. I'm slightly unsure why we have both a label and phase in hindsight now - I feels like we don't need phase.

We don't have a CLI command to patch back to Active, which would be a useful convenience.

@agilgur5
Copy link
Member

No the functional difference was very intentional per #12696 (comment), as linked above. I literally asked about the difference and suggested a rename initially.

You could patch

If you put it back to Active, it would stop again if the history were not modified as well.

We don't have a CLI command to patch back to Active, which would be a useful convenience.

The UI also (intentionally) treats it differently

@agilgur5
Copy link
Member

Re-opening per my above comments, a suspendStrategy would be needed for this and #7291

@agilgur5 agilgur5 reopened this Oct 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants