Expand docs for aborting workflows #12800
disambiguation
That is exactly why I did clearly disambiguate these in the CLI docs in #11624, #11625, and #11626. Anecdotally, I haven't seen the same level of confusion since those improvements; I actually haven't really needed to refer to them since I wrote them.
This was the topic of #11624 / #11511.
These are also quite different.
Effectively, each of these is just shorthand for the other.
The latter two are identical.

low-level details

For a large chunk of the rest, those are pretty low-level questions. They're not bad questions, but they are not the most pressing questions for most users. This is actually my first time seeing a lot of these questions. Several of your questions are also combinations of features. For the most part, features are made to work independently of each other. Some of them are also implementation details, and some of those are k8s implementation details not specific to Argo. They may be good to know, but some may be more suitable as plain API docs (similar to the Fields Reference) and others are out of scope. Others are pretty nuanced race conditions, and I don't know if I've ever seen detailed docs about what race conditions to expect when your code is forcibly stopped into an error / cancelled state. It's an uncommon scenario, and races can result in indeterminate behavior in most programming languages. Related, writing idempotent workflows is not specific to Argo but is a general best practice, so that you can reconstruct state whenever needed.

independent features
ArtifactGC is a more recent feature with its own docs.
Similarly, this is the topic of PodGC.
The Workflow Archive just takes a completed Workflow resource and puts it into a DB. There are of course temporary race conditions when a Workflow is in both (that is unavoidable); the one in-cluster is always preferred to the one in the DB in those cases.

Executor dependent
Yea, these do not have many docs on them, although it largely follows k8s, which largely follows Unix. Note that I also did not need this knowledge until quite literally earlier this week, when I was responding to some quite nuanced sidecar and executor issues/bugs and did an hours-long deep dive. These are also heavily dependent on your choice of Executor. Nowadays it's all Emissary, but Executors have gone through some iterations and will likely go through more. Some of these iterations are also due to k8s itself evolving its runtime and security model, etc. Emissary is probably the least intrusive Executor so far.
Also note that Argo does not know what containers you're running or how they work.

That's about all I got in me for now. Note that both your questions and some of my limited answers here are well over a page in size, all without a single Workflow manifest as an example -- that is a very significant amount of detail. That would also suggest that a single docs page would likely not be sufficient. I would encourage splitting up some of this into more digestible pieces. That would also help with organizing any docs that would come out of this.
This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.
This issue has been closed due to inactivity and lack of information. If you still encounter this issue, please add the requested information and re-open.
Docs request
I'd like a page prominently added to the user-guide that explains in detail the process for aborting workflows.
(The page should also cover salvaging interrupted workflows, re-running workflows from scratch, and cleaning up after workflows regardless of how they ended.)
Use-case
It is commonplace that, while developing a large workflow, a user mistakenly spawns an expensive processing job that does not function as intended, realises their error, and wants to urgently relinquish all the compute resources. Next they will want to restore the environment to a clean state from which the job can be rerun successfully, or alternatively they may want to salvage any check-pointed work.
There are various relevant commands, e.g. `stop`, `terminate` and `suspend` (also `resubmit`, `resume` and `retry`). The docs should give advice for choosing between these commands, understanding the consequences, and cleaning up afterward.

The guidance should also support users to develop workflows (and container images) that are robust, that shut down gracefully, that save progress at checkpoints, and that do not need fiddly clean-up. It should also give users more confidence to kick off large workflows, knowing how to monitor them and reliably abort in case of any problem.
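For context, a rough sketch of the commands as I currently understand them (`my-wf` is a hypothetical Workflow name; the one-line summaries are my reading of the CLI help, not authoritative docs):

```bash
argo suspend my-wf     # pause scheduling of new steps; already-running pods keep running
argo resume my-wf      # undo a suspend, continuing where the workflow left off
argo stop my-wf        # stop the workflow, but still run exit handlers
argo terminate my-wf   # stop the workflow without running exit handlers
argo retry my-wf       # restart a failed workflow, keeping previously successful steps
argo resubmit my-wf    # submit a brand-new run based on the old workflow
argo delete my-wf      # remove the Workflow resource from the cluster
```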
Currently the docs for the CLI commands are very terse, and several users have asked for clarification in issue discussions (e.g. #4454, #11511, #2742).
Outline of proposed content
Explain what happens to already-running subtasks (e.g. what signals the controller will cause to be sent to running pods, and what governs this sequence). Will the existing pods get interrupted immediately or will they run until a checkpoint?
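One way to observe this directly rather than guess (hedged: the label below is the one I see the controller apply to workflow pods, and `my-wf` is hypothetical):

```bash
# Watch the workflow's pods in one terminal:
kubectl get pods -l workflows.argoproj.io/workflow=my-wf --watch

# ...then abort from another terminal:
argo stop my-wf
# The pods move to Terminating: Kubernetes delivers SIGTERM first, and
# SIGKILL only after the pod's terminationGracePeriodSeconds (default 30s) elapses.
```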
Explain what happens to stored inputs, outputs, artifacts, etc. In which cases are they retained vs expunged? If retained, are there any issues to be aware of for picking up again where left off? What steps are needed to force purging of stored artifacts etc? By default will this partially generated data be retained indefinitely, or expire after some period? Are there any issues to ensure consecutive runs of a workflow cannot interfere with each other (or conversely, to enable them to leverage previously-generated data)?
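My hedged understanding is that retention is governed by several independent knobs; a complete-but-trivial manifest showing the ones I believe apply, with field names taken from the PodGC / ArtifactGC / ttlStrategy docs:

```bash
kubectl create --dry-run=client -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retention-demo-
spec:
  entrypoint: main
  podGC:
    strategy: OnWorkflowCompletion   # delete pods once the workflow finishes
  artifactGC:
    strategy: OnWorkflowDeletion     # delete stored artifacts when the Workflow is deleted
  ttlStrategy:
    secondsAfterCompletion: 86400    # controller deletes the Workflow itself a day after completion
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [echo, hello]
EOF
```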
Explain the circumstances where an aborted workflow can be salvaged (either fully salvaging the progress and completing the workflow, or just salvaging some intermediate data).
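Relatedly, a sketch of the salvage path as I currently understand it (the flags exist in recent CLIs, but treat the semantics as unverified; `my-wf` and `flaky-step` are hypothetical):

```bash
# `argo retry` reuses the same Workflow object; by default only nodes
# that did not succeed are rerun:
argo retry my-wf

# Selected previously-successful nodes can reportedly also be rerun:
argo retry my-wf --restart-successful \
  --node-field-selector templateName=flaky-step
```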
Explain where the archive fits in. (Do different methods of aborting and replacing a workflow alter the archive lifecycle? Can the archive also be used to recover data from, and to resume, previously forcefully aborted runs?)
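For reference, the archive has its own CLI surface (this assumes the Workflow Archive is enabled server-side; archived workflows are addressed by UID rather than name):

```bash
argo archive list
argo archive get <uid>
argo archive delete <uid>
```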
Disambiguate all related argo commands:

- `stop` / `terminate` (differs only in whether the shutdown handlers specified in the workflow are invoked)
- `terminate` / `delete` / `kubectl delete` (is there any difference for still-running pods, artefacts, etc.?)
- `stop` / `suspend` (??)
- `resubmit` / re-`submit` / re-`kubectl apply` (differs only in parameter inheritance?)
- `retry` / `resubmit` (differs in whether steps that previously completed successfully are rerun)
- `retry` / `resume` (??)

Generally, impart context illuminating the workflow lifecycle and the interacting components that govern it (see the sketch after this list). For example, are these CLI commands just updating status metadata fields in the k8s workflow resource? What actions will the workflow controller take in response? Will it delete the worker pods, leaving it up to other k8s control components to manage a gracefully staged shutdown? How does the workflow controller ascertain whether a subtask was finished? Is an abort liable to disrupt the capture and preservation of intermediate outputs (e.g. as artifacts) and the invocation of workflow-specified handlers (e.g. is this implemented with sidecar containers, and how do they tolerate shutdown signals)? Should include mention of cluster autoscaling delays, which may limit how rapidly the physical compute resources can be relinquished (e.g. back to a cloud provider) when a user wishes to abort their run.
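For what it's worth, my reading of the mechanism (worth confirming in the docs) is that the CLI verbs mostly patch the Workflow spec and leave the reaction to the controller:

```bash
# Roughly what I believe `argo stop` does under the hood:
kubectl patch workflow my-wf --type merge -p '{"spec":{"shutdown":"Stop"}}'

# ...and `argo terminate`:
kubectl patch workflow my-wf --type merge -p '{"spec":{"shutdown":"Terminate"}}'

# ...and `argo suspend`:
kubectl patch workflow my-wf --type merge -p '{"spec":{"suspend":true}}'
```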
This should also amount to guidance on how to package container images for Argo Workflows, in order to ensure the worker processes actually shut down gracefully (listening for the expected signals and responding within the expected timeframes) and do whatever is necessary to preserve intermediate work to the maximum possible extent, while discouraging any checkpointing model that would be incongruent with the Argo Workflows lifecycle model.
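To make the packaging guidance concrete, a minimal entrypoint sketch (everything here is illustrative: `save_checkpoint` is a hypothetical hook, and the exit code just follows the 128+signal convention):

```bash
#!/bin/sh
# Sketch of a container entrypoint that cooperates with an abort:
# Kubernetes sends SIGTERM first, then SIGKILL after the grace period.

save_checkpoint() {
  # Hypothetical: flush partial results to wherever the template's
  # output artifacts are read from, so an abort preserves progress.
  echo "flushing partial state to /tmp/outputs ..."
}

trap 'save_checkpoint; exit 143' TERM   # 143 = 128 + 15 (SIGTERM)

while :; do
  sleep 5 &    # stand-in for one small unit of work
  wait $!      # `wait` is interruptible, so the TERM trap runs promptly
done
```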
Message from the maintainers:
Love this enhancement proposal? Give it a 👍. We prioritize the proposals with the most 👍.