
Pod continues running after Workflow is stopped #10658

Closed · 2 of 3 tasks · rajaie-sg opened this issue Mar 9, 2023 · 2 comments · Fixed by #11582

rajaie-sg commented Mar 9, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

We have cluster-autoscaler installed on our Kubernetes cluster to automatically scale up nodes when none are available.

If I start a Workflow which runs a Pod and then stop that Workflow before the Pod starts running (i.e. while the Pod is Pending, waiting to be scheduled onto a node), the Workflow stops fine, but the Pod is still scheduled later and runs until all of its containers complete.

It takes a few minutes for the autoscaler to bring up a new node, so if someone triggers a workflow and then aborts it during the window before the node scales up and the pod is scheduled, the pod can end up in a zombie state: running even though the Workflow has already been stopped.
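For illustration, the rough sequence looks like this (the workflow file and name are placeholders; `argo` and `kubectl` are the standard CLIs):

```bash
# Submit any workflow whose pod cannot be scheduled yet (no free node).
argo submit -n argo my-workflow.yaml   # my-workflow.yaml is a placeholder

# While the pod is still Pending, stop the workflow.
argo stop -n argo <workflow-name>

# The Workflow is marked Failed, but once cluster-autoscaler adds a node
# the pod is scheduled anyway and runs until its containers complete.
kubectl get pods -n argo -w
```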

Version

3.4.3

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Any workflow that triggers a Pod should work. I triggered a workflow named `postman-test-5lrl7`.
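A minimal sketch of such a workflow (the name, image, sleep duration, and CPU request below are illustrative, chosen only so the pod stays Pending until the autoscaler adds a node) might be:

```bash
# Hypothetical minimal reproduction: a single-container workflow with a large
# CPU request so the pod stays Pending until cluster-autoscaler adds a node.
kubectl create -n argo -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: pending-stop-repro-
spec:
  entrypoint: main
  templates:
    - name: main
      container:
        image: alpine:3.18
        command: [sh, -c, "sleep 300"]
        resources:
          requests:
            cpu: "8"   # illustrative: large enough that no existing node fits it
EOF
```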

Logs from the workflow controller

time="2023-03-09T19:08:26.671Z" level=info msg="Processing workflow" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.674Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.674Z" level=info msg="node unchanged" namespace=argo nodeID=postman-test-5lrl7-2114579652 workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.674Z" level=info msg="Terminating pod as part of workflow shutdown" namespace=argo podName=postman-test-5lrl7-entrypoint-2114579652 shutdownStrategy=Stop workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.674Z" level=info msg="node postman-test-5lrl7-2114579652 phase Pending -> Failed" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.674Z" level=info msg="node postman-test-5lrl7-2114579652 message: workflow shutdown with strategy:  Stop" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.674Z" level=info msg="node postman-test-5lrl7-2114579652 finished: 2023-03-09 19:08:26.6743069 +0000 UTC" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.674Z" level=info msg="node postman-test-5lrl7-2054235633 phase Pending -> Failed" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.674Z" level=info msg="node postman-test-5lrl7-2054235633 message: workflow shutdown with strategy:  Stop" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.674Z" level=info msg="node postman-test-5lrl7-2054235633 finished: 2023-03-09 19:08:26.674350294 +0000 UTC" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.675Z" level=info msg="Stopped with strategy 'Stop'" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.675Z" level=info msg="node postman-test-5lrl7 phase Running -> Failed" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.675Z" level=info msg="node postman-test-5lrl7 message: Stopped with strategy 'Stop'" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.675Z" level=info msg="node postman-test-5lrl7 finished: 2023-03-09 19:08:26.675819575 +0000 UTC" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.676Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.676Z" level=info msg=reconcileAgentPod namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.676Z" level=info msg="Updated phase Running -> Failed" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.676Z" level=info msg="Updated message  -> Stopped with strategy 'Stop'" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.676Z" level=info msg="Marking workflow completed" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.676Z" level=info msg="Marking workflow as pending archiving" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.676Z" level=info msg="Checking daemoned children of " namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.676Z" level=info msg="Workflow to be dehydrated" Workflow Size=14619
time="2023-03-09T19:08:26.680Z" level=info msg="cleaning up pod" action=terminateContainers key=argo/postman-test-5lrl7-entrypoint-2114579652/terminateContainers
time="2023-03-09T19:08:26.681Z" level=info msg="cleaning up pod" action=deletePod key=argo/postman-test-5lrl7-1340600742-agent/deletePod
time="2023-03-09T19:08:26.686Z" level=info msg="Delete pods 404"
time="2023-03-09T19:08:26.687Z" level=info msg="Create events 201"
time="2023-03-09T19:08:26.690Z" level=info msg="Update workflows 200"
time="2023-03-09T19:08:26.694Z" level=info msg="Workflow update successful" namespace=argo phase=Failed resourceVersion=93135187 workflow=postman-test-5lrl7

Logs from your workflow's wait container

```
kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
```

JPZ13 commented Mar 16, 2023

Hey @rajaie-sg - can you retest with #10523 in the latest commits? We think it might be resolved by proper system call handling. Let us know the result.
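(For reference, one way to retest against `:latest`, assuming the stock install manifests where the controller runs as a `workflow-controller` Deployment in the `argo` namespace, would be roughly:)

```bash
# Point the controller at the :latest image and restart it
# (deployment/container names assume the default install manifests).
kubectl -n argo set image deployment/workflow-controller \
  workflow-controller=quay.io/argoproj/workflow-controller:latest
kubectl -n argo rollout restart deployment/workflow-controller
```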

JPZ13 added the `problem/more information needed` label on Mar 16, 2023

rajaie-sg commented Mar 17, 2023

Hi @JPZ13 - that PR was already merged when I tested with `:latest`, so I don't think it fixed this issue. That PR seems to be more about Pods that are already running, but in the scenario I described we are stopping the Workflow before the Pod has been scheduled onto a node (it is still in Pending status).

JPZ13 added the `P3` (low priority) label on Mar 30, 2023
agilgur5 added the `area/executor` label and removed the `problem/more information needed` label on Oct 6, 2023