
Pod continues running after Workflow is stopped #10658

Closed · 2 of 3 tasks · rajaie-sg opened this issue Mar 9, 2023 · 2 comments · Fixed by #11582

rajaie-sg commented Mar 9, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

We have cluster-autoscaler installed on our Kubernetes cluster to automatically scale up nodes when none are available.

If I start a Workflow which runs a Pod and then stop that Workflow before the Pod starts running (i.e. while the Pod is Pending, waiting to be scheduled onto a node), the Workflow stops fine, but the Pod is still scheduled later and runs until all of its containers complete.

It takes a few minutes for the autoscaler to bring up a new node, so if someone triggers a workflow and then aborts it during the window before the node scales up and the pod is scheduled, the pod can end up in a zombie state: running even though the Workflow has already been stopped.
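For illustration, the rough sequence looks like this (the workflow file and name are placeholders; `argo` and `kubectl` are the standard CLIs):

```bash
# Submit any workflow whose pod cannot be scheduled yet (no free node).
argo submit -n argo my-workflow.yaml   # my-workflow.yaml is a placeholder

# While the pod is still Pending, stop the workflow.
argo stop -n argo <workflow-name>

# The Workflow is marked Failed, but once cluster-autoscaler adds a node
# the pod is scheduled anyway and runs until its containers complete.
kubectl get pods -n argo -w
```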

Version

3.4.3

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Any workflow that triggers a Pod should work. I triggered a workflow named `postman-test-5lrl7`.
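A minimal sketch of such a workflow (the name, image, sleep duration, and CPU request below are illustrative, chosen only so the pod stays Pending until the autoscaler adds a node) might be:

```bash
# Hypothetical minimal reproduction: a single-container workflow with a large
# CPU request so the pod stays Pending until cluster-autoscaler adds a node.
kubectl create -n argo -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: pending-stop-repro-
spec:
  entrypoint: main
  templates:
    - name: main
      container:
        image: alpine:3.18
        command: [sh, -c, "sleep 300"]
        resources:
          requests:
            cpu: "8"   # illustrative: large enough that no existing node fits it
EOF
```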

Logs from the workflow controller

time="2023-03-09T19:08:26.671Z" level=info msg="Processing workflow" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.674Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.674Z" level=info msg="node unchanged" namespace=argo nodeID=postman-test-5lrl7-2114579652 workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.674Z" level=info msg="Terminating pod as part of workflow shutdown" namespace=argo podName=postman-test-5lrl7-entrypoint-2114579652 shutdownStrategy=Stop workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.674Z" level=info msg="node postman-test-5lrl7-2114579652 phase Pending -> Failed" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.674Z" level=info msg="node postman-test-5lrl7-2114579652 message: workflow shutdown with strategy:  Stop" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.674Z" level=info msg="node postman-test-5lrl7-2114579652 finished: 2023-03-09 19:08:26.6743069 +0000 UTC" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.674Z" level=info msg="node postman-test-5lrl7-2054235633 phase Pending -> Failed" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.674Z" level=info msg="node postman-test-5lrl7-2054235633 message: workflow shutdown with strategy:  Stop" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.674Z" level=info msg="node postman-test-5lrl7-2054235633 finished: 2023-03-09 19:08:26.674350294 +0000 UTC" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.675Z" level=info msg="Stopped with strategy 'Stop'" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.675Z" level=info msg="node postman-test-5lrl7 phase Running -> Failed" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.675Z" level=info msg="node postman-test-5lrl7 message: Stopped with strategy 'Stop'" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.675Z" level=info msg="node postman-test-5lrl7 finished: 2023-03-09 19:08:26.675819575 +0000 UTC" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.676Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.676Z" level=info msg=reconcileAgentPod namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.676Z" level=info msg="Updated phase Running -> Failed" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.676Z" level=info msg="Updated message  -> Stopped with strategy 'Stop'" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.676Z" level=info msg="Marking workflow completed" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.676Z" level=info msg="Marking workflow as pending archiving" namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.676Z" level=info msg="Checking daemoned children of " namespace=argo workflow=postman-test-5lrl7
time="2023-03-09T19:08:26.676Z" level=info msg="Workflow to be dehydrated" Workflow Size=14619
time="2023-03-09T19:08:26.680Z" level=info msg="cleaning up pod" action=terminateContainers key=argo/postman-test-5lrl7-entrypoint-2114579652/terminateContainers
time="2023-03-09T19:08:26.681Z" level=info msg="cleaning up pod" action=deletePod key=argo/postman-test-5lrl7-1340600742-agent/deletePod
time="2023-03-09T19:08:26.686Z" level=info msg="Delete pods 404"
time="2023-03-09T19:08:26.687Z" level=info msg="Create events 201"
time="2023-03-09T19:08:26.690Z" level=info msg="Update workflows 200"
time="2023-03-09T19:08:26.694Z" level=info msg="Workflow update successful" namespace=argo phase=Failed resourceVersion=93135187 workflow=postman-test-5lrl7

Logs from your workflow's wait container

```
kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
```

JPZ13 commented Mar 16, 2023

Hey @rajaie-sg - can you retest with #10523 in the latest commits? We think it might be resolved by proper system call handling. Let us know the result.
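(For reference, one way to retest against `:latest`, assuming the stock install manifests where the controller runs as a `workflow-controller` Deployment in the `argo` namespace, would be roughly:)

```bash
# Point the controller at the :latest image and restart it
# (deployment/container names assume the default install manifests).
kubectl -n argo set image deployment/workflow-controller \
  workflow-controller=quay.io/argoproj/workflow-controller:latest
kubectl -n argo rollout restart deployment/workflow-controller
```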

JPZ13 added the `problem/more information needed` label on Mar 16, 2023

rajaie-sg commented Mar 17, 2023

Hi @JPZ13 - that PR was already merged when I tested with `:latest`, so I don't think it fixed this issue. That PR seems to be more about Pods that are already running, but in the scenario I described we are stopping the Workflow before the Pod has been scheduled onto a node (it is still in Pending status).

JPZ13 added the `P3` (low priority) label on Mar 30, 2023
agilgur5 added the `area/executor` label and removed the `problem/more information needed` label on Oct 6, 2023