Workflow marked as finished before template level onExit hooks finished when stop workflow #11880

Closed
2 of 3 tasks
toyamagu-2021 opened this issue Sep 24, 2023 · 5 comments · Fixed by #12436
Labels
area/exit-handler, area/hooks, area/shutdown (Shutdown Strategy: Stop and Terminate), P3 (Low priority), type/bug

Comments

@toyamagu-2021
Member

toyamagu-2021 commented Sep 24, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

Condition "to be done" met after 9s
Checking expectation stop-terminate-wpvxk
stop-terminate-wpvxk : Failed Stopped with strategy 'Stop'
    signals_test.go:44: 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/signals_test.go:44
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:68
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:43
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/signals_test.go:36
        	Error:      	Not equal: 
        	            	expected: "Succeeded"
        	            	actual  : "Running"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,2 +1,2 @@
        	            	-(v1alpha1.NodePhase) (len=9) "Succeeded"
        	            	+(v1alpha1.NodePhase) (len=7) "Running"
        	            	 
        	Test:       	TestSignalsSuite/TestStopBehavior
=== FAIL: SignalsSuite/TestStopBehavior

Version

latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: stop-terminate-
spec:
  entrypoint: main
  onExit: exit
  templates:
    - name: main
      dag:
        tasks:
          - name: A
            template: echo
            onExit: exit-template

    - name: echo
      container:
        image: argoproj/argosay:v1
        command: [ sleep ]
        args: [ "999" ]

    - name: exit
      container:
        image: argoproj/argosay:v1
        command: [ sleep ]
        args: [ "10" ]

    - name: exit-template
      container:
        image: argoproj/argosay:v1
        command: [ sleep ]
        args: [ "20" ]

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
@toyamagu-2021 toyamagu-2021 changed the title from "Workflow marked as finished before tempalte level onExit hooks fisnished" to "Workflow marked as finished before tempalte level onExit hooks fisnished when stop workflow" Sep 24, 2023
@agilgur5
Member

agilgur5 commented Sep 24, 2023

Wow, nice find and good hypothesis! I didn't realize there were two hooks on that Workflow, but that could make a lot of sense if that is the root cause. It's a fairly edge-case edge case.

If the controller has marked it as stopped and moved on, then the NodeStatus never gets updated 🤔

@toyamagu-2021
Member Author

toyamagu-2021 commented Sep 24, 2023

Yeah. Truly an edge case, which might not be a problem for our users, but we have suffered from the flakiness it causes :)

For workflow-level hooks, the controller waits for the onExit NodeStatus:

But it does not seem to wait for the template-level hooks of a DAG?:

@skytt

skytt commented Dec 20, 2023

Same question here. A stopped workflow was archived while its DAG onExit hook was still running, and the archived workflow cannot be retried because of the running exit hook. Is there a fix in progress?
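
One way to spot the state described above is to scan a workflow object for nodes that have not finished even though the workflow itself has reached a terminal phase. The following is a minimal sketch, not code from the Argo codebase. It assumes the workflow was dumped as JSON (for example with `kubectl get workflow <name> -o json`) and reads only the `status.phase` and `status.nodes[*].phase` fields. The helper name `unfinished_nodes`, the node IDs, and the sample data are hypothetical.

```python
import json

# Phases in which a node or workflow is considered finished.
TERMINAL_PHASES = {"Succeeded", "Failed", "Error", "Skipped", "Omitted"}


def unfinished_nodes(workflow: dict) -> list:
    """Return names of nodes that are not finished although the workflow is.

    Returns an empty list if the workflow itself has not finished yet.
    """
    status = workflow.get("status", {})
    if status.get("phase") not in TERMINAL_PHASES:
        return []
    nodes = status.get("nodes") or {}
    return sorted(
        node.get("name", node_id)
        for node_id, node in nodes.items()
        if node.get("phase") not in TERMINAL_PHASES
    )


# Hypothetical, trimmed-down workflow object resembling the report above:
# the workflow is finished, but the template-level exit hook is still Running.
sample = json.loads("""
{
  "status": {
    "phase": "Failed",
    "nodes": {
      "n1": {"name": "stop-terminate-x", "phase": "Failed"},
      "n2": {"name": "stop-terminate-x.A", "phase": "Failed"},
      "n3": {"name": "stop-terminate-x.A.onExit", "phase": "Running"},
      "n4": {"name": "stop-terminate-x.onExit", "phase": "Succeeded"}
    }
  }
}
""")

print(unfinished_nodes(sample))  # ['stop-terminate-x.A.onExit']
```

A non-empty result for a finished workflow would point at the inconsistency discussed in this issue.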

@toyamagu-2021
Member Author

toyamagu-2021 commented Dec 20, 2023

I'm afraid there is no fix in progress at the moment.

@toyamagu-2021
Member Author

Thanks for sharing your use-case.
Can we get around this by checking NodeStatuses recursively? I might know the related logic, so I'll look into it.
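
To illustrate the recursive idea only (the controller is written in Go, and this is not its implementation), the sketch below walks a node's `children` and reports whether the node and every descendant have reached a terminal phase. The `nodes` mapping mirrors the shape of `status.nodes` in a workflow object. The function name `subtree_fulfilled`, the node IDs, and the assumption that the exit-hook node is recorded as a child of task `A` are all hypothetical.

```python
from typing import Optional

# Phases in which a node is considered finished.
TERMINAL_PHASES = {"Succeeded", "Failed", "Error", "Skipped", "Omitted"}


def subtree_fulfilled(nodes: dict, node_id: str, seen: Optional[set] = None) -> bool:
    """Return True only if node_id and all of its descendants are finished."""
    seen = set() if seen is None else seen
    if node_id in seen:
        # DAG nodes can be reachable through several parents; visit each once.
        # A node that failed the check earlier has already made the result False.
        return True
    seen.add(node_id)
    node = nodes.get(node_id)
    # Assumption: a child that is referenced but missing counts as unfinished.
    if node is None or node.get("phase") not in TERMINAL_PHASES:
        return False
    children = node.get("children") or []
    return all(subtree_fulfilled(nodes, child, seen) for child in children)


# Hypothetical node map: task A has finished, but its exit hook is still running.
nodes = {
    "root": {"phase": "Failed", "children": ["A"]},
    "A": {"phase": "Failed", "children": ["A.onExit"]},
    "A.onExit": {"phase": "Running"},
}

print(subtree_fulfilled(nodes, "root"))  # False
```

Under this model, a workflow would only be treated as finished once `subtree_fulfilled` returns `True` for its root node, so a running template-level hook keeps the workflow open.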

hittingray pushed a commit to atlassian-forks/argo-workflows that referenced this issue Jan 3, 2024
@agilgur5 agilgur5 changed the title from "Workflow marked as finished before tempalte level onExit hooks fisnished when stop workflow" to "Workflow marked as finished before template level onExit hooks finished when stop workflow" Mar 9, 2024
isubasinghe pushed a commit to isubasinghe/argo-workflows that referenced this issue May 6, 2024
…11880) (argoproj#12436)

Signed-off-by: Isitha Subasinghe <isubasinghe@student.unimelb.edu.au>
@agilgur5 agilgur5 added the area/shutdown Shutdown Strategy: Stop and Terminate label Oct 7, 2024
3 participants