Template with retry and exit hook stuck in Pending #13239

Open
3 of 4 tasks
McKMarcBruchner opened this issue Jun 24, 2024 · 11 comments
Labels
area/exit-handler · area/hooks · area/retryStrategy · P1 · type/bug · type/regression

Comments

@McKMarcBruchner

McKMarcBruchner commented Jun 24, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

Hi argo team,

I have two problems with the exact same workflow template, depending on the version I'm using.

In my workflow, I have a DAG task which has both a retry strategy (the limit and policy don't seem to matter; it doesn't work with any setting) and a simple exit lifecycle hook. After that step, I have an additional step which should not be executed if the retry fails (either because the limit is reached or because the error does not match the retryPolicy).

Here are my issues:

  1. Versions up to v3.4.17 (the latest v3.4.x version), or up to v3.5.4:
    The workflow runs the task with the retry, which fails (intentionally, in my example workflow), and then either retries or not depending on the retryPolicy and the error type.
    Afterwards the hook runs as expected, but then the workflow seems to want to continue, with the next step taking the exit hook step as its predecessor and thus being skipped. The whole workflow is then marked as succeeded, while the retry step clearly failed. This is very unexpected behaviour to me!
    [Screenshot: 2024-06-24 at 08 45 21]
    [Screenshot: 2024-06-24 at 08 46 21]
    I would expect the workflow to fail directly after the exit hook of the failing step has been executed, as one of its steps has failed.

  2. Versions v3.5.5 up to v3.5.8 (currently the latest version):
    I am not entirely sure, but I can imagine that this retry fix from you in v3.5.5 (fix: retry node with expression status Running -> Pending #12637) might have fixed the issue above, but now I run into a new problem: the exit hook step is stuck in Pending, while I can clearly see in the Kubernetes cluster that the pod completed almost instantly.
    [Screenshot: 2024-06-24 at 09 30 02]
    The logs attached further down, from the workflow controller and from the wait container of the exit hook pod (which show that the pod was executed and completed), display this error. The workflow controller keeps repeating the last 4 lines in a loop until I terminate or stop the workflow.

I kindly ask you to test both of the issues above; maybe you have already fixed the first one, but then the second one still blocks me.

If you need further details, I'll be happy to share them.

Version

v3.4.17, v3.5.4, v3.5.5, v3.5.8

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

metadata:
  name: argo-retry-hook-issue
  namespace: default
spec:
  templates:
    - name: start
      dag:
        tasks:
          - name: failing-retry-step
            template: failing-retry-step
            hooks:
              exit:
                template: exit-task
          - name: additional-step
            template: additional-step
            dependencies:
              - failing-retry-step
    - name: failing-retry-step
      container:
        name: main
        image: argoproj/argosay:v2
        command:
          - /argosay
        args:
          - exit 1
      retryStrategy:
        limit: '1'
        retryPolicy: OnError
    - name: additional-step
      container:
        name: main
        image: argoproj/argosay:v2
        command:
          - /argosay
        args:
          - echo
          - Hello second step
    - name: exit-task
      container:
        name: main
        image: argoproj/argosay:v2
        command:
          - /argosay
        args:
          - echo
          - Hello exit task
  entrypoint: start

Logs from the workflow controller

time="2024-06-24T07:37:33.899Z" level=info msg="Processing workflow" Phase= ResourceVersion=3497948 namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:33.908Z" level=warning msg="Non-transient error: configmaps \"artifact-repositories\" not found"
time="2024-06-24T07:37:33.908Z" level=info msg="resolved artifact repository" artifactRepositoryRef=default-artifact-repository
time="2024-06-24T07:37:33.908Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:33.908Z" level=info msg="Updated phase  -> Running" namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:33.911Z" level=info msg="Created PDB resource for workflow." namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:33.911Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:33.911Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:33.911Z" level=info msg="DAG node argo-retry-hook-issue-fpq2p initialized Running" namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:33.911Z" level=warning msg="was unable to obtain the node for argo-retry-hook-issue-fpq2p-2456429657, taskName additional-step"
time="2024-06-24T07:37:33.911Z" level=warning msg="was unable to obtain the node for argo-retry-hook-issue-fpq2p-3758027749, taskName failing-retry-step"
time="2024-06-24T07:37:33.911Z" level=warning msg="was unable to obtain the node for argo-retry-hook-issue-fpq2p-3758027749, taskName failing-retry-step"
time="2024-06-24T07:37:33.911Z" level=info msg="All of node argo-retry-hook-issue-fpq2p.failing-retry-step dependencies ] completed" namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:33.911Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:33.911Z" level=info msg="Retry node argo-retry-hook-issue-fpq2p-3758027749 initialized Running" namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:33.911Z" level=info msg="Pod node argo-retry-hook-issue-fpq2p-2598440388 initialized Pending" namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:33.917Z" level=info msg="Created pod: argo-retry-hook-issue-fpq2p.failing-retry-step(0) (argo-retry-hook-issue-fpq2p-failing-retry-step-2598440388)" namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:33.917Z" level=warning msg="was unable to obtain the node for argo-retry-hook-issue-fpq2p-2456429657, taskName additional-step"
time="2024-06-24T07:37:33.917Z" level=warning msg="was unable to obtain the node for argo-retry-hook-issue-fpq2p-2456429657, taskName additional-step"
time="2024-06-24T07:37:33.917Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:33.917Z" level=info msg=reconcileAgentPod namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:33.918Z" level=info msg="Workflow to be dehydrated" Workflow Size=3177
time="2024-06-24T07:37:33.922Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=3497957 workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:43.919Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=3497957 namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:43.919Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:43.920Z" level=info msg="Pod failed: Error (exit code 1)" displayName="failing-retry-step(0)" namespace=default pod=argo-retry-hook-issue-fpq2p-failing-retry-step-2598440388 templateName=failing-retry-step workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:43.920Z" level=info msg="node changed" namespace=default new.message="Error (exit code 1)" new.phase=Failed new.progress=0/1 nodeID=argo-retry-hook-issue-fpq2p-2598440388 old.message= old.phase=Pending old.progress=0/1 workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:43.920Z" level=warning msg="was unable to obtain the node for argo-retry-hook-issue-fpq2p-2456429657, taskName additional-step"
time="2024-06-24T07:37:43.920Z" level=info msg="Retry Policy: OnError (onFailed: false, onError true)" namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:43.920Z" level=info msg="Node not set to be retried after status: Failed" namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:43.920Z" level=info msg="node argo-retry-hook-issue-fpq2p-3758027749 phase Running -> Failed" namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:43.920Z" level=info msg="node argo-retry-hook-issue-fpq2p-3758027749 message: Error (exit code 1)" namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:43.920Z" level=info msg="node argo-retry-hook-issue-fpq2p-3758027749 finished: 2024-06-24 07:37:43.920766667 +0000 UTC" namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:43.920Z" level=info msg="Running OnExit handler" lifeCycleHook="&LifecycleHook{Template:exit-task,Arguments:Arguments{Parameters:]Parameter{},Artifacts:]Artifact{},},TemplateRef:nil,Expression:,}" namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:43.921Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:43.921Z" level=info msg="Pod node argo-retry-hook-issue-fpq2p-1636730028 initialized Pending" namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:43.926Z" level=info msg="Created pod: argo-retry-hook-issue-fpq2p.failing-retry-step.onExit (argo-retry-hook-issue-fpq2p-exit-task-1636730028)" namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:43.926Z" level=warning msg="was unable to obtain the node for argo-retry-hook-issue-fpq2p-2456429657, taskName additional-step"
time="2024-06-24T07:37:43.926Z" level=warning msg="was unable to obtain the node for argo-retry-hook-issue-fpq2p-2456429657, taskName additional-step"
time="2024-06-24T07:37:43.926Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:43.926Z" level=info msg=reconcileAgentPod namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:43.926Z" level=info msg="Workflow to be dehydrated" Workflow Size=3921
time="2024-06-24T07:37:43.931Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=3497998 workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:43.936Z" level=info msg="cleaning up pod" action=labelPodCompleted key=default/argo-retry-hook-issue-fpq2p-failing-retry-step-2598440388/labelPodCompleted
time="2024-06-24T07:37:53.928Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=3497998 namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:53.928Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:53.928Z" level=info msg="node changed" namespace=default new.message= new.phase=Succeeded new.progress=0/1 nodeID=argo-retry-hook-issue-fpq2p-1636730028 old.message= old.phase=Pending old.progress=0/1 workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:37:53.928Z" level=info msg="pod reconciliation didn't complete, will retry" namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:38:03.930Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=3497998 namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:38:03.930Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:38:03.931Z" level=info msg="node changed" namespace=default new.message= new.phase=Succeeded new.progress=0/1 nodeID=argo-retry-hook-issue-fpq2p-1636730028 old.message= old.phase=Pending old.progress=0/1 workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:38:03.931Z" level=info msg="pod reconciliation didn't complete, will retry" namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:38:13.932Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=3497998 namespace=default workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:38:13.932Z" level=info msg="Task-result reconciliation" namespace=default numObjs=0 workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:38:13.932Z" level=info msg="node changed" namespace=default new.message= new.phase=Succeeded new.progress=0/1 nodeID=argo-retry-hook-issue-fpq2p-1636730028 old.message= old.phase=Pending old.progress=0/1 workflow=argo-retry-hook-issue-fpq2p
time="2024-06-24T07:38:13.932Z" level=info msg="pod reconciliation didn't complete, will retry" namespace=default workflow=argo-retry-hook-issue-fpq2p

Logs from your workflow's wait container

time="2024-06-24T08:04:27.184Z" level=info msg="Starting Workflow Executor" version=v3.4.8
time="2024-06-24T08:04:27.185Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-06-24T08:04:27.185Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=default podName=argo-retry-hook-issue-nc5ts-exit-task-2136563012 template="{\"name\":\"exit-task\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"container\":{\"name\":\"main\",\"image\":\"argoproj/argosay:v2\",\"command\":[\"/argosay\"],\"args\":[\"echo\",\"Hello exit task\"],\"resources\":{}}}" version="&Version{Version:v3.4.8,BuildDate:2023-05-25T22:21:53Z,GitCommit:9e27baee4b3be78bb662ffa5e3a06f8a6c28fb53,GitTag:v3.4.8,GitTreeState:clean,GoVersion:go1.20.4,Compiler:gc,Platform:linux/arm64,}"
time="2024-06-24T08:04:27.185Z" level=info msg="Starting deadline monitor"
time="2024-06-24T08:04:29.187Z" level=info msg="Main container completed" error="<nil>"
time="2024-06-24T08:04:29.187Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2024-06-24T08:04:29.187Z" level=info msg="No output parameters"
time="2024-06-24T08:04:29.187Z" level=info msg="No output artifacts"
time="2024-06-24T08:04:29.187Z" level=info msg="Alloc=8539 TotalAlloc=15961 Sys=24173 NumGC=4 Goroutines=7"
time="2024-06-24T08:04:29.187Z" level=info msg="Deadline monitor stopped"
Stream closed EOF for default/argo-retry-hook-issue-nc5ts-exit-task-2136563012 (wait)
@agilgur5
Member

agilgur5 commented Jun 24, 2024

  1. [...] I would expect the workflow to fail directly after the exit hook of the failing step has been executed, as one of its steps has failed.

This can certainly be confusing, but it is expected behavior -- the final node's exit code determines the success or failure of the overall Workflow. For example, think about the opposite use-case, "I want my Workflow to succeed if my finally logic succeeded" -- a user could then say that your use-case would be unexpected behavior as well. I'm fairly sure that other workflow tools, like GH Actions for instance, behave in the same way; the default is to behave like a finally or a single program -- the final exit code is what determines its overall success or failure.

As such, it is actually currently not possible to otherwise determine (without relying on the exit code) whether you want a successful final node to result in a failure -- you would need to be able to indicate your high-level intent somewhere. See #12530 for a feature request to add something to the spec to indicate this behavior separately from an exit code.

The user-land workaround there is to modify your exit code appropriately to match your intent.

Afterwards the hook runs as expected, but then the workflow seems to want to continue, the next step taking the exit hook step as its predecessor, thus skipping.

The visualization might not be the most intuitive in this case. There's a switch in the graph UI (the lightning bolt IIRC) for a slower algorithm that might(?) be more intuitive in this case as it could group nodes differently.

Without access to your completed Workflow's status, it's a bit hard to tell, but I would think the additional-step is actually Omitted rather than Skipped; it doesn't run since its dependent task failed.
I'm not sure if it showing up as subsequent is an artifact of the display algorithm or because the exit handler might be considered grouped with its parent step -- I don't remember how the Controller handles that off the top of my head, actually.

2. [...] but now I run into a new problem: the exit hook step is stuck in Pending, while I can clearly see in the Kubernetes cluster that the pod completed almost instantly.

Yeah, that certainly sounds like a bug, and a regression given that it actually worked as expected before. It likely stems from #12402's change in handling task results (we've had a few of those -- fixed one race condition, created more new race conditions 🙃). I'm not sure which of the changes in 3.5.5 would cause that though from a quick glance, especially as you appear to not be using any outputs or artifacts whatsoever. Your configuration would be helpful to see as well, in case you have any workflowDefaults or similar not visible in your provided Workflow spec.
The running Workflow's status would also show anything like that -- it probably has something corrupted in it.

If you need further details, I'll be happy to share them.

Your Controller ConfigMap and the YAML of the running Workflow's status would be helpful, as would any WorkflowTaskResults connected to that Workflow.
Your repro is very simple though (thanks!), so we might be able to produce the latter two ourselves if they are unaffected by the ConfigMap. I haven't tried yet.

@agilgur5 agilgur5 added the type/regression and P1 labels Jun 24, 2024
@agilgur5 agilgur5 changed the title Template with retry and exit hook does not behave as expected on error Template with retry and exit hook stuck in Pending Jun 24, 2024
@McKMarcBruchner
Author

Hi agilgur5!

Thank you very much for your detailed answer!

I think what confused me about part 1 was the behaviour of the workflow when I either remove the retry or the exit hook:

  • the same workflow without the exit hook but with the retry: the main workflow fails as soon as the retry fails
    [Screenshot: 2024-06-25 at 09 21 33]
  • the same workflow without the retry but with the exit hook: the main workflow fails as soon as the exit hook has finished (successfully). This is the most confusing part for me, especially as you can see in the graph that the next steps branch off from the actual failing step while the exit hook is just a side job.
    [Screenshot: 2024-06-25 at 09 22 40]
    Whereas when I have both the retry and the exit hook, the next steps continue from the exit hook job as their predecessor.

But I think I understand now that it is intended that the exit hook more or less becomes a final step of the retry when both the retry and the exit hook are defined.

but I would think the additional-step is actually Omitted rather than Skipped

Yes you are correct, which makes total sense.

The user-land workaround there is to modify your exit code appropriately to match your intent.

Thank you, this solution is completely sufficient for us. It might be confusing at first when debugging, but I think with a clear exit message it won't be much of a problem.

@McKMarcBruchner
Author

McKMarcBruchner commented Jun 25, 2024

About part 2,

here is the status of the stuck workflow:
status:
  phase: Running
  startedAt: '2024-06-25T07:33:17Z'
  finishedAt: null
  estimatedDuration: 20
  progress: 0/2
  nodes:
    argo-retry-hook-issue-54bnq:
      id: argo-retry-hook-issue-54bnq
      name: argo-retry-hook-issue-54bnq
      displayName: argo-retry-hook-issue-54bnq
      type: DAG
      templateName: start
      templateScope: local/
      phase: Running
      startedAt: '2024-06-25T07:33:17Z'
      finishedAt: null
      estimatedDuration: 20
      progress: 0/2
      children:
        - argo-retry-hook-issue-54bnq-2742882792
    argo-retry-hook-issue-54bnq-2742882792:
      id: argo-retry-hook-issue-54bnq-2742882792
      name: argo-retry-hook-issue-54bnq.failing-retry-step
      displayName: failing-retry-step
      type: Retry
      templateName: failing-retry-step
      templateScope: local/
      phase: Failed
      boundaryID: argo-retry-hook-issue-54bnq
      message: Error (exit code 1)
      startedAt: '2024-06-25T07:33:17Z'
      finishedAt: '2024-06-25T07:33:27Z'
      estimatedDuration: 10
      progress: 0/2
      resourcesDuration:
        cpu: 0
        memory: 3
      outputs:
        exitCode: '1'
      children:
        - argo-retry-hook-issue-54bnq-352626555
        - argo-retry-hook-issue-54bnq-3765049435
    argo-retry-hook-issue-54bnq-352626555:
      id: argo-retry-hook-issue-54bnq-352626555
      name: argo-retry-hook-issue-54bnq.failing-retry-step(0)
      displayName: failing-retry-step(0)
      type: Pod
      templateName: failing-retry-step
      templateScope: local/
      phase: Failed
      boundaryID: argo-retry-hook-issue-54bnq
      message: Error (exit code 1)
      startedAt: '2024-06-25T07:33:17Z'
      finishedAt: '2024-06-25T07:33:20Z'
      estimatedDuration: 3
      progress: 0/1
      resourcesDuration:
        cpu: 0
        memory: 3
      nodeFlag:
        retried: true
      outputs:
        exitCode: '1'
      hostNodeName: docker-desktop
    argo-retry-hook-issue-54bnq-3765049435:
      id: argo-retry-hook-issue-54bnq-3765049435
      name: argo-retry-hook-issue-54bnq.failing-retry-step.onExit
      displayName: failing-retry-step.onExit
      type: Pod
      templateName: exit-task
      templateScope: local/
      phase: Pending
      boundaryID: argo-retry-hook-issue-54bnq
      startedAt: '2024-06-25T07:33:27Z'
      finishedAt: null
      estimatedDuration: 2
      progress: 0/1
      nodeFlag:
        hooked: true
  storedTemplates:
    namespaced/argo-retry-hook-issue/additional-step:
      name: additional-step
      inputs: {}
      outputs: {}
      metadata: {}
      container:
        name: main
        image: argoproj/argosay:v2
        command:
          - /argosay
        args:
          - echo
          - Hello second step
        resources: {}
    namespaced/argo-retry-hook-issue/exit-task:
      name: exit-task
      inputs: {}
      outputs: {}
      metadata: {}
      container:
        name: main
        image: argoproj/argosay:v2
        command:
          - /argosay
        args:
          - echo
          - Hello exit task
        resources: {}
    namespaced/argo-retry-hook-issue/failing-retry-step:
      name: failing-retry-step
      inputs: {}
      outputs: {}
      metadata: {}
      container:
        name: main
        image: argoproj/argosay:v2
        command:
          - /argosay
        args:
          - exit 1
        resources: {}
      retryStrategy:
        limit: '1'
        retryPolicy: OnError
    namespaced/argo-retry-hook-issue/start:
      name: start
      inputs: {}
      outputs: {}
      metadata: {}
      dag:
        tasks:
          - name: failing-retry-step
            template: failing-retry-step
            arguments: {}
            hooks:
              exit:
                template: exit-task
                arguments: {}
          - name: additional-step
            template: additional-step
            arguments: {}
            dependencies:
              - failing-retry-step
  conditions:
    - type: PodRunning
      status: 'False'
  resourcesDuration:
    cpu: 0
    memory: 3
  storedWorkflowTemplateSpec:
    templates:
      - name: start
        inputs: {}
        outputs: {}
        metadata: {}
        dag:
          tasks:
            - name: failing-retry-step
              template: failing-retry-step
              arguments: {}
              hooks:
                exit:
                  template: exit-task
                  arguments: {}
            - name: additional-step
              template: additional-step
              arguments: {}
              dependencies:
                - failing-retry-step
      - name: failing-retry-step
        inputs: {}
        outputs: {}
        metadata: {}
        container:
          name: main
          image: argoproj/argosay:v2
          command:
            - /argosay
          args:
            - exit 1
          resources: {}
        retryStrategy:
          limit: '1'
          retryPolicy: OnError
      - name: additional-step
        inputs: {}
        outputs: {}
        metadata: {}
        container:
          name: main
          image: argoproj/argosay:v2
          command:
            - /argosay
          args:
            - echo
            - Hello second step
          resources: {}
      - name: exit-task
        inputs: {}
        outputs: {}
        metadata: {}
        container:
          name: main
          image: argoproj/argosay:v2
          command:
            - /argosay
          args:
            - echo
            - Hello exit task
          resources: {}
    entrypoint: start
    arguments: {}
    ttlStrategy:
      secondsAfterCompletion: 8640000
    podGC:
      strategy: OnPodSuccess
    podDisruptionBudget:
      minAvailable: 1
    workflowTemplateRef:
      name: argo-retry-hook-issue
    volumeClaimGC:
      strategy: OnWorkflowCompletion
    workflowMetadata:
      labels:
        example: 'true'
  artifactRepositoryRef:
    default: true
    artifactRepository: {}
  artifactGCStatus:
    notSpecified: true

in addition, here is the yaml of the configmap:

apiVersion: v1
data:
  config: |
    instanceID: xry
    workflowDefaults:
      spec:
        podDisruptionBudget:
          minAvailable: 1
        podGC:
          strategy: OnPodSuccess
        ttlStrategy:
          secondsAfterCompletion: 8640000
        volumeClaimGC:
          strategy: OnWorkflowCompletion
    nodeEvents:
      enabled: true
kind: ConfigMap
metadata:
  labels:
    helm.sh/chart: argo-workflows-0.30.0
  name: argo-workflows-workflow-controller-configmap
  namespace: default

I hope these settings will help to reproduce it, I will also play around a bit with the default settings in the config map.

@McKMarcBruchner
Author

Hmm, even removing all workflowDefaults does not help, nor does using steps instead of the DAG approach.
My local K8s version is v1.29.1, in case that helps with reproducing it, but since it is relatively new, as is the Argo version, I don't think it is the issue.

@McKMarcBruchner
Author

McKMarcBruchner commented Jun 25, 2024

One interesting thing I've noticed about part 1 of my question while testing it with the steps approach instead of the DAG approach:

When I use steps in the main template instead of dag, the main workflow behaves as I would expect in the retry-hook situation:
[Screenshot: 2024-06-25 at 09 59 14]

metadata:
  name: argo-stuck-issue-steps
  namespace: default
spec:
  templates:
    - name: start
      steps:
        - - name: failing-retry-step
            template: failing-retry-step
            hooks:
              exit:
                template: exit-task
        - - name: additional-step
            template: additional-step
    - name: failing-retry-step
      container:
        name: main
        image: argoproj/argosay:v2
        command:
          - /argosay
        args:
          - exit 1
      retryStrategy:
        limit: '1'
        retryPolicy: OnError
    - name: additional-step
      container:
        name: main
        image: argoproj/argosay:v2
        command:
          - /argosay
        args:
          - echo
          - Hello second step
    - name: exit-task
      container:
        name: main
        image: argoproj/argosay:v2
        command:
          - /argosay
        args:
          - echo
          - Hello exit task
  entrypoint: start

The only difference from the workflow I tried before is that the start template uses steps

      steps:
        - - name: failing-retry-step
            template: failing-retry-step
            hooks:
              exit:
                template: exit-task
        - - name: additional-step
            template: additional-step

compared to using dag as in the original workflow:

      dag:
        tasks:
          - name: failing-retry-step
            template: failing-retry-step
            hooks:
              exit:
                template: exit-task
          - name: additional-step
            template: additional-step
            dependencies:
              - failing-retry-step

Is it to be expected that the main start template reacts differently to the retry-hook combination? 🤔

@agilgur5
Member

agilgur5 commented Jun 27, 2024

  • the same workflow without the exit hook but with the retry: the main workflow fails as soon as the retry fails

This is just based on the last task's exit code, so without an exit hook it is more straightforward.

  • This is the most confusing part for me, especially as you can see in the graph that the next steps are then branching off from the actual failing step while the exit hook is just a side job.

Again I'd recommend toggling the "lightning bolt" button to try the slower, alternative layout, which sometimes makes more sense.

In this case the fast layout (which uses a Coffman-Graham Sort) is just showing that those two tasks ran in parallel -- they both kicked off when the retry node completed.

When I use steps in the main template instead of dag, the main workflow behaves as I would expect in the retry-hook situation:

🤔 I think this might be because steps have no Omitted state (or fastFail option) and have to use an explicit continueOn to get past any failure, so the result might be calculated as a straight "did anything without continueOn fail?" (see the sketch after this comment).
I don't actually remember why that differs without re-reading a bunch of code, but that explanation would make logical sense.
The DAG and steps implementations are different (although their code is very similar and has a good bit of duplication), with DAGs being newer and having more features, so there are times they diverge because of that. A few other maintainers and I have been looking to reduce steps and some other mechanisms (like hooks) into more "pure" DAGs to simplify the codebase and reduce bugs per #12694. (We're also potentially looking to rewrite that core state machine/DAG code in a more advanced language like Rust, where the compiler would catch many more edge cases; Workflows has a ton of permutations, so all of these kinds of systematic changes can help reduce that complexity or make it easier to maintain.)

@McKMarcBruchner
Author

McKMarcBruchner commented Jun 27, 2024

That's good to know that DAGs will be the future, as we're currently switching from steps to DAGs!

This is just based on the last task's exit code, so without an exit hook it is more straightforward

Okay, I think we are fine with the solution that the exit hook itself throws if the task status is failed, so that the main workflow also fails. Thanks for your help and explanation here!


That only leaves part 2, about v3.5.5 or higher 😬

Did you get a chance to test my settings? If you managed to run my workflow with those settings successfully on your end, then I will dig deeper into my settings, as argo itself would not be the problem.

@agilgur5
Member

Did you get a chance to test my settings?

I haven't yet, it's still on the to-do list. But nothing caught my eye in your status, so I more or less need to reproduce it to see what's going on; since I haven't done that yet, I haven't commented on that part.

@McKMarcBruchner
Author

Hi @agilgur5 did you find time to check if you can reproduce my issue?

@jswxstw
Member

jswxstw commented Sep 14, 2024

I can't reproduce the exit hook being stuck in Pending; its behaviour is exactly the same as in the old versions:
[Screenshot: QQ_1726302344516]

The workflow phase is Succeeded, not Failed, which is not as expected. There is a bug in assessDAGPhase: the DAG phase is returned as Succeeded because the exit hook node is Succeeded, which should not be taken into account.

I have some questions about task-level hooks, consider the following scenario:

metadata:
  name: argo-retry-hook-issue
  namespace: argo
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: retry-step
            template: retry-step
            hooks:
              exit:
                template: fail
          - name: additional-step
            template: hello
            dependencies:
              - retry-step
    - name: retry-step
      container:
        image: argoproj/argosay:v2
        args: ["exit", "0"]
      retryStrategy:
        limit: 1
    - name: hello
      container:
        image: argoproj/argosay:v2
        args: ["echo", "hello"]
    - name: fail
      container:
        image: argoproj/argosay:v2
        args: ["exit", "1"]

[Screenshot: QQ_1726302798180]

Question: Should the retry step be Failed since its exit hook failed?
P.S. According to the workflow-level exit hook implementation, the workflow will be Failed if its exit hook failed (for comparison, see the sketch after this comment).

@agilgur5 agilgur5 added this to the v3.5.x patches milestone Sep 14, 2024
@agilgur5
Member

agilgur5 commented Sep 14, 2024

I can't reproduce the exit hook being stuck in Pending; its behaviour is exactly the same as in the old versions:

Yes this is probably fixed by one of the WorkflowTaskResults fixes on main, since, as I mentioned above, this is due to task result tracking regressions stemming from #12402.

Thanks @jswxstw for following up on all these stuck Workflows due to WorkflowTaskResults regressions!

Question: Should the retry step be Failed since its exit hook failed?
P.S. According to the workflow-level exit hook implementation, the workflow will be Failed if its exit hook failed.

It's a template level hook in that case, and in the middle of the template, so I think this behavior is correct: the last task's phase is used, unless there's an exit hook after the last task, in which case that hook's phase is used.

There is a bug in assessDAGPhase: DAG phase is returned as Succeeded because the exit hook node is Succeeded

Per your docs reference above and my same comment linked above, since the last hook is Succeeded, this is correct according to the current behavior. #12530 aims to make that behavior more configurable.
