Retry with node selector does not retry only specified step #11279

Closed
2 of 3 tasks
sstaley-hioscar opened this issue Jun 29, 2023 · 11 comments
Labels
area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries P3 Low priority solution/superseded This PR or issue has been superseded by another one (slightly different from a duplicate) type/bug

Comments

@sstaley-hioscar

sstaley-hioscar commented Jun 29, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

When I run a workflow with two parallel steps, I can't restart only one of them. If I run, for instance,

argo retry greeter-workflow-steps-dev-vx9zf -n infra-compute --node-field-selector="id=greeter-workflow-steps-dev-vx9zf-2583895212"

all failed nodes are restarted, despite my node field selector.

Am I running the command wrong?
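
To pin down the gap between the expected and the observed behaviour, here is a toy Python model. It is not Argo's code: the function names are hypothetical, and the node IDs and phases are copied from the controller log further down (the `wcd48` run, where step3 and step4 both fail).

```python
# Toy model only. Node data is taken from the controller log in this issue;
# the functions below are hypothetical and do not mirror Argo's implementation.
NODES = [
    {"id": "greeter-workflow-steps-dev-wcd48-4236060887", "displayName": "step1", "phase": "Succeeded"},
    {"id": "greeter-workflow-steps-dev-wcd48-4252838506", "displayName": "step2", "phase": "Succeeded"},
    {"id": "greeter-workflow-steps-dev-wcd48-4269616125", "displayName": "step3", "phase": "Failed"},
    {"id": "greeter-workflow-steps-dev-wcd48-4152172792", "displayName": "step4", "phase": "Failed"},
]


def parse_selector(selector):
    """Split 'k1=v1,k2=v2' into a dict of equality requirements."""
    requirements = {}
    for item in selector.split(","):
        key, sep, value = item.partition("=")
        if sep:  # ignore fragments without '='
            requirements[key.strip()] = value.strip()
    return requirements


def matches(node, requirements):
    """True when the node satisfies every key=value requirement."""
    return all(node.get(key) == value for key, value in requirements.items())


def expected_retry_set(nodes, selector):
    """What this report expects: only failed nodes that match the selector."""
    requirements = parse_selector(selector)
    return [n["displayName"] for n in nodes
            if n["phase"] == "Failed" and matches(n, requirements)]


def observed_retry_set(nodes):
    """What this report observes: every failed node, whatever the selector says."""
    return [n["displayName"] for n in nodes if n["phase"] == "Failed"]


selector = "id=greeter-workflow-steps-dev-wcd48-4269616125"
print("expected:", expected_retry_set(NODES, selector))
print("observed:", observed_retry_set(NODES))
```

With the selector above, the expected set contains only step3, while the observed set contains both step3 and step4.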

Version

v3.4.8

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

We use a Helm chart that adds env vars to all containers, so I can only paste the "spec" portion.

spec:
    entrypoint: steps
    templates:
      - name: steps
        steps:
          - - name: step1
              inline:
                container:
                  image: alpine:3.7
                  env: []
                  args: [echo, 'hello inline']
            - name: step2
              template: template-with-input
              arguments:
                parameters: [{name: message, value: 'hello1'}]
            - name: step3
              template: fail
            - name: step4
              template: fail
      - name: container-only
        container:
          env:
            - name: 'KEY'
              value: 'VALUE2'
          image: alpine:3.7
          args: [ echo, '$KEY']
      - name: template-with-input
        inputs:
          parameters:
            - name: message
        container:
          image: alpine:3.7
          args: [echo, 'message is {{inputs.parameters.message}}']
      - name: fail
        container:
          image: alpine:3.7
          args: [exit, '1']

or

spec:
    entrypoint: steps
    templates:
      - name: steps
        steps:
          - - name: step1
              inline:
                container:
                  image: alpine:3.7
                  env: []
                  args: [echo, 'hello inline']
            - name: step2
              template: fail
          - - name: step3
              template: template-with-input
              arguments:
                parameters: [{name: message, value: 'hello1'}]
            - name: step4
              template: fail
      - name: container-only
        container:
          env:
            - name: 'KEY'
              value: 'VALUE2'
          image: alpine:3.7
          args: [ echo, '$KEY']
      - name: template-with-input
        inputs:
          parameters:
            - name: message
        container:
          image: alpine:3.7
          args: [echo, 'message is {{inputs.parameters.message}}']
      - name: fail
        container:
          image: alpine:3.7
          args: [exit, '1']

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Found 2 pods, using pod/workflows-argo-workflows-workflow-controller-694df9c9b-mkh9t
time="2023-06-29T21:08:11.857Z" level=info msg="Processing workflow" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:11.896Z" level=info msg="Updated phase  -> Running" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:11.896Z" level=info msg="Steps node greeter-workflow-steps-dev-wcd48 initialized Running" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:11.896Z" level=info msg="StepGroup node greeter-workflow-steps-dev-wcd48-3982387348 initialized Running" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:11.896Z" level=info msg="Pod node greeter-workflow-steps-dev-wcd48-4236060887 initialized Pending" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:11.927Z" level=info msg="Created pod: greeter-workflow-steps-dev-wcd48[0].step1 (greeter-workflow-steps-dev-wcd48--4236060887)" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:11.927Z" level=info msg="Pod node greeter-workflow-steps-dev-wcd48-4252838506 initialized Pending" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:11.946Z" level=info msg="Created pod: greeter-workflow-steps-dev-wcd48[0].step2 (greeter-workflow-steps-dev-wcd48-template-with-input-4252838506)" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:11.946Z" level=info msg="Pod node greeter-workflow-steps-dev-wcd48-4269616125 initialized Pending" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:11.966Z" level=info msg="Created pod: greeter-workflow-steps-dev-wcd48[0].step3 (greeter-workflow-steps-dev-wcd48-fail-4269616125)" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:11.966Z" level=info msg="Pod node greeter-workflow-steps-dev-wcd48-4152172792 initialized Pending" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:11.986Z" level=info msg="Created pod: greeter-workflow-steps-dev-wcd48[0].step4 (greeter-workflow-steps-dev-wcd48-fail-4152172792)" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:11.986Z" level=info msg="Workflow step group node greeter-workflow-steps-dev-wcd48-3982387348 not yet completed" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:11.986Z" level=info msg="TaskSet Reconciliation" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:11.986Z" level=info msg=reconcileAgentPod namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:11.999Z" level=info msg="Workflow update successful" namespace=infra-compute phase=Running resourceVersion=1112641463 workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:21.927Z" level=info msg="Processing workflow" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:21.927Z" level=info msg="Task-result reconciliation" namespace=infra-compute numObjs=0 workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:21.927Z" level=info msg="node changed" namespace=infra-compute new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=greeter-workflow-steps-dev-wcd48-4269616125 old.message= old.phase=Pending old.progress=0/1 workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:21.927Z" level=info msg="node changed" namespace=infra-compute new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=greeter-workflow-steps-dev-wcd48-4236060887 old.message= old.phase=Pending old.progress=0/1 workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:21.927Z" level=info msg="node changed" namespace=infra-compute new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=greeter-workflow-steps-dev-wcd48-4152172792 old.message= old.phase=Pending old.progress=0/1 workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:21.927Z" level=info msg="node changed" namespace=infra-compute new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=greeter-workflow-steps-dev-wcd48-4252838506 old.message= old.phase=Pending old.progress=0/1 workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:21.928Z" level=info msg="Workflow step group node greeter-workflow-steps-dev-wcd48-3982387348 not yet completed" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:21.928Z" level=info msg="TaskSet Reconciliation" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:21.928Z" level=info msg=reconcileAgentPod namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:21.939Z" level=info msg="Workflow update successful" namespace=infra-compute phase=Running resourceVersion=1112641864 workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:31.940Z" level=info msg="Processing workflow" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:31.940Z" level=info msg="Task-result reconciliation" namespace=infra-compute numObjs=0 workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:31.940Z" level=info msg="node unchanged" namespace=infra-compute nodeID=greeter-workflow-steps-dev-wcd48-4152172792 workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:31.940Z" level=info msg="node unchanged" namespace=infra-compute nodeID=greeter-workflow-steps-dev-wcd48-4252838506 workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:31.940Z" level=info msg="node unchanged" namespace=infra-compute nodeID=greeter-workflow-steps-dev-wcd48-4236060887 workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:31.940Z" level=info msg="node unchanged" namespace=infra-compute nodeID=greeter-workflow-steps-dev-wcd48-4269616125 workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:31.941Z" level=info msg="Workflow step group node greeter-workflow-steps-dev-wcd48-3982387348 not yet completed" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:31.941Z" level=info msg="TaskSet Reconciliation" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:31.941Z" level=info msg=reconcileAgentPod namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.168Z" level=info msg="Processing workflow" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.168Z" level=info msg="Task-result reconciliation" namespace=infra-compute numObjs=0 workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.168Z" level=info msg="node changed" namespace=infra-compute new.message= new.phase=Succeeded new.progress=0/1 nodeID=greeter-workflow-steps-dev-wcd48-4236060887 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.168Z" level=info msg="Pod failed: Error (exit code 64): failed to find name in PATH: exec: \"exit\": executable file not found in $PATH" displayName=step3 namespace=infra-compute pod=greeter-workflow-steps-dev-wcd48-fail-4269616125 templateName=fail workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.168Z" level=info msg="node changed" namespace=infra-compute new.message="Error (exit code 64): failed to find name in PATH: exec: \"exit\": executable file not found in $PATH" new.phase=Failed new.progress=0/1 nodeID=greeter-workflow-steps-dev-wcd48-4269616125 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.168Z" level=info msg="node changed" namespace=infra-compute new.message= new.phase=Succeeded new.progress=0/1 nodeID=greeter-workflow-steps-dev-wcd48-4252838506 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.168Z" level=info msg="Pod failed: Error (exit code 64): failed to find name in PATH: exec: \"exit\": executable file not found in $PATH" displayName=step4 namespace=infra-compute pod=greeter-workflow-steps-dev-wcd48-fail-4152172792 templateName=fail workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.168Z" level=info msg="node changed" namespace=infra-compute new.message="Error (exit code 64): failed to find name in PATH: exec: \"exit\": executable file not found in $PATH" new.phase=Failed new.progress=0/1 nodeID=greeter-workflow-steps-dev-wcd48-4152172792 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.169Z" level=info msg="Step group node greeter-workflow-steps-dev-wcd48-3982387348 deemed failed: child 'greeter-workflow-steps-dev-wcd48-4269616125' failed" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.169Z" level=info msg="node greeter-workflow-steps-dev-wcd48-3982387348 phase Running -> Failed" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.169Z" level=info msg="node greeter-workflow-steps-dev-wcd48-3982387348 message: child 'greeter-workflow-steps-dev-wcd48-4269616125' failed" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.169Z" level=info msg="node greeter-workflow-steps-dev-wcd48-3982387348 finished: 2023-06-29 21:08:56.169247735 +0000 UTC" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.169Z" level=info msg="step group greeter-workflow-steps-dev-wcd48-3982387348 was unsuccessful: child 'greeter-workflow-steps-dev-wcd48-4269616125' failed" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.169Z" level=info msg="Outbound nodes of greeter-workflow-steps-dev-wcd48-4236060887 is [greeter-workflow-steps-dev-wcd48-4236060887]" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.169Z" level=info msg="Outbound nodes of greeter-workflow-steps-dev-wcd48-4252838506 is [greeter-workflow-steps-dev-wcd48-4252838506]" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.169Z" level=info msg="Outbound nodes of greeter-workflow-steps-dev-wcd48-4269616125 is [greeter-workflow-steps-dev-wcd48-4269616125]" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.169Z" level=info msg="Outbound nodes of greeter-workflow-steps-dev-wcd48-4152172792 is [greeter-workflow-steps-dev-wcd48-4152172792]" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.169Z" level=info msg="Outbound nodes of greeter-workflow-steps-dev-wcd48 is [greeter-workflow-steps-dev-wcd48-4236060887 greeter-workflow-steps-dev-wcd48-4252838506 greeter-workflow-steps-dev-wcd48-4269616125 greeter-workflow-steps-dev-wcd48-4152172792]" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.169Z" level=info msg="node greeter-workflow-steps-dev-wcd48 phase Running -> Failed" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.169Z" level=info msg="node greeter-workflow-steps-dev-wcd48 message: child 'greeter-workflow-steps-dev-wcd48-4269616125' failed" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.169Z" level=info msg="node greeter-workflow-steps-dev-wcd48 finished: 2023-06-29 21:08:56.16977965 +0000 UTC" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.169Z" level=info msg="Checking daemoned children of greeter-workflow-steps-dev-wcd48" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.169Z" level=info msg="TaskSet Reconciliation" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.169Z" level=info msg=reconcileAgentPod namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.169Z" level=info msg="Updated phase Running -> Failed" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.170Z" level=info msg="Updated message  -> child 'greeter-workflow-steps-dev-wcd48-4269616125' failed" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.170Z" level=info msg="Marking workflow completed" namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.170Z" level=info msg="Checking daemoned children of " namespace=infra-compute workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.175Z" level=info msg="cleaning up pod" action=deletePod key=infra-compute/greeter-workflow-steps-dev-wcd48-1340600742-agent/deletePod
time="2023-06-29T21:08:56.185Z" level=info msg="Workflow update successful" namespace=infra-compute phase=Failed resourceVersion=1112643153 workflow=greeter-workflow-steps-dev-wcd48
time="2023-06-29T21:08:56.186Z" level=info msg="Queueing Failed workflow infra-compute/greeter-workflow-steps-dev-wcd48 for delete in 5m0s due to TTL"
time="2023-06-29T21:09:01.195Z" level=info msg="cleaning up pod" action=deletePod key=infra-compute/greeter-workflow-steps-dev-wcd48-template-with-input-4252838506/deletePod
time="2023-06-29T21:09:01.195Z" level=info msg="cleaning up pod" action=deletePod key=infra-compute/greeter-workflow-steps-dev-wcd48-fail-4269616125/deletePod
time="2023-06-29T21:09:01.195Z" level=info msg="cleaning up pod" action=deletePod key=infra-compute/greeter-workflow-steps-dev-wcd48-fail-4152172792/deletePod
time="2023-06-29T21:09:01.195Z" level=info msg="cleaning up pod" action=deletePod key=infra-compute/greeter-workflow-steps-dev-wcd48--4236060887/deletePod
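
The log above is long, so a small filter helps when looking for the nodes that actually changed phase. The following is a minimal Python sketch, assuming the controller emits logfmt-style `key=value` records with optionally double-quoted values, as in the lines above. The helper names `parse_logfmt` and `phase_changes` are made up for this example.

```python
import shlex


def parse_logfmt(line):
    """Parse one logfmt record (key=value, values optionally double-quoted) into a dict."""
    record = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip stray tokens that carry no '='
            record[key] = value
    return record


def phase_changes(lines):
    """Yield (nodeID, old phase, new phase) for 'node changed' records whose phase moved."""
    for line in lines:
        rec = parse_logfmt(line)
        if rec.get("msg") == "node changed" and rec.get("old.phase") != rec.get("new.phase"):
            yield rec["nodeID"], rec["old.phase"], rec["new.phase"]


# Two records copied verbatim from the controller log above.
SAMPLE = [
    r'time="2023-06-29T21:08:56.168Z" level=info msg="node changed" namespace=infra-compute new.message= new.phase=Succeeded new.progress=0/1 nodeID=greeter-workflow-steps-dev-wcd48-4236060887 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=greeter-workflow-steps-dev-wcd48',
    r'time="2023-06-29T21:08:56.168Z" level=info msg="node changed" namespace=infra-compute new.message="Error (exit code 64): failed to find name in PATH: exec: \"exit\": executable file not found in $PATH" new.phase=Failed new.progress=0/1 nodeID=greeter-workflow-steps-dev-wcd48-4269616125 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=greeter-workflow-steps-dev-wcd48',
]

for node_id, old, new in phase_changes(SAMPLE):
    print(f"{node_id}: {old} -> {new}")
```

`shlex.split` handles the quoted `msg` values and the escaped quotes in the error message, but it raises `ValueError` on a line with unbalanced quotes, so truncated log lines would need separate handling.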

Logs from your workflow's wait container

N/A
@spy-1234

spy-1234 commented Jul 1, 2023

How can I get involved in contributing to Argo Workflows?

@terrytangyuan
Member

Relevant code is here. Would anyone like to take a look? https://github.com/argoproj/argo-workflows/blob/master/workflow/util/util.go#L804

@tooptoop4
Contributor

I think this relates to #10675, where the possible keys and values for the node selector are very hard to know.

@sstaley-hioscar
Author

sstaley-hioscar commented Jul 2, 2023

@tooptoop4 Are you saying it might just be an issue with my CLI command? If so, that would be great.

@sstaley-hioscar
Author

If this is indeed a bug, I may be able to justify some time toward this in the next couple of months. We're evaluating Argo Workflows right now as a DAG solution. We still have a lot of work to do toward that end, but being able to retry only specific failed nodes is a requested feature of any solution we go with.

@tooptoop4
Contributor

tooptoop4 commented Jul 3, 2023

The trouble is that we don't have an enum of the available values, so we could be sending an argument of step(0).abc when it really expects step-0.template.abc or something else. We just don't know.
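
One way to narrow down the guesswork is to dump the fields each node actually carries and build selector strings from them. The Python sketch below works on a workflow object such as the JSON from `kubectl get workflow <name> -o json`. Treat the key list (`id`, `name`, `displayName`, `templateName`, `phase`, `inputs.parameters.<name>.value`) as an assumption based on a reading of the `util.go` code linked above, not as a documented enum. The sample object is a hypothetical, trimmed-down workflow whose values come from the log and spec in this issue.

```python
def selector_fields(node):
    """Flatten one status.nodes entry into candidate selector fields.

    The key names are an assumption, not a documented list.
    """
    fields = {key: str(node.get(key, ""))
              for key in ("id", "name", "displayName", "templateName", "phase")}
    for param in (node.get("inputs") or {}).get("parameters") or []:
        fields["inputs.parameters.%s.value" % param["name"]] = str(param.get("value", ""))
    return fields


def list_selectors(workflow):
    """Map each node ID to its sorted 'key=value' candidates, skipping empty values."""
    nodes = (workflow.get("status") or {}).get("nodes") or {}
    return {
        node_id: sorted(f"{key}={value}"
                        for key, value in selector_fields(node).items() if value)
        for node_id, node in nodes.items()
    }


# Hypothetical, trimmed-down workflow object for illustration only.
WORKFLOW = {
    "status": {
        "nodes": {
            "greeter-workflow-steps-dev-wcd48-4252838506": {
                "id": "greeter-workflow-steps-dev-wcd48-4252838506",
                "name": "greeter-workflow-steps-dev-wcd48[0].step2",
                "displayName": "step2",
                "templateName": "template-with-input",
                "phase": "Succeeded",
                "inputs": {"parameters": [{"name": "message", "value": "hello1"}]},
            }
        }
    }
}

for node_id, candidates in list_selectors(WORKFLOW).items():
    print(node_id)
    for candidate in candidates:
        print("  ", candidate)
```

For a real run, replace `WORKFLOW` with the result of `json.load` on the saved `kubectl` output.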

@sarabala1979 sarabala1979 added the P3 Low priority label Jul 6, 2023
@stale

This comment was marked as resolved.

@stale stale bot added the problem/stale This has not had a response in some time label Sep 17, 2023
@sstaley-hioscar

This comment was marked as resolved.

@agilgur5 agilgur5 added area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries and removed problem/stale This has not had a response in some time labels Sep 18, 2023
@tooptoop4
Contributor

Any luck, @sstaley-hioscar?

@sstaley-hioscar
Author

sstaley-hioscar commented Sep 16, 2024

So far no one's requested the feature on our end

@agilgur5
Member

Seems like this has been superseded by #12543 which has had a decent bit of activity

@agilgur5 agilgur5 added the solution/superseded This PR or issue has been superseded by another one (slightly different from a duplicate) label Oct 18, 2024
6 participants