Wrong Pod name in `argo get` command result from CLI #9906

xubofei1983 · 2022-10-25T18:36:00Z

Pre-requisites

I have double-checked my configuration
I can confirm the issues exists when I tested with :latest
I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

Run an example workflow https://github.com/argoproj/argo-workflows/blob/master/examples/retry-on-error.yaml

The Pod name in the result of argo get are wrong.

argo get retry-on-error-v2pk2 -n workflow
Name: retry-on-error-v2pk2
...

STEP TEMPLATE PODNAME DURATION MESSAGE
✖ retry-on-error-v2pk2 error-container No more retries left
├─⚠ retry-on-error-v2pk2(0) error-container retry-on-error-v2pk2-error-container-2869263017 26s Error (exit code 1): failed to put file: 404 Not Found
├─✖ retry-on-error-v2pk2(1) error-container retry-on-error-v2pk2-error-container-2427568992 4s Error (exit code 3)
└─✖ retry-on-error-v2pk2(2) error-container retry-on-error-v2pk2-error-container-816476283 4s Error (exit code 4)

kubectl get pods -n workflow
NAME READY STATUS RESTARTS AGE
retry-on-error-v2pk2-error-container-1195955417 0/2 Completed 0 6m17s
retry-on-error-v2pk2-error-container-1800096796 0/2 Error 0 5m41s
retry-on-error-v2pk2-error-container-3410203767 0/2 Error 0 5m31s

The UI works good
NAME
retry-on-error-v2pk2(0)
ID
retry-on-error-v2pk2-1195955417
POD NAME
retry-on-error-v2pk2-error-container-1195955417

Version

v3.4.1

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-on-error-
spec:
  entrypoint: error-container
  templates:
  - name: error-container
    retryStrategy:
      limit: "2"
      retryPolicy: "Always"   # Retry on errors AND failures. Also available: "OnFailure" (default), "OnError", and "OnTransientError" (retry only on transient errors such as i/o or TLS handshake timeout. Available after v3.0.0-rc2)
    container:
      image: python
      command: ["python", "-c"]
      # fail with a 80% probability
      args: ["import random; import sys; exit_code = random.choice(range(0, 5)); sys.exit(exit_code)"]

Logs from the workflow controller

Not related

Logs from in your workflow's wait container

Not related

The text was updated successfully, but these errors were encountered:

814HiManny · 2022-10-31T04:19:39Z

I am having the same issue here. The pod names when performing argo get don't correspond to actual pod names used in the cluster.

sarabala1979 · 2022-10-31T17:44:50Z

@JPZ13 @rohankmr414 Can you take a look?

ognjen-it · 2022-11-03T14:13:03Z

I am having the same issue here 👍🏻

JPZ13 · 2022-11-03T14:24:03Z

I'm OOO this week @sarabala1979. How's your capacity @rohankmr414 or @isubasinghe?

mweibel · 2022-11-04T12:10:22Z

I have the same issue here. This is happening since version 3.4.0 (I unfortunately only upgraded this week and to 3.4.3 directly but traced it back to 3.4.0).

The case seems to only happen when a retry strategy is set. The hello-world.yaml example does not suffer from the same issue.

» k get po
NAME                                             READY   STATUS    RESTARTS   AGE
retry-on-error-khzpg-error-container-550301540   2/2     Running   0          4s

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  annotations:
    workflows.argoproj.io/pod-name-format: v2
  creationTimestamp: "2022-11-04T12:08:12Z"
  generateName: retry-on-error-
  generation: 2
  labels:
    workflows.argoproj.io/phase: Running
  name: retry-on-error-khzpg
  namespace: default
  resourceVersion: "16229"
  uid: 1d4e6dc4-e2be-475c-9c32-f3aaaef1cdf1
spec:
  arguments: {}
  entrypoint: error-container
  templates:
  - container:
      args:
      - import random; import sys; exit_code = random.choice(range(0, 5)); sys.exit(exit_code)
      command:
      - python
      - -c
      image: python
      name: ""
      resources: {}
    inputs: {}
    metadata: {}
    name: error-container
    outputs: {}
    retryStrategy:
      limit: "2"
      retryPolicy: Always
status:
  artifactGCStatus:
    notSpecified: true
  artifactRepositoryRef:
    artifactRepository: {}
    default: true
  finishedAt: null
  nodes:
    retry-on-error-khzpg:
      children:
      - retry-on-error-khzpg-550301540
      displayName: retry-on-error-khzpg
      finishedAt: null
      id: retry-on-error-khzpg
      name: retry-on-error-khzpg
      phase: Running
      progress: 0/1
      startedAt: "2022-11-04T12:08:12Z"
      templateName: error-container
      templateScope: local/retry-on-error-khzpg
      type: Retry
    retry-on-error-khzpg-550301540:
      displayName: retry-on-error-khzpg(0)
      finishedAt: null
      id: retry-on-error-khzpg-550301540
      name: retry-on-error-khzpg(0)
      phase: Pending
      progress: 0/1
      startedAt: "2022-11-04T12:08:12Z"
      templateName: error-container
      templateScope: local/retry-on-error-khzpg
      type: Pod
  phase: Running
  progress: 0/1
  startedAt: "2022-11-04T12:08:12Z"

It might be related to #6712 and #8748 but I'm not sure why it only happens for retry enabled workflows.

FWIW: Pretty important for us, since we gather data based on the status of workflows and we can't match them to pods right now.

mweibel · 2022-11-04T12:21:12Z

I believe the retry strategy to be relevant because of
https://github.com/argoproj/argo-workflows/blob/master/workflow/controller/operator.go#L1730

And I believe the nodeID in status does get calculated wrongly here:
https://github.com/argoproj/argo-workflows/blob/master/workflow/controller/operator.go#L2266

Since I'm not sure what the proper course of action is to fix this, I won't create a PR for it.

isubasinghe · 2022-11-04T12:28:59Z

@JPZ13 Should be able to handle it first thing Monday. @sarabala1979 feel free to assign me if that timeline is okay with you

isubasinghe · 2022-11-07T11:02:19Z

commit cc9d14c introduces the bug I believe or rather makes the bug appear (this could just be the canary in the coal mine), I checked this with git bisect. Am working on a fix.
Not really sure if this is a side effect of something else or not yet, will update as I make progress.

This is interesting because the json output from argo get is correct. But I don't think it is a formatting issue, something funky is going on I think. I say this because if I submit a workflow on :latest and then checkout a previous commit, it still will display the output in an incorrect manner.

…rgoproj#9995)

mweibel · 2022-11-25T11:42:26Z

@isubasinghe @terrytangyuan unfortunately there is still a bug with the workflow status.

example workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: nodename-
spec:
  arguments: {}
  entrypoint: render
  templates:
    - inputs: {}
      metadata: {}
      name: render
      steps:
        - - arguments:
              parameters:
                - name: frames
                  value: '{{item.frames}}'
            name: run-blender
            template: blender
            withItems:
              - frames: 1
    - container:
        image: argoproj/argosay:v2
        command: ["/bin/sh", "-c"]
        args:
          - /argosay echo 0/100 $ARGO_PROGRESS_FILE && /argosay sleep 10s && /argosay echo 50/100 $ARGO_PROGRESS_FILE && /argosay sleep 10s
        name: ""
      inputs:
        parameters:
          - name: frames
      name: blender
      retryStrategy:
        limit: 2
        retryPolicy: Always

yields the following status:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  annotations:
    workflows.argoproj.io/pod-name-format: v2
  creationTimestamp: "2022-11-25T11:33:41Z"
  generateName: nodename-
  generation: 3
  labels:
    workflows.argoproj.io/phase: Running
  name: nodename-bvd45
  namespace: argo
  resourceVersion: "15649"
  uid: ea233eef-210d-4394-a238-ef847b104458
spec:
  activeDeadlineSeconds: 300
  arguments: {}
  entrypoint: render
  podSpecPatch: |
    terminationGracePeriodSeconds: 3
  templates:
  - inputs: {}
    metadata: {}
    name: render
    outputs: {}
    steps:
    - - arguments:
          parameters:
          - name: frames
            value: '{{item.frames}}'
        name: run-blender
        template: blender
        withItems:
        - frames: 1
  - container:
      args:
      - /argosay echo 0/100 $ARGO_PROGRESS_FILE && /argosay sleep 10s && /argosay
        echo 50/100 $ARGO_PROGRESS_FILE && /argosay sleep 10s
      command:
      - /bin/sh
      - -c
      image: argoproj/argosay:v2
      name: ""
      resources: {}
    inputs:
      parameters:
      - name: frames
    metadata: {}
    name: blender
    outputs: {}
    retryStrategy:
      limit: 2
      retryPolicy: Always
status:
  artifactGCStatus:
    notSpecified: true
  artifactRepositoryRef:
    artifactRepository:
      archiveLogs: true
      s3:
        accessKeySecret:
          key: accesskey
          name: my-minio-cred
        bucket: my-bucket
        endpoint: minio:9000
        insecure: true
        secretKeySecret:
          key: secretkey
          name: my-minio-cred
    configMap: artifact-repositories
    key: default-v1
    namespace: argo
  conditions:
  - status: "False"
    type: PodRunning
  finishedAt: null
  nodes:
    nodename-bvd45:
      children:
      - nodename-bvd45-701773242
      displayName: nodename-bvd45
      finishedAt: null
      id: nodename-bvd45
      name: nodename-bvd45
      phase: Running
      progress: 0/1
      startedAt: "2022-11-25T11:33:41Z"
      templateName: render
      templateScope: local/nodename-bvd45
      type: Steps
    nodename-bvd45-701773242:
      boundaryID: nodename-bvd45
      children:
      - nodename-bvd45-3728066428
      displayName: '[0]'
      finishedAt: null
      id: nodename-bvd45-701773242
      name: nodename-bvd45[0]
      phase: Running
      progress: 0/1
      startedAt: "2022-11-25T11:33:41Z"
      templateScope: local/nodename-bvd45
      type: StepGroup
    nodename-bvd45-3728066428:
      boundaryID: nodename-bvd45
      children:
      - nodename-bvd45-3928099255
      displayName: run-blender(0:frames:1)
      finishedAt: null
      id: nodename-bvd45-3728066428
      inputs:
        parameters:
        - name: frames
          value: "1"
      name: nodename-bvd45[0].run-blender(0:frames:1)
      phase: Running
      progress: 0/1
      startedAt: "2022-11-25T11:33:41Z"
      templateName: blender
      templateScope: local/nodename-bvd45
      type: Retry
    nodename-bvd45-3928099255:
      boundaryID: nodename-bvd45
      displayName: run-blender(0:frames:1)(0)
      finishedAt: null
      hostNodeName: k3d-argowf-server-0
      id: nodename-bvd45-3928099255
      inputs:
        parameters:
        - name: frames
          value: "1"
      message: PodInitializing
      name: nodename-bvd45[0].run-blender(0:frames:1)(0)
      phase: Pending
      progress: 0/1
      startedAt: "2022-11-25T11:33:41Z"
      templateName: blender
      templateScope: local/nodename-bvd45
      type: Pod
  phase: Running
  progress: 0/1
  startedAt: "2022-11-25T11:33:41Z"

The pod is named nodename-bvd45-blender-3928099255 but the node ID is without the template name.

Can you please reopen or should I create a new issue?

isubasinghe · 2022-11-25T11:53:48Z

@mweibel could you please tell me what the desired pod name should be?
The PR addressed the case where a wrong number was generated when pretty printing only.
Is this the yaml output of "argo get"?

I have strong suspicions this is a controller/operator issue and different to the issue initially created, which was formatting based.

If so this issue is distinct from the original issue, is it better to create a new issue to keep them atomic?

mweibel · 2022-11-25T13:02:55Z

Yeah I suspected that the issue at hand is because the argo workflow status doesn't contain the right node IDs and that's why the CLI is unable to access it. I'll create a new issue with the details.

mweibel · 2022-11-25T13:11:00Z

See #10107

xubofei1983 added the type/bug label Oct 25, 2022

sarabala1979 added type/regression Regression from previous behavior (a specific type of bug) P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important labels Oct 31, 2022

isubasinghe mentioned this issue Nov 8, 2022

fix: use correct node name as args to PodName. Fixes #9906 #9995

Merged

8 tasks

terrytangyuan closed this as completed in #9995 Nov 10, 2022

terrytangyuan pushed a commit that referenced this issue Nov 10, 2022

fix: use correct node name as args to PodName. Fixes #9906 (#9995)

67bcdb5

mweibel pushed a commit to helio/argo-workflows that referenced this issue Nov 24, 2022

fix: use correct node name as args to PodName. Fixes argoproj#9906 (a…

9e0ac23

…rgoproj#9995)

mweibel mentioned this issue Nov 25, 2022

Invalid node IDs in workflow.status #10107

Open

3 tasks

SlaterByte mentioned this issue Jun 21, 2023

argo get displays incorrect PODNAME when Node has a templateRef #11250

Closed

3 tasks

terrytangyuan mentioned this issue Jun 22, 2023

Ensure single source of truth for generated values. #10267

Open

agilgur5 added the area/cli The `argo` CLI label Jan 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wrong Pod name in `argo get` command result from CLI #9906

Wrong Pod name in `argo get` command result from CLI #9906

xubofei1983 commented Oct 25, 2022

814HiManny commented Oct 31, 2022 •

edited

Loading

sarabala1979 commented Oct 31, 2022

ognjen-it commented Nov 3, 2022

JPZ13 commented Nov 3, 2022

mweibel commented Nov 4, 2022

mweibel commented Nov 4, 2022

isubasinghe commented Nov 4, 2022 •

edited

Loading

isubasinghe commented Nov 7, 2022 •

edited

Loading

mweibel commented Nov 25, 2022

isubasinghe commented Nov 25, 2022 •

edited

Loading

mweibel commented Nov 25, 2022

mweibel commented Nov 25, 2022

Wrong Pod name in argo get command result from CLI #9906

Wrong Pod name in argo get command result from CLI #9906

Comments

xubofei1983 commented Oct 25, 2022

Pre-requisites

What happened/what you expected to happen?

Version

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

Logs from the workflow controller

Logs from in your workflow's wait container

814HiManny commented Oct 31, 2022 • edited Loading

sarabala1979 commented Oct 31, 2022

ognjen-it commented Nov 3, 2022

JPZ13 commented Nov 3, 2022

mweibel commented Nov 4, 2022

mweibel commented Nov 4, 2022

isubasinghe commented Nov 4, 2022 • edited Loading

isubasinghe commented Nov 7, 2022 • edited Loading

mweibel commented Nov 25, 2022

isubasinghe commented Nov 25, 2022 • edited Loading

mweibel commented Nov 25, 2022

mweibel commented Nov 25, 2022

Wrong Pod name in `argo get` command result from CLI #9906

Wrong Pod name in `argo get` command result from CLI #9906

814HiManny commented Oct 31, 2022 •

edited

Loading

isubasinghe commented Nov 4, 2022 •

edited

Loading

isubasinghe commented Nov 7, 2022 •

edited

Loading

isubasinghe commented Nov 25, 2022 •

edited

Loading