
Invalid node IDs in workflow.status #10107

Open · 2 of 3 tasks
mweibel opened this issue Nov 25, 2022 · 23 comments
Labels
P1 (High priority: all bugs with >=5 thumbs up that aren’t P0, plus any other bugs deemed high priority) · type/bug · type/regression (Regression from previous behavior, a specific type of bug)

Comments

mweibel (Contributor) commented Nov 25, 2022

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

Since #8748, pod names use the V2 naming scheme, which includes the template name in the pod name. However, the implementation did not update the Workflow.Status.Nodes map to reflect the new pod names. There is now a disconnect between node IDs and pod names which didn't exist before, and it makes it impossible to look at an Argo workflow status, take the node ID, and know which pod it belongs to.

This is a follow-up to #9906. I initially thought it was the same case, but apparently it is not.

The reproduction workflow (pasted at the end of this issue) results in the following Workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  annotations:
    workflows.argoproj.io/pod-name-format: v2
  creationTimestamp: "2022-11-25T11:33:41Z"
  generateName: nodename-
  generation: 3
  labels:
    workflows.argoproj.io/phase: Running
  name: nodename-bvd45
  namespace: argo
  resourceVersion: "15649"
  uid: ea233eef-210d-4394-a238-ef847b104458
spec:
  activeDeadlineSeconds: 300
  arguments: {}
  entrypoint: render
  podSpecPatch: |
    terminationGracePeriodSeconds: 3
  templates:
  - inputs: {}
    metadata: {}
    name: render
    outputs: {}
    steps:
    - - arguments:
          parameters:
          - name: frames
            value: '{{item.frames}}'
        name: run-blender
        template: blender
        withItems:
        - frames: 1
  - container:
      args:
      - /argosay echo 0/100 $ARGO_PROGRESS_FILE && /argosay sleep 10s && /argosay
        echo 50/100 $ARGO_PROGRESS_FILE && /argosay sleep 10s
      command:
      - /bin/sh
      - -c
      image: argoproj/argosay:v2
      name: ""
      resources: {}
    inputs:
      parameters:
      - name: frames
    metadata: {}
    name: blender
    outputs: {}
    retryStrategy:
      limit: 2
      retryPolicy: Always
status:
  artifactGCStatus:
    notSpecified: true
  artifactRepositoryRef:
    artifactRepository:
      archiveLogs: true
      s3:
        accessKeySecret:
          key: accesskey
          name: my-minio-cred
        bucket: my-bucket
        endpoint: minio:9000
        insecure: true
        secretKeySecret:
          key: secretkey
          name: my-minio-cred
    configMap: artifact-repositories
    key: default-v1
    namespace: argo
  conditions:
  - status: "False"
    type: PodRunning
  finishedAt: null
  nodes:
    nodename-bvd45:
      children:
      - nodename-bvd45-701773242
      displayName: nodename-bvd45
      finishedAt: null
      id: nodename-bvd45
      name: nodename-bvd45
      phase: Running
      progress: 0/1
      startedAt: "2022-11-25T11:33:41Z"
      templateName: render
      templateScope: local/nodename-bvd45
      type: Steps
    nodename-bvd45-701773242:
      boundaryID: nodename-bvd45
      children:
      - nodename-bvd45-3728066428
      displayName: '[0]'
      finishedAt: null
      id: nodename-bvd45-701773242
      name: nodename-bvd45[0]
      phase: Running
      progress: 0/1
      startedAt: "2022-11-25T11:33:41Z"
      templateScope: local/nodename-bvd45
      type: StepGroup
    nodename-bvd45-3728066428:
      boundaryID: nodename-bvd45
      children:
      - nodename-bvd45-3928099255
      displayName: run-blender(0:frames:1)
      finishedAt: null
      id: nodename-bvd45-3728066428
      inputs:
        parameters:
        - name: frames
          value: "1"
      name: nodename-bvd45[0].run-blender(0:frames:1)
      phase: Running
      progress: 0/1
      startedAt: "2022-11-25T11:33:41Z"
      templateName: blender
      templateScope: local/nodename-bvd45
      type: Retry
    nodename-bvd45-3928099255:
      boundaryID: nodename-bvd45
      displayName: run-blender(0:frames:1)(0)
      finishedAt: null
      hostNodeName: k3d-argowf-server-0
      id: nodename-bvd45-3928099255
      inputs:
        parameters:
        - name: frames
          value: "1"
      message: PodInitializing
      name: nodename-bvd45[0].run-blender(0:frames:1)(0)
      phase: Pending
      progress: 0/1
      startedAt: "2022-11-25T11:33:41Z"
      templateName: blender
      templateScope: local/nodename-bvd45
      type: Pod
  phase: Running
  progress: 0/1
  startedAt: "2022-11-25T11:33:41Z"

The pod that runs is named nodename-bvd45-blender-3928099255, but the node ID in workflow.status.nodes is just nodename-bvd45-3928099255.

Version

latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: nodename-
spec:
  arguments: {}
  entrypoint: render
  templates:
    - inputs: {}
      metadata: {}
      name: render
      steps:
        - - arguments:
              parameters:
                - name: frames
                  value: '{{item.frames}}'
            name: run-blender
            template: blender
            withItems:
              - frames: 1
    - container:
        image: argoproj/argosay:v2
        command: ["/bin/sh", "-c"]
        args:
          - /argosay echo 0/100 $ARGO_PROGRESS_FILE && /argosay sleep 10s && /argosay echo 50/100 $ARGO_PROGRESS_FILE && /argosay sleep 10s
        name: ""
      inputs:
        parameters:
          - name: frames
      name: blender
      retryStrategy:
        limit: 2
        retryPolicy: Always

Logs from the workflow controller

irrelevant

Logs from in your workflow's wait container

irrelevant
mweibel added the type/bug and type/regression labels Nov 25, 2022
@ognjen-it

I have the same error :)
Just a thumbs up for the issue 👍🏻

alexec (Contributor) commented Nov 28, 2022

Node IDs are opaque identifiers, not pod names. Not a bug.

ognjen-it commented Nov 28, 2022

Hi @alexec, thank you for your message. I didn't read this issue properly; he wrote about the node ID.
However, there is a problem with the podName.
The real pod name is ognjen-587tr-template-1243882462, but the podName in workflow.status is ognjen-587tr-1243882462. It's 100% a bug.
I don't know whether I need to create a new issue.

alexec (Contributor) commented Nov 28, 2022

Ah. That is a bug. Can you raise a separate issue?

@ognjen-it

@alexec #10124 done :)

@caelan-io (Member)

Closed per @alexec's comment:

Node IDs are opaque identifiers, not pod names. Not a bug.

chensun commented May 5, 2023

Node IDs are opaque identifiers, not pod names. Not a bug.

@alexec We have been relying on the node ID being equal to the pod name for the past few years, until we recently tried to upgrade from v3.3.8 to v3.4.7 and hit this issue. So this is a breaking change at minimum. Is there any reason why we can't or shouldn't use the pod name as the node ID?

Also, if I'm not mistaken, workflow.status contains the pod name only in failure cases. If the node ID is different from the pod name, how can we get the pod name in non-failure cases?

Thanks in advance!

chensun commented May 5, 2023

Since #8748, pod names use the V2 naming scheme, which includes the template name in the pod name. However, the implementation did not update the Workflow.Status.Nodes map to reflect the new pod names. There is now a disconnect between node IDs and pod names which didn't exist before, and it makes it impossible to look at an Argo workflow status, take the node ID, and know which pod it belongs to.

+1 on what @mweibel said. This describes the issue precisely. This regression is blocking us from upgrading to v3.4.

@terrytangyuan (Member)

Let's reopen this since it blocks the upgrade for KFP.

terrytangyuan reopened this May 5, 2023
terrytangyuan (Member) commented May 9, 2023

@chensun To unblock your upgrade, you can construct/infer the podName from node.ID and node.templateName. For example:

nodeId: coinflip-tmrzc-3636679105
templateName: flip-coin
podName: coinflip-tmrzc-flip-coin-3636679105

This is basically how we get podName in the UI:

export const getPodName = (workflowName: string, nodeName: string, templateName: string, nodeID: string, version: string): string => {
    if (version === POD_NAME_V2 && templateName !== '') {
        if (workflowName === nodeName) {
            return workflowName;
        }

        const prefix = ensurePodNamePrefixLength(`${workflowName}-${templateName}`);

        const hash = createFNVHash(nodeName);
        return `${prefix}-${hash}`;
    }

    return nodeID;
};

and here's the podName implementation for the failure case:

func GeneratePodName(workflowName, nodeName, templateName, nodeID string, version PodNameVersion) string {
	if version == PodNameV1 {
		return nodeID
	}

	if workflowName == nodeName {
		return workflowName
	}

	prefix := workflowName
	if !strings.Contains(nodeName, ".inline") {
		prefix = fmt.Sprintf("%s-%s", workflowName, templateName)
	}
	prefix = ensurePodNamePrefixLength(prefix)

	h := fnv.New32a()
	_, _ = h.Write([]byte(nodeName))

	return fmt.Sprintf("%s-%v", prefix, h.Sum32())

}
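
For illustration, here is a self-contained sketch (it does not import Argo; the length constants are assumptions based on the Kubernetes name limit and the hash-suffix length) that mirrors the GeneratePodName logic above and derives the pod name for the Pod node shown in this issue's workflow.status:

package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// Assumed: the Kubernetes resource-name limit (253) minus room for the hash suffix.
const maxPrefixLength = 253 - 10

// podNameV2 mirrors the v2 branch of GeneratePodName shown above.
func podNameV2(workflowName, nodeName, templateName string) string {
	if workflowName == nodeName {
		return workflowName
	}
	prefix := workflowName
	if !strings.Contains(nodeName, ".inline") {
		prefix = fmt.Sprintf("%s-%s", workflowName, templateName)
	}
	if len(prefix) > maxPrefixLength-1 {
		prefix = prefix[:maxPrefixLength-1]
	}
	h := fnv.New32a()
	_, _ = h.Write([]byte(nodeName))
	return fmt.Sprintf("%s-%v", prefix, h.Sum32())
}

func main() {
	// Values taken from the workflow.status example in the issue description.
	name := podNameV2(
		"nodename-bvd45",
		"nodename-bvd45[0].run-blender(0:frames:1)(0)",
		"blender",
	)
	fmt.Println(name) // expected per the issue: nodename-bvd45-blender-3928099255
}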

@terrytangyuan (Member)

I don't think you should rely on nodeId as Alex mentioned earlier:

Node IDs are opaque identifiers, not pod names. Not a bug.

Instead, it might be a good enhancement proposal to also add podName to node statuses for non-failure cases.

chensun commented May 9, 2023

@terrytangyuan Thank you for your reply.

To unblock your upgrade, you can construct/infer the podName from node.ID and node.templateName

This is rather hacky and liable to break with future upgrades. The way Argo constructs the pod name is an implementation detail that no user should ever have to reimplement themselves. In fact, even within Argo's own codebase there should be a single source of truth (SOT) for pod name generation. There has been a chain of bugs following the v2 pod naming change. Please consider #10267.

This is basically how we get podName in the UI:

In fact, there's possibly an open bug in the code you pasted. See #11015 and the proposed fix. Again, this speaks to having a single SOT for pod name generation.

I don't think you should rely on nodeId as Alex mentioned earlier:

Node IDs are opaque identifiers, not pod names. Not a bug.

Instead, it might be a good enhancement proposal to also add podName to node statuses for non-failure cases.

Can anyone explain why node IDs cannot be the same as pod names (which was the case before v3.4)? Node IDs can be arbitrary UUIDs if you want, but being able to know the k8s pod name, whether or not the node failed, is a super important feature. The missing pod name is a regression and an undocumented breaking change of the kind that should be avoided in minor version bumps.

@terrytangyuan (Member)

Yes I agree it's hacky and could only be a temporary workaround to unblock your release.

cc @isubasinghe @JPZ13 @juliev0 who contributed the feature and relevant fixes. Any thoughts?

juliev0 (Contributor) commented May 10, 2023

Yes I agree it's hacky and could only be a temporary workaround to unblock your release.

cc @isubasinghe @JPZ13 @juliev0 who contributed the feature and relevant fixes. Any thoughts?

Hey Terry. I'm not sure exactly what the question is, but I believe JP put in the original work for updating to pod names V2 and has been looking at many of the bugs for that since then. (Although the very first issue I worked on was switching to V2 as the default, and I also fixed a related issue afterward. I was actually never really told the motivation for the change.)

@terrytangyuan (Member)

I don't recall being involved in that conversation either. Perhaps @alexec or @sarabala1979 have more context around the motivation behind the change?

JPZ13 added the P2 (Important) label May 11, 2023
JPZ13 (Member) commented May 11, 2023

Chiming in: the motivation for switching from node IDs to pod names that include the workflow template name was so that a user looking through their individual pods with kubectl get pods can quickly figure out which pod corresponds to which step in a workflow. Most end users and infrastructure users really like this feature, per the comment on #7279:

It's just better, especially for tracing, debugging

I'll confirm in the codebase, but if you don't want the pod names to include the workflow template name and need the pod names to just be the node ID, you can set the POD_NAMES environment variable to v1 to go back to node-ID pod names.

cc @terrytangyuan @chensun

JPZ13 (Member) commented May 11, 2023

Following up, the code path for v1 pod names (node ids) still exists: https://github.com/argoproj/argo-workflows/blob/master/workflow/util/pod_name.go#L38
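
For reference, a minimal sketch (an assumption based on the comments above, not the controller's exact code; the variable is presumably read by the workflow controller) of how a POD_NAMES-style switch would select between the two naming schemes:

package main

import (
	"fmt"
	"os"
)

type PodNameVersion string

const (
	PodNameV1 PodNameVersion = "v1" // pod name == node ID (pre-3.4 behaviour)
	PodNameV2 PodNameVersion = "v2" // workflow-template-hash naming (default)
)

// getPodNameVersion sketches the fallback described above: POD_NAMES=v1
// restores node-ID pod names; anything else keeps the v2 scheme.
func getPodNameVersion() PodNameVersion {
	if os.Getenv("POD_NAMES") == "v1" {
		return PodNameV1
	}
	return PodNameV2
}

func main() {
	fmt.Println("pod name format:", getPodNameVersion())
}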

chensun commented May 15, 2023

@JPZ13 Thanks for your reply. I can see the value of adding more info to the pod name to make it more debug-friendly. I wasn't questioning the change to the pod name itself, but rather why the node ID cannot follow the same scheme.

I can try the POD_NAMES env variable. But given the number of bugs and fixes around this topic, I'm not very confident that changing the environment variable won't break other cases. Please do consider #10267.

JPZ13 (Member) commented May 15, 2023

I very much agree with #10267

stale bot removed the problem/stale label Jul 11, 2023
aaron-arellano commented Aug 21, 2023

We use Argo heavily at ZG, and when upgrading to v3.4 of Argo Workflows we noticed breaking changes this causes beyond the node ID in the workflow. The v2 naming convention also breaks the upstream Kubernetes HOSTNAME environment variable for the pod. For instance, if we take the HOSTNAME in the workflow pod and call the Kubernetes API (get_namespace_pod) with that HOSTNAME, the pod name returned by the API server does not match the pod name in the pod's actual metadata.name. Not sure if there is some odd concatenation going on that is not persisted to etcd, but the downward API does not match what is in etcd. I reverted the env var POD_NAMES to v1 and everything works in v3.4. With all these bugs, I feel the v2 pod naming should be reverted, because the scope of breaking changes extends beyond Argo itself and into Kubernetes.

related to #10267
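
For anyone trying to reproduce the mismatch described above, a minimal client-go sketch (assumed setup, not the reporter's actual code; POD_NAMESPACE is assumed to be injected via the downward API) that compares the pod's HOSTNAME with the name the API server returns:

package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster configuration: this sketch is meant to run inside the workflow pod.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	hostname := os.Getenv("HOSTNAME")       // set by Kubernetes to the pod name
	namespace := os.Getenv("POD_NAMESPACE") // assumed: injected via the downward API

	pod, err := clientset.CoreV1().Pods(namespace).Get(context.TODO(), hostname, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("HOSTNAME=%s metadata.name=%s\n", hostname, pod.Name)
}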

juliev0 added the P1 (High priority) label and removed the P2 (Important) label Sep 7, 2023
@henrywangx (Contributor)

Can anyone explain why node IDs cannot be the same as pod names (which was the case before v3.4)? Node IDs can be arbitrary UUIDs if you want, but being able to know the k8s pod name, whether or not the node failed, is a super important feature.

Agreed. If we change the pod name with v2, we can't get the pod name from the status. It's really inconvenient.
