UI doesn't correctly return logs #4425

Closed
hsezhiyan opened this issue Aug 28, 2020 · 9 comments

hsezhiyan commented Aug 28, 2020

What steps did you take:

Made a simple KFP run using an MNIST PyTorch training/evaluation script.
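
(For illustration only: a minimal sketch of a two-step train/evaluate pipeline with the KFP v1.0 SDK. Images, script names, and arguments below are placeholders, not the actual pipeline used here.)

    # Hedged sketch of a two-step MNIST train/evaluate pipeline (KFP v1.0 SDK).
    # Image names, script names, and arguments are illustrative placeholders.
    import kfp
    from kfp import dsl

    @dsl.pipeline(name="mnist-pytorch-example", description="Train and evaluate MNIST.")
    def mnist_pipeline():
        train = dsl.ContainerOp(
            name="train-model-io",
            image="pytorch/pytorch:1.6.0-cuda10.1-cudnn7-runtime",
            command=["python", "train.py"],
            arguments=["--batch-size", "64"],
        )
        evaluate = dsl.ContainerOp(
            name="evaluate-io",
            image="pytorch/pytorch:1.6.0-cuda10.1-cudnn7-runtime",
            command=["python", "evaluate.py"],
            arguments=["--test-batch-size", "64"],
        )
        evaluate.after(train)

    if __name__ == "__main__":
        kfp.compiler.Compiler().compile(mnist_pipeline, "mnist_pipeline.yaml")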

What happened:

When making a KFP run, the UI occasionally does not retrieve logs properly and shows the error Failed to retrieve pod logs. Inspecting the Network tab in Chrome DevTools shows the underlying API call fails with the following error: Could not get main container logs: Error: Unable to retrieve workflow status: [object Object].

What did you expect to happen:

All the steps in the pipeline UI should correctly return logs.

Environment:

How did you deploy Kubeflow Pipelines (KFP)?
Deployed KFP on Amazon EKS infrastructure.

KFP version: v1.0.0

KFP SDK version: v1.0.0

/kind bug

eterna2 commented Aug 28, 2020

I will look at it. The error message means that when the UI backend tries to parse the workflow CR, the status field is malformed.

In my experience, this usually means that Argo failed in such a way that the status event was never emitted or captured.

But let me see if I can reproduce it.

hsezhiyan commented Aug 28, 2020

Thank you @eedorenko, this is a closely related issue: #2705 (comment). Let me know if you need help reproducing the bug.

For more context, I'm attaching an image of our pipeline. One thing we noticed is that every step except the first is duplicated. In each pair of duplicated steps, the second has correct logs but the first does not. So in the image, Train model io works but train-model-io doesn't.

Furthermore, the steps that aren't logging correctly run in pods that have type: Retry, whereas the correctly functioning pods have type: Pod.

Failing node's status in the Workflow CR (note the type: Retry at the bottom):

mnist-pytorch-example-zn7gp-399244972:
      boundaryID: mnist-pytorch-example-zn7gp
      children:
      - mnist-pytorch-example-zn7gp-1204333607
      displayName: evaluate-io
      finishedAt: "2020-08-27T19:27:42Z"
      id: mnist-pytorch-example-zn7gp-399244972
      inputs:
        parameters:
        - name: batch_size
          value: "64"
        - name: input_data_path
          value: mnist_pytorch_example/data
        - name: out_path
          value: /opt/zillow/shared/mnist_cnn.pth
        - name: test_accuracy_threshold
          value: "0.9"
        - name: test_batch_size
          value: "64"
        - name: train-model-io-create-shared-vol-name
          value: mnist-pytorch-example-zn7gp-shared-vol
        - name: train_accuracy_threshold
          value: "0.9"
      name: mnist-pytorch-example-zn7gp.evaluate-io
      outputs:
        artifacts:
        - name: mlpipeline-ui-metadata
          optional: true
          path: /tmp/outputs/mlpipeline_ui_metadata/data
          s3:
            accessKeySecret:
              key: accesskey
              name: mlpipeline-minio-artifact
            bucket: mlpipeline
            endpoint: minio-service.kubeflow:9000
            insecure: true
            key: artifacts/mnist-pytorch-example-zn7gp/mnist-pytorch-example-zn7gp-1204333607/mlpipeline-ui-metadata.tgz
            secretKeySecret:
              key: secretkey
              name: mlpipeline-minio-artifact
        - name: mlpipeline-metrics
          optional: true
          path: /tmp/outputs/mlpipeline_metrics/data
          s3:
            accessKeySecret:
              key: accesskey
              name: mlpipeline-minio-artifact
            bucket: mlpipeline
            endpoint: minio-service.kubeflow:9000
            insecure: true
            key: artifacts/mnist-pytorch-example-zn7gp/mnist-pytorch-example-zn7gp-1204333607/mlpipeline-metrics.tgz
            secretKeySecret:
              key: secretkey
              name: mlpipeline-minio-artifact
      phase: Succeeded
      startedAt: "2020-08-27T19:26:20Z"
      type: Retry

Working node's status in the Workflow CR (note the type: Pod at the bottom):

mnist-pytorch-example-zn7gp-1204333607:
      boundaryID: mnist-pytorch-example-zn7gp
      displayName: evaluate-io(0)
      finishedAt: "2020-08-27T19:27:41Z"
      id: mnist-pytorch-example-zn7gp-1204333607
      name: mnist-pytorch-example-zn7gp.evaluate-io(0)
      outputs:
        artifacts:
        - name: mlpipeline-ui-metadata
          optional: true
          path: /tmp/outputs/mlpipeline_ui_metadata/data
          s3:
            accessKeySecret:
              key: accesskey
              name: mlpipeline-minio-artifact
            bucket: mlpipeline
            endpoint: minio-service.kubeflow:9000
            insecure: true
            key: artifacts/mnist-pytorch-example-zn7gp/mnist-pytorch-example-zn7gp-1204333607/mlpipeline-ui-metadata.tgz
            secretKeySecret:
              key: secretkey
              name: mlpipeline-minio-artifact
        - name: mlpipeline-metrics
          optional: true
          path: /tmp/outputs/mlpipeline_metrics/data
          s3:
            accessKeySecret:
              key: accesskey
              name: mlpipeline-minio-artifact
            bucket: mlpipeline
            endpoint: minio-service.kubeflow:9000
            insecure: true
            key: artifacts/mnist-pytorch-example-zn7gp/mnist-pytorch-example-zn7gp-1204333607/mlpipeline-metrics.tgz
            secretKeySecret:
              key: secretkey
              name: mlpipeline-minio-artifact
      phase: Succeeded
      startedAt: "2020-08-27T19:26:20Z"
      templateName: evaluate-io
      type: Pod
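
(Side note: these node entries come from the Workflow CR's status.nodes map. A minimal hedged sketch of pulling them with the Kubernetes Python client, assuming cluster access; the namespace below is an assumption:)

    # Hedged sketch: list every node in the Workflow CR with its type, to spot
    # which entries are Retry nodes versus real Pod nodes. The workflow name is
    # taken from this issue; the namespace is a placeholder.
    from kubernetes import client, config

    def list_workflow_node_types(name: str, namespace: str) -> None:
        config.load_kube_config()
        wf = client.CustomObjectsApi().get_namespaced_custom_object(
            group="argoproj.io", version="v1alpha1",
            namespace=namespace, plural="workflows", name=name)
        for node_id, node in wf["status"]["nodes"].items():
            print(node_id, node["type"], node.get("displayName", ""))

    list_workflow_node_types("mnist-pytorch-example-zn7gp", "kubeflow")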

[Screenshot: pipeline graph showing the duplicated steps]

eterna2 commented Sep 1, 2020

@hsezhiyan

Is it possible for you to provide me with a dummy pipeline.yaml so I can reproduce it on my machine?

Did you use the set_retry method?

op.set_retry(2)
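
(For reference, a hedged sketch of how set_retry is typically attached to an op in a pipeline definition; the op below is a placeholder, not from the reporter's pipeline:)

    # Hedged illustration: attach a retry limit to a (placeholder) ContainerOp.
    from kfp import dsl

    def build_evaluate_op() -> dsl.ContainerOp:
        op = dsl.ContainerOp(
            name="evaluate-io",
            image="python:3.8",                 # placeholder image
            command=["python", "evaluate.py"],  # placeholder command
        )
        op.set_retry(2)  # compiled into an Argo retryStrategy on this template
        return op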

I was unable to reproduce it on my machine. But I looked through the code, and this particular error is raised when we are unable to retrieve the Argo workflow CR properly.

Either it is a permission issue (unlikely, since the other pods work), or we inferred the workflow name incorrectly (we just drop the last segment from the pod name), or the wrong pod name is passed to the backend.

Is it possible for you to share your pipeline YAML, and if possible the actual API request that fails (the query string, i.e. the pod name), as well as the full workflow YAML?
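
(The workflow-name inference mentioned above can be illustrated roughly as follows; the real UI backend is TypeScript, so this Python sketch only mirrors the logic:)

    # Hedged sketch: Argo pod names look like "<workflow-name>-<suffix>", so the
    # backend infers the workflow name by dropping the last dash-separated segment.
    def workflow_name_from_pod_name(pod_name: str) -> str:
        return pod_name.rsplit("-", 1)[0]

    # e.g. "mnist-pytorch-example-zn7gp-399244972" -> "mnist-pytorch-example-zn7gp"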

hsezhiyan commented Sep 1, 2020

Hello @eterna2,

Of course! Thank you for getting back to me. I didn't use the set_retry method.

Here is the pipeline YAML file. Here is the workflow YAML file.

Here are examples of failed API calls (retrieved from the Network tab in Chrome DevTools):
logs?podname=mnist-pytorch-example-f9prl-3877604630&podnamespace=hariharans
logs?podname=mnist-pytorch-example-f9prl-1107911560&podnamespace=hariharans
logs?podname=mnist-pytorch-example-f9prl-1507110387&podnamespace=hariharans

eterna2 commented Sep 3, 2020

OK, I figured out the reason for the errors, although I'm not too sure how we can solve this.

The error occurs because Argo does not archive "Retry" pods at all. And it seems the pod is either never created or garbage-collected the moment it errors out (before any logs can be archived to MinIO/S3).

I might need to write a small pod event monitor app to track whether the pod is created at all, and whether it is garbage-collected too quickly (so I can tell whether this is an Argo issue or a KFP issue).
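
(A minimal hedged sketch of such a pod-event watcher with the Kubernetes Python client; the namespace and label selector are assumptions, not the tool actually used here:)

    # Hedged sketch: stream pod events for Argo workflow pods, so we can see
    # whether a "Retry" pod is ever created and how quickly it is deleted.
    from kubernetes import client, config, watch

    def watch_workflow_pods(namespace: str = "kubeflow") -> None:
        config.load_kube_config()  # or config.load_incluster_config() in-cluster
        v1 = client.CoreV1Api()
        for event in watch.Watch().stream(
                v1.list_namespaced_pod,
                namespace=namespace,
                label_selector="workflows.argoproj.io/workflow"):
            pod = event["object"]
            print(event["type"], pod.metadata.name, pod.status.phase)

    watch_workflow_pods()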

eterna2 commented Sep 3, 2020

OK, I did more investigation. The "Retry" pod was never created; I did not capture any pod events for it at all.

I think we might need to raise a ticket upstream with Argo. I have no idea why this happens (is it creating a virtual pod?).

eterna2 commented Sep 3, 2020

@Bobgy

I did more tests. This is a UI bug. When you set a retry limit for your ops, Argo will always create a Retry node, which is not an actual physical pod, so this node does not have any logs or pod info.

This node will always exist, regardless of whether an actual retry happens.

So there are two possible solutions to this issue: either we do not render Retry nodes at all, or we give Retry nodes a separate color scheme with no metadata (i.e. no artifacts, no logs, no pod info, etc.), since this information is already reflected in the actual Pod node.

Bobgy commented Sep 3, 2020

Thank you for the detailed investigation!
Which solution do you think is easier to implement?

I still don't fully understand the topology here.

eterna2 commented Sep 8, 2020

I have looked at the DAG graph rendering code. There is an existing function to filter out virtual nodes; I just need to add Retry as another virtual node type.

I will make a PR for this.

This is the original DAG with the Retry node (2nd node).
[Screenshot: original DAG, including the virtual Retry node]

My PR will remove this virtual node.
[Screenshot: DAG with the virtual Retry node removed]
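
(Illustrative sketch of the filtering idea only; the actual KFP frontend code is TypeScript, and the function and constant names below are hypothetical:)

    # Hedged sketch: drop Argo "virtual" nodes (no physical pod, hence no logs or
    # pod info) before rendering the graph; Retry is added to the virtual set.
    VIRTUAL_NODE_TYPES = {"StepGroup", "DAG", "TaskGroup", "Retry"}

    def renderable_nodes(status_nodes: dict) -> dict:
        """Keep only workflow nodes that correspond to a physical pod."""
        return {
            node_id: node
            for node_id, node in status_nodes.items()
            if node.get("type") not in VIRTUAL_NODE_TYPES
        }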

eterna2 added a commit to e2forks/pipelines that referenced this issue Sep 8, 2020
…e virtual nodes with no physical counterpart - e.g. pod logs.
Jeffwan pushed a commit to Jeffwan/pipelines that referenced this issue Dec 9, 2020
…l nodes. Fixes kubeflow#4425 kubeflow#2705 (kubeflow#4474)

* Fix kubeflow#4425 kubeflow#2705: do not render Retry nodes as they are virtual nodes with no physical counterpart - e.g. pod logs.

* Add unit test for filtering out the virtual retry node