UI doesn't correctly return logs #4425

Closed
hsezhiyan opened this issue Aug 28, 2020 · 9 comments

hsezhiyan commented Aug 28, 2020

What steps did you take:

Made a simple KFP run using an MNIST PyTorch training/evaluation script.
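
(For illustration only: a minimal sketch of a two-step train/evaluate pipeline with the KFP v1.0 SDK. Images, script names, and arguments below are placeholders, not the actual pipeline used here.)

    # Hedged sketch of a two-step MNIST train/evaluate pipeline (KFP v1.0 SDK).
    # Image names, script names, and arguments are illustrative placeholders.
    import kfp
    from kfp import dsl

    @dsl.pipeline(name="mnist-pytorch-example", description="Train and evaluate MNIST.")
    def mnist_pipeline():
        train = dsl.ContainerOp(
            name="train-model-io",
            image="pytorch/pytorch:1.6.0-cuda10.1-cudnn7-runtime",
            command=["python", "train.py"],
            arguments=["--batch-size", "64"],
        )
        evaluate = dsl.ContainerOp(
            name="evaluate-io",
            image="pytorch/pytorch:1.6.0-cuda10.1-cudnn7-runtime",
            command=["python", "evaluate.py"],
            arguments=["--test-batch-size", "64"],
        )
        evaluate.after(train)

    if __name__ == "__main__":
        kfp.compiler.Compiler().compile(mnist_pipeline, "mnist_pipeline.yaml")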

What happened:

When making a KFP run, the UI occasionally does not retrieve logs properly and shows the error Failed to retrieve pod logs. Inspecting the Network tab in Chrome DevTools shows the underlying API call fails with the following error: Could not get main container logs: Error: Unable to retrieve workflow status: [object Object].

What did you expect to happen:

All the steps in the pipeline UI should correctly return logs.

Environment:

How did you deploy Kubeflow Pipelines (KFP)?
Deployed KFP on Amazon EKS infrastructure.

KFP version: v1.0.0

KFP SDK version: v1.0.0

/kind bug

eterna2 commented Aug 28, 2020

I will look at it. The error message means that when the UI backend tries to parse the workflow CR, the status field is malformed.

In my experience, this usually means that Argo failed in such a way that the status event was never emitted or captured.

But let me see if I can reproduce it.

hsezhiyan commented Aug 28, 2020

Thank you @eedorenko, this is a closely related issue: #2705 (comment). Let me know if you need help reproducing the bug.

For more context, I'm attaching an image of our pipeline. One thing we noticed is that every step except the first is duplicated. In each pair of duplicated steps, the second has correct logs but the first does not. So in the image, Train model io works but train-model-io doesn't.

Furthermore, the steps that aren't logging correctly run in pods that have type: Retry, whereas the correctly functioning pods have type: Pod.

Failing node's status in the Workflow CR (note the type: Retry at the bottom):

mnist-pytorch-example-zn7gp-399244972:
      boundaryID: mnist-pytorch-example-zn7gp
      children:
      - mnist-pytorch-example-zn7gp-1204333607
      displayName: evaluate-io
      finishedAt: "2020-08-27T19:27:42Z"
      id: mnist-pytorch-example-zn7gp-399244972
      inputs:
        parameters:
        - name: batch_size
          value: "64"
        - name: input_data_path
          value: mnist_pytorch_example/data
        - name: out_path
          value: /opt/zillow/shared/mnist_cnn.pth
        - name: test_accuracy_threshold
          value: "0.9"
        - name: test_batch_size
          value: "64"
        - name: train-model-io-create-shared-vol-name
          value: mnist-pytorch-example-zn7gp-shared-vol
        - name: train_accuracy_threshold
          value: "0.9"
      name: mnist-pytorch-example-zn7gp.evaluate-io
      outputs:
        artifacts:
        - name: mlpipeline-ui-metadata
          optional: true
          path: /tmp/outputs/mlpipeline_ui_metadata/data
          s3:
            accessKeySecret:
              key: accesskey
              name: mlpipeline-minio-artifact
            bucket: mlpipeline
            endpoint: minio-service.kubeflow:9000
            insecure: true
            key: artifacts/mnist-pytorch-example-zn7gp/mnist-pytorch-example-zn7gp-1204333607/mlpipeline-ui-metadata.tgz
            secretKeySecret:
              key: secretkey
              name: mlpipeline-minio-artifact
        - name: mlpipeline-metrics
          optional: true
          path: /tmp/outputs/mlpipeline_metrics/data
          s3:
            accessKeySecret:
              key: accesskey
              name: mlpipeline-minio-artifact
            bucket: mlpipeline
            endpoint: minio-service.kubeflow:9000
            insecure: true
            key: artifacts/mnist-pytorch-example-zn7gp/mnist-pytorch-example-zn7gp-1204333607/mlpipeline-metrics.tgz
            secretKeySecret:
              key: secretkey
              name: mlpipeline-minio-artifact
      phase: Succeeded
      startedAt: "2020-08-27T19:26:20Z"
      type: Retry

Working node's status in the Workflow CR (note the type: Pod at the bottom):

mnist-pytorch-example-zn7gp-1204333607:
      boundaryID: mnist-pytorch-example-zn7gp
      displayName: evaluate-io(0)
      finishedAt: "2020-08-27T19:27:41Z"
      id: mnist-pytorch-example-zn7gp-1204333607
      name: mnist-pytorch-example-zn7gp.evaluate-io(0)
      outputs:
        artifacts:
        - name: mlpipeline-ui-metadata
          optional: true
          path: /tmp/outputs/mlpipeline_ui_metadata/data
          s3:
            accessKeySecret:
              key: accesskey
              name: mlpipeline-minio-artifact
            bucket: mlpipeline
            endpoint: minio-service.kubeflow:9000
            insecure: true
            key: artifacts/mnist-pytorch-example-zn7gp/mnist-pytorch-example-zn7gp-1204333607/mlpipeline-ui-metadata.tgz
            secretKeySecret:
              key: secretkey
              name: mlpipeline-minio-artifact
        - name: mlpipeline-metrics
          optional: true
          path: /tmp/outputs/mlpipeline_metrics/data
          s3:
            accessKeySecret:
              key: accesskey
              name: mlpipeline-minio-artifact
            bucket: mlpipeline
            endpoint: minio-service.kubeflow:9000
            insecure: true
            key: artifacts/mnist-pytorch-example-zn7gp/mnist-pytorch-example-zn7gp-1204333607/mlpipeline-metrics.tgz
            secretKeySecret:
              key: secretkey
              name: mlpipeline-minio-artifact
      phase: Succeeded
      startedAt: "2020-08-27T19:26:20Z"
      templateName: evaluate-io
      type: Pod
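
(Side note: these node entries come from the Workflow CR's status.nodes map. A minimal hedged sketch of pulling them with the Kubernetes Python client, assuming cluster access; the namespace below is an assumption:)

    # Hedged sketch: list every node in the Workflow CR with its type, to spot
    # which entries are Retry nodes versus real Pod nodes. The workflow name is
    # taken from this issue; the namespace is a placeholder.
    from kubernetes import client, config

    def list_workflow_node_types(name: str, namespace: str) -> None:
        config.load_kube_config()
        wf = client.CustomObjectsApi().get_namespaced_custom_object(
            group="argoproj.io", version="v1alpha1",
            namespace=namespace, plural="workflows", name=name)
        for node_id, node in wf["status"]["nodes"].items():
            print(node_id, node["type"], node.get("displayName", ""))

    list_workflow_node_types("mnist-pytorch-example-zn7gp", "kubeflow")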

[Screenshot: pipeline graph showing the duplicated steps]

eterna2 commented Sep 1, 2020

@hsezhiyan

Is it possible for you to provide me with a dummy pipeline.yaml so I can reproduce it on my machine?

Did you use the set_retry method?

op.set_retry(2)
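
(For reference, a hedged sketch of how set_retry is typically attached to an op in a pipeline definition; the op below is a placeholder, not from the reporter's pipeline:)

    # Hedged illustration: attach a retry limit to a (placeholder) ContainerOp.
    from kfp import dsl

    def build_evaluate_op() -> dsl.ContainerOp:
        op = dsl.ContainerOp(
            name="evaluate-io",
            image="python:3.8",                 # placeholder image
            command=["python", "evaluate.py"],  # placeholder command
        )
        op.set_retry(2)  # compiled into an Argo retryStrategy on this template
        return op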

I was unable to reproduce it on my machine. But I looked through the code, and this particular error is raised when we are unable to retrieve the Argo workflow CR properly.

Either it is a permission issue (unlikely, since the other pods work), or we inferred the workflow name incorrectly (we just drop the last segment from the pod name), or the wrong pod name is passed to the backend.

Is it possible for you to share your pipeline YAML, and if possible the actual API request that fails (the query string, i.e. the pod name), as well as the full workflow YAML?
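
(The workflow-name inference mentioned above can be illustrated roughly as follows; the real UI backend is TypeScript, so this Python sketch only mirrors the logic:)

    # Hedged sketch: Argo pod names look like "<workflow-name>-<suffix>", so the
    # backend infers the workflow name by dropping the last dash-separated segment.
    def workflow_name_from_pod_name(pod_name: str) -> str:
        return pod_name.rsplit("-", 1)[0]

    # e.g. "mnist-pytorch-example-zn7gp-399244972" -> "mnist-pytorch-example-zn7gp"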

hsezhiyan commented Sep 1, 2020

Hello @eterna2,

Of course! Thank you for getting back to me. I didn't use the set_retry method.

Here is the pipeline YAML file. Here is the workflow YAML file.

Here are examples of failed API calls (retrieved from the Network tab in Chrome DevTools):
logs?podname=mnist-pytorch-example-f9prl-3877604630&podnamespace=hariharans
logs?podname=mnist-pytorch-example-f9prl-1107911560&podnamespace=hariharans
logs?podname=mnist-pytorch-example-f9prl-1507110387&podnamespace=hariharans

eterna2 commented Sep 3, 2020

OK, I figured out the reason for the errors, although I'm not too sure how we can solve this.

The error occurs because Argo does not archive "Retry" pods at all. And it seems the pod is either never created or garbage-collected the moment it errors out (before any logs can be archived to MinIO/S3).

I might need to write a small pod event monitor app to track whether the pod is created at all, and whether it is garbage-collected too quickly (so I can tell whether this is an Argo issue or a KFP issue).
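
(A minimal hedged sketch of such a pod-event watcher with the Kubernetes Python client; the namespace and label selector are assumptions, not the tool actually used here:)

    # Hedged sketch: stream pod events for Argo workflow pods, so we can see
    # whether a "Retry" pod is ever created and how quickly it is deleted.
    from kubernetes import client, config, watch

    def watch_workflow_pods(namespace: str = "kubeflow") -> None:
        config.load_kube_config()  # or config.load_incluster_config() in-cluster
        v1 = client.CoreV1Api()
        for event in watch.Watch().stream(
                v1.list_namespaced_pod,
                namespace=namespace,
                label_selector="workflows.argoproj.io/workflow"):
            pod = event["object"]
            print(event["type"], pod.metadata.name, pod.status.phase)

    watch_workflow_pods()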

eterna2 commented Sep 3, 2020

OK, I did more investigation. The "Retry" pod was never created; I did not capture any pod events for it at all.

I think we might need to raise a ticket upstream with Argo. I have no idea why this happens (is it creating a virtual pod?).

eterna2 commented Sep 3, 2020

@Bobgy

I did more tests. This is a UI bug. When you set a retry limit for your ops, Argo will always create a Retry node, which is not an actual physical pod, so this node does not have any logs or pod info.

This node will always exist, regardless of whether an actual retry happens.

So there are two possible solutions to this issue: either we do not render Retry nodes at all, or we give Retry nodes a separate color scheme with no metadata (i.e. no artifacts, no logs, no pod info, etc.), since this information is already reflected in the actual Pod node.

Bobgy commented Sep 3, 2020

Thank you for the detailed investigation!
Which solution do you think is easier to implement?

I still don't fully understand the topology here.

eterna2 commented Sep 8, 2020

I have looked at the DAG graph rendering code. There is an existing function to filter out virtual nodes; I just need to add Retry as another virtual node type.

I will make a PR for this.

This is the original DAG with the Retry node (2nd node).
[Screenshot: original DAG, including the virtual Retry node]

My PR will remove this virtual node.
[Screenshot: DAG with the virtual Retry node removed]
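
(Illustrative sketch of the filtering idea only; the actual KFP frontend code is TypeScript, and the function and constant names below are hypothetical:)

    # Hedged sketch: drop Argo "virtual" nodes (no physical pod, hence no logs or
    # pod info) before rendering the graph; Retry is added to the virtual set.
    VIRTUAL_NODE_TYPES = {"StepGroup", "DAG", "TaskGroup", "Retry"}

    def renderable_nodes(status_nodes: dict) -> dict:
        """Keep only workflow nodes that correspond to a physical pod."""
        return {
            node_id: node
            for node_id, node in status_nodes.items()
            if node.get("type") not in VIRTUAL_NODE_TYPES
        }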

eterna2 added a commit to e2forks/pipelines that referenced this issue Sep 8, 2020
…e virtual nodes with no physical counterpart - e.g. pod logs.
Jeffwan pushed a commit to Jeffwan/pipelines that referenced this issue Dec 9, 2020
…l nodes. Fixes kubeflow#4425 kubeflow#2705 (kubeflow#4474)

* Fix kubeflow#4425 kubeflow#2705: do not render Retry nodes as they are virtual nodes with no physical counterpart - e.g. pod logs.

* Add unit test for filtering out the virtual retry node