UI doesn't correctly return logs #4425
Comments
I will look at it. But the error message means that when the UI backend tries to parse the workflow CR, the status field is malformed. In my experience, that usually means Argo failed in such a way that the status event was not emitted or captured. But let me see if I can reproduce it.
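For context, here is a minimal sketch of how the workflow CR and its status field can be fetched and checked, using the Kubernetes Python client rather than the actual TypeScript UI backend; the workflow name and namespace are placeholders:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
api = client.CustomObjectsApi()

# Fetch the Argo Workflow custom resource (placeholder name/namespace).
workflow = api.get_namespaced_custom_object(
    group="argoproj.io",
    version="v1alpha1",
    namespace="kubeflow",
    plural="workflows",
    name="my-pipeline-run-abc123",
)

# The UI needs workflow["status"]["nodes"] to locate a step's pod and logs;
# if the status field is missing or malformed, log retrieval fails.
status = workflow.get("status")
if not status or "nodes" not in status:
    raise RuntimeError("Unable to retrieve workflow status")
```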
Thank you @eedorenko, this is a closely related issue: #2705 (comment). Let me know if you need help reproducing the bug. For more context, I'm attaching an image of our pipeline. One thing we noticed is that every step except the first is duplicated. In each pair of duplicated steps, the second has correct logs but the first doesn't log correctly. Furthermore, the steps that aren't logging correctly run in pods whose specs differ from the working ones; I'm attaching both the failing pod's spec and the working pod's spec for comparison.
Is it possible for you to provide me with a dummy pipeline.yaml so I can reproduce it on my machine? Did you use op.set_retry(2)? I was unable to reproduce it on my machine, but I looked through the code, and this particular error is raised when we are unable to retrieve the Argo workflow CR properly. Either it is a permission issue (unlikely if other pods work), or we inferred the workflow name wrongly (we just drop the last segment from the pod name), or the wrong pod name is passed to the backend. Is it possible for you to share your pipeline yaml, and if possible the actual API request which fails (the query strings, i.e. the pod name), and also the full workflow yaml?
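For illustration, a minimal Python sketch of the name inference described above (the real backend is TypeScript; the pod name below is made up):

```python
def workflow_name_from_pod_name(pod_name: str) -> str:
    """Infer the Argo workflow name by dropping the last dash-separated
    segment of the pod name, as described above."""
    return pod_name.rsplit("-", 1)[0]


# Hypothetical pod name, for illustration only.
assert workflow_name_from_pod_name("my-pipeline-run-abc123-1234567890") == "my-pipeline-run-abc123"
```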
Hello @eterna2, of course! Thank you for getting back to me. I didn't use op.set_retry(2). Here is the pipeline yaml file. Here is the workflow yaml file. Here are examples of failed API calls (retrieved from the Chrome browser's network tab).
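For context on the option asked about above, this is roughly what setting a retry limit looks like with the KFP v1 SDK; the op below is a made-up example, not a step from the pipeline in question:

```python
import kfp
from kfp import dsl


@dsl.pipeline(name="retry-example")
def retry_example_pipeline():
    # Hypothetical op for illustration; set_retry(2) lets Argo retry the
    # step up to 2 times, which also adds a virtual "Retry" node to the
    # compiled workflow.
    train = dsl.ContainerOp(
        name="train",
        image="python:3.7",
        command=["python", "-c", "print('training step')"],
    )
    train.set_retry(2)


if __name__ == "__main__":
    kfp.compiler.Compiler().compile(retry_example_pipeline, "retry_example.yaml")
```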
Ok, I figured out the reason for the errors, although I'm not too sure how we can solve this. The error occurs because Argo does not archive "Retry" pods at all, and it seems that either the pod is never created or it was garbage-collected the moment it errored out (before any logs could be archived to minio/s3). I might need to write a small pod event monitor app to track whether the pod is created at all, and whether it was garbage-collected too quickly (so I can tell whether it is an Argo issue or a KFP issue).
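A small pod event monitor of the kind described could be sketched with the Kubernetes Python client; the namespace is a placeholder:

```python
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

# Stream events in the pipeline namespace (placeholder) and print pod-related
# ones, so we can see whether the "Retry" pod is ever created and how quickly
# it is garbage-collected.
w = watch.Watch()
for event in w.stream(v1.list_namespaced_event, namespace="kubeflow"):
    obj = event["object"]
    if obj.involved_object.kind == "Pod":
        print(f"{obj.last_timestamp} {obj.involved_object.name}: {obj.reason} - {obj.message}")
```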
Ok, I did more investigation. The "Retry" pod was never created, as I did not capture any pod events from it at all. I think we might need to raise a ticket upstream with Argo. I have no idea why this happens (creating a virtual pod?).
I did more tests. This is a UI bug. When you set a retry limit for your ops, Argo will always create a virtual "Retry" node in the workflow status, and this node will always exist regardless of whether there is an actual retry or not. So there are two possible solutions to this issue; the simplest is to not render the Retry nodes in the UI at all, since they have no backing pod (see the sketch below).
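Conceptually, the fix that was eventually merged (see the PR reference below) filters these virtual nodes out before rendering. A rough Python sketch of the idea, operating on the workflow CR as a dict (the actual fix lives in the TypeScript frontend):

```python
def filter_virtual_retry_nodes(workflow: dict) -> dict:
    """Return workflow.status.nodes without the virtual "Retry" nodes,
    which have no backing pod and therefore no logs to show."""
    nodes = workflow.get("status", {}).get("nodes", {})
    return {
        node_id: node
        for node_id, node in nodes.items()
        if node.get("type") != "Retry"
    }
```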
Thank you for the detailed investigation! I still don't fully understand the topology here, though.
Do not render Retry nodes as they are virtual nodes with no physical counterpart - e.g. pod logs. Fixes kubeflow#4425 kubeflow#2705 (kubeflow#4474)
* Fix kubeflow#4425 kubeflow#2705: do not render Retry nodes as they are virtual nodes with no physical counterpart - e.g. pod logs.
* Add unit test for filtering out virtual retry nodes
What steps did you take:
Made a simple KFP run using an MNIST PyTorch training/evaluation script.
What happened:
When making a KFP run, occasionally the UI does not properly retrieve logs, and gives the error
Failed to retrieve pod logs.
By inspecting the Chrome browser's network tab, it appears to be an API call issue with the following error: Could not get main container logs: Error: Unable to retrieve workflow status: [object Object].
What did you expect to happen:
All the steps in the pipeline UI should correctly return logs.
Environment:
How did you deploy Kubeflow Pipelines (KFP)?
Deployed KFP on Amazon EKS infrastructure.
KFP version:
v1.0.0
KFP SDK version:
v1.0.0
/kind bug