Warning: failed to retrieve pod logs. Possible reasons include cluster autoscaling or pod preemption #2705
Comments
@IreneGi would you mind attaching a screenshot as well? Thanks!
@IreneGi is this issue transient, or does it persist after several attempts? Thanks.
It is transient but frequent (I've hit it a few times in the last 24 hours).
As replied in b/145802412, I'm closing this one.
Hi @rmgogogo, what does
@rmgogogo can you please post that reply in this thread so people can see it directly?
The error message is expected; the workflow's corresponding pods are GCed after 1 day by default, and then you'll see this error message. If you're on Google Cloud, you can still see the logs in Stackdriver.
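For anyone hitting this on Google Cloud, here is a minimal sketch of pulling a GCed pod's container logs back out of Cloud Logging (Stackdriver); the pod name and the `kubeflow` namespace below are placeholders, not values from this thread:

```sh
# Query Cloud Logging for the container logs of a pod that has since been GCed.
# POD_NAME and the "kubeflow" namespace are placeholders - adjust to your run.
gcloud logging read \
  'resource.type="k8s_container" AND resource.labels.namespace_name="kubeflow" AND resource.labels.pod_name="POD_NAME"' \
  --limit=100 \
  --format='value(textPayload)'
```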
@Bobgy makes sense. However, sometimes I see the error
One thing that might be helpful is to check the node status in the GKE node pool. If the pod is associated with an unhealthy node (e.g. one undergoing maintenance or an upgrade), then it'll be temporarily unavailable. Do you see this symptom persisting, or is it transient?
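For reference, a quick sketch of how node health can be checked from the command line; NODE_NAME and POD_NAME are placeholders, and the `kubeflow` namespace is assumed:

```sh
# List nodes and their readiness status.
kubectl get nodes

# Inspect the conditions and recent events of a suspect node.
kubectl describe node NODE_NAME

# Show which node a workflow pod was scheduled on.
kubectl -n kubeflow get pod POD_NAME -o wide
```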
Thanks for your response! Actually, we don't use GKE, but we notice this symptom persists. For some reason, the affected pods are garbage collected but the healthy pods (which show healthy logs) are not. Any possible explanation for this behavior?
Ummmm... I think this is more of a K8s behavior, which can be tricky to diagnose from a pipeline perspective. @Bobgy do you have any suggestions? Or did you configure the GC policy of your k8s? https://kubernetes.io/docs/concepts/cluster-administration/kubelet-garbage-collection/
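As a hedged starting point for figuring out what is cleaning the pods up, you could inspect the Argo workflow controller config and the KFP persistence agent; the resource names below assume a default KFP install in the `kubeflow` namespace:

```sh
# Argo workflow controller settings (pod/workflow GC, artifact repository, etc.).
kubectl -n kubeflow get configmap workflow-controller-configmap -o yaml

# KFP persistence agent, which garbage-collects finished workflows after a TTL
# in default installs.
kubectl -n kubeflow get deployment ml-pipeline-persistenceagent -o yaml
```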
Thanks again for the response. What I can say is that the
I've noticed that in the
pipelines/frontend/src/pages/RunDetails.tsx, line 478 at commit 22b7b99
Yes, it might not be getting the correct retried pod. /Reopen |
@Bobgy: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I have investigated this and reported it in #4425. This is a UI bug. Hence
…e virtual nodes with no physical counterpart - e.g. pod logs.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
…l nodes. Fixes kubeflow#4425 kubeflow#2705 (kubeflow#4474)
* Fix kubeflow#4425 kubeflow#2705: do not render Retry nodes, as they are virtual nodes with no physical counterpart - e.g. pod logs.
* Add unit test for filtering out virtual retry nodes
Hello all,

Something like this issue (error:

We use S3 as a log backend, and can confirm that the logs are correctly populated in S3 despite the KFP UI error. So users of our KFP instance who don't know how to look up the log path directly in S3 will retry the run, since they can't see the logs from the UI.

Screenshots:

This type of error occurs intermittently, and eventually affects all runs after the pods are cleaned up. It appears that the logs are correctly archived to S3 by Argo and are accessible from the KFP UI before the Kubernetes pod is cleaned up, but not after the Kubernetes pod is removed.

We're using KFP (standalone) version 1.5.0.

@Bobgy @eterna2 Do either of you know of a workaround so that we can have our KFP UI look up the logs in S3 even after the pod has been removed? Or, @Bobgy, if you are looking for PRs to fix this, can you point me to the correct part of the codebase where we might be able to edit this logic and allow the KFP UI to use the S3 path that Argo archived the logs to?

cc @Jeffwan @PatrickXYS, who Bobgy mentioned work for AWS and may be interested since this affects S3 integration.

Thanks in advance, all. 🙏
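Not an official fix, but as an interim workaround the archived logs can be read straight from S3 with the AWS CLI; BUCKET, PREFIX, and POD_NAME below are placeholders that depend on how your Argo artifact repository (archiveLogs) is configured:

```sh
# List archived workflow logs for a run; with archiveLogs enabled, Argo stores
# the main container log as main.log under the artifact repository key.
aws s3 ls s3://BUCKET/PREFIX/ --recursive | grep main.log

# Print one pod's archived log to stdout.
aws s3 cp s3://BUCKET/PREFIX/POD_NAME/main.log -
```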
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi @lucinvitae,
@speaknowpotato I'm not sure, but I haven't seen this issue in a while.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hello, I got the same issue today running Kubeflow 1.4.1 on AWS. This resolved it for me: `kubectl rollout restart deployment/ml-pipeline-ui -n kubeflow`. Cheers!
Worked perfectly, thanks!
Same for me; worked perfectly! :)
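If you try the restart above, a quick way to confirm the rollout finished (namespace taken from the command above):

```sh
# Wait for the restarted UI deployment to become available again.
kubectl rollout status deployment/ml-pipeline-ui -n kubeflow
```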
Closing this issue. No updates for a year. /close |
@rimolive: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What happened:
Tried to rerun the taxi pipeline, clicked on Logs, and got:
Warning: failed to retrieve pod logs. Possible reasons include cluster autoscaling or pod preemption
What did you expect to happen:
To see the logs.
What steps did you take:
Tried to rerun the taxi pipeline and clicked on Logs.
Anything else you would like to add: