
Warning: failed to retrieve pod logs. Possible reasons include cluster autoscaling or pod preemption #2705

Closed
IreneGi opened this issue Dec 6, 2019 · 29 comments

Comments

@IreneGi
Contributor

IreneGi commented Dec 6, 2019

What happened:
Tried to rerun the taxi pipeline, clicked on Logs, and got:
Warning: failed to retrieve pod logs. Possible reasons include cluster autoscaling or pod preemption
What did you expect to happen:
To see the logs.
What steps did you take:
Tried to rerun the taxi pipeline and clicked on Logs.

@neuromage
Contributor

@IreneGi would you mind attaching a screenshot as well? Thanks!

@numerology

@IreneGi is this issue transient, or does it persist after several attempts? Thanks.

@IreneGi
Contributor Author

IreneGi commented Dec 6, 2019

It is transient but frequent (I've hit it a few times in the last 24 hours)

@IreneGi
Contributor Author

IreneGi commented Dec 6, 2019

[Screenshot of the error attached]

@rmgogogo
Contributor

rmgogogo commented Dec 7, 2019

As replied in b/145802412, I am closing this one.

@rmgogogo rmgogogo closed this as completed Dec 7, 2019
@mvk

mvk commented Jan 26, 2020

Hi @rmgogogo, what does "as replied in b/145802412" mean?

@numerology

Hi @rmgogogo, what does "as replied in b/145802412" mean?

Hi @mvk, sorry, it's a link to an internal doc. Anyway, this specific issue is expected to be transient and GKE-related.

@hsezhiyan

@rmgogogo can you please post that reply in this thread so people can see it directly?

@Bobgy
Contributor

Bobgy commented Aug 27, 2020

The error message is expected: a workflow's corresponding pods are garbage collected after 1 day by default, and then you'll see this error message.

If you are on Google Cloud, you can still see the logs in Stackdriver.
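
For reference, a minimal sketch of pulling those logs out of Cloud Logging (Stackdriver) with gcloud, assuming you know the pod name from the run details page (my-pod-name is a placeholder):

# Read the container logs of a garbage-collected pod from Cloud Logging
gcloud logging read 'resource.type="k8s_container" AND resource.labels.pod_name="my-pod-name"' --limit=100 --format='value(textPayload)'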

@hsezhiyan

@Bobgy makes sense. However, sometimes I see the error Failed to retrieve pod logs. immediately after I create a run, for only some pods in the pipeline (others are unaffected). Do you know what could cause this?

@numerology

@hsezhiyan

One thing that might be helpful to confirm is the node status in the GKE node pool. If the pod is scheduled on an unhealthy node (e.g. one undergoing maintenance or upgrading), its logs will be temporarily unavailable.

Do you see this symptom persist, or is it transient?
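
For example, a quick way to check node health from the command line (the node name is a placeholder):

# List nodes and their readiness
kubectl get nodes
# Inspect conditions and recent events on a suspect node
kubectl describe node my-node-name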

@hsezhiyan

@numerology

Thanks for your response! Actually, we don't use GKE, but we notice this symptom persists. For some reason, the affected pods are garbage collected but the healthy pods (which show healthy logs) are not. Any possible explanation for this behavior?

@numerology

@hsezhiyan

Ummmm... I think this is more of a Kubernetes behavior, which can be tricky to diagnose from the pipeline's perspective. @Bobgy do you have some suggestions? Or did you configure the GC policy of your k8s cluster? https://kubernetes.io/docs/concepts/cluster-administration/kubelet-garbage-collection/
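
If the pods really are being deleted, the events recorded for them can sometimes tell you by what, e.g. (the pod name and namespace are placeholders):

# Show events associated with the missing pod, if they have not expired yet
kubectl get events -n kubeflow --field-selector involvedObject.name=my-pod-name
# Confirm whether the pod object still exists at all
kubectl get pod my-pod-name -n kubeflow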

@hsezhiyan

@neuromage

Thanks again for the response. What I can say is that the workflow is not garbage collected, but certain pods within that workflow are... I'll check our GC policy now

@hsezhiyan

hsezhiyan commented Aug 27, 2020

@neuromage

I've noticed that in the workflow, the pods with failed logging are of type: Retry whereas the functioning pods are of type: Pod. It seems like the KFP UI is not getting the logs from the retried Pods. Can you point me to where the UI pulls the logs from the pods?

@Bobgy
Contributor

Bobgy commented Aug 28, 2020

selectedNodeDetails.phase !== NodePhase.SKIPPED && (

Yes, it might not be getting the correct retried pod.
PR welcomed

/Reopen

@k8s-ci-robot k8s-ci-robot reopened this Aug 28, 2020
@k8s-ci-robot
Contributor

@Bobgy: Reopened this issue.

In response to this:

selectedNodeDetails.phase !== NodePhase.SKIPPED && (

Yes, it might not be getting the correct retried pod.
PR welcomed

/Reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@eterna2
Contributor

eterna2 commented Sep 3, 2020

@Bobgy

I have investigated this and reported it in #4425

This is a UI bug. Retry nodes are not actual physical pods. Any op with a retry limit will always have a Retry node plus one or more Pod nodes.

Hence Retry nodes will not have pod info or pod logs. They may or may not have additional metadata (like artifacts, inputs, and outputs), but these will be exactly the same as those of the actual Pod node.
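
A minimal way to see this on a live workflow is to list the node types in the Workflow CR status, e.g. (the workflow name and namespace are placeholders; jq is assumed to be installed):

# Retry nodes have no corresponding pod; their child Pod nodes do
kubectl get workflow my-workflow-name -n kubeflow -o json | jq '.status.nodes[] | {displayName, type}'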

eterna2 added a commit to e2forks/pipelines that referenced this issue Sep 8, 2020
…e virtual nodes with no physical counterpart - e.g. pod logs.
k8s-ci-robot pushed a commit that referenced this issue Sep 12, 2020
…l nodes. Fixes #4425 #2705 (#4474)

* Fix #4425 #2705: do not render Retry nodes as they are virtual nodes with no physical counterpart - e.g. pod logs.

* Add unit test for filtering our virtual retry node
@stale

stale bot commented Dec 4, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Dec 4, 2020
Jeffwan pushed a commit to Jeffwan/pipelines that referenced this issue Dec 9, 2020
…l nodes. Fixes kubeflow#4425 kubeflow#2705 (kubeflow#4474)

* Fix kubeflow#4425 kubeflow#2705: do not render Retry nodes as they are virtual nodes with no physical counterpart - e.g. pod logs.

* Add unit test for filtering our virtual retry node
@lucinvitae

Hello all,

Something like this issue (error: Failed to retrieve pod logs.) is frequently occurring for our cluster. We don't see Possible reasons include cluster autoscaling or pod preemption in the UI, but this typically happens for pipelines that depend on GPU nodes which are added to the cluster by our autoscaler.

We use S3 as a log backend, and can confirm that the logs are correctly populated in S3 despite the KFP UI error. So users of our KFP instance who don't know how to look up the log path directly in S3 will retry the run, since they can't see the logs in the UI. Screenshots:

[Screenshot: Failed to retrieve pod logs error shown in the KFP UI]

This type of error occurs intermittently, and eventually affects all runs after the pods are cleaned up. It appears that the logs are correctly archived to S3 by Argo and are accessible from the KFP UI before the Kubernetes pod is cleaned up, but not after the Kubernetes pod is removed.

We're using KFP (standalone) Version: 1.5.0

@Bobgy @eterna2 Do either of you know of a workaround so that we can have our KFP UI look up the logs in S3 even after the pod has been removed? Or, @Bobgy if you are looking for PRs to fix this, can you point me to the correct part of the codebase where we might be able to edit this logic and allow the KFP UI to use the S3 path that Argo archived the logs to?
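
For example, would something along these lines be the right direction, assuming the UI server honors the ARGO_ARCHIVE_* settings described in frontend/server/configs.ts (variable names unverified against 1.5.0; the bucket name is a placeholder)?

# Hypothetical: point the UI at Argo's archived logs in the artifact store
kubectl -n kubeflow set env deployment/ml-pipeline-ui ARGO_ARCHIVE_LOGS=true ARGO_ARCHIVE_BUCKET_NAME=my-kfp-logs-bucket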

cc @Jeffwan @PatrickXYS, who @Bobgy mentioned work for AWS and may be interested, since this affects the S3 integration.

Thanks in advance, all. 🙏

@stale stale bot removed the lifecycle/stale label Jun 17, 2021
@stale

stale bot commented Oct 2, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Oct 2, 2021
@speaknowpotato

Hi @lucinvitae,
our team also has the same issue; did you get it fixed?
Thanks!

@stale stale bot removed the lifecycle/stale label Dec 26, 2021
@lucinvitae

Hi @lucinvitae, our team also has the same issue; did you get it fixed? Thanks!

@speaknowpotato I'm not sure, but I haven't seen this issue in a while.

@stale

stale bot commented Apr 17, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Apr 17, 2022
@stale stale bot removed the lifecycle/stale label Jan 17, 2023
@AlexandreBrown

Hello, I got the same issue today running Kubeflow 1.4.1 on AWS.
I fixed this issue by restarting the ml-pipeline-ui deployment in the kubeflow namespace.

kubectl rollout restart deployment/ml-pipeline-ui -n kubeflow

Cheers!

@MinhManPham

Hello, I got the same issue today running Kubeflow 1.4.1 on AWS. I fixed this issue by restarting the ml-pipeline-ui deployment in the kubeflow namespace.

kubectl rollout restart deployment/ml-pipeline-ui -n kubeflow

Cheers!

Worked perfectly, thanks!

@odovad

odovad commented Aug 30, 2023

Hello, I got the same issue today running Kubeflow 1.4.1 on AWS. I fixed this issue by restarting the ml-pipeline-ui deployment in the kubeflow namespace.

kubectl rollout restart deployment/ml-pipeline-ui -n kubeflow

Cheers!

Same for me; worked perfectly! :)

@rimolive
Member

Closing this issue. No updates for a year.

/close

@k8s-ci-robot

@rimolive: Closing this issue.

In response to this:

Closing this issue. No updates for a year.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
