
Warning: failed to retrieve pod logs. Possible reasons include cluster autoscaling or pod preemption #2705

Closed
IreneGi opened this issue Dec 6, 2019 · 29 comments

Comments

@IreneGi
Contributor

IreneGi commented Dec 6, 2019

What happened:
Tried to rerun the taxi pipeline, clicked on Logs, and got:
Warning: failed to retrieve pod logs. Possible reasons include cluster autoscaling or pod preemption
What did you expect to happen:
To see the logs.
What steps did you take:
Tried to rerun the taxi pipeline and clicked on Logs.

@neuromage
Contributor

@IreneGi would you mind attaching a screenshot as well? Thanks!

@numerology

@IreneGi is this issue transient, or does it persist after several attempts? Thanks.

@IreneGi
Contributor Author

IreneGi commented Dec 6, 2019

It is transient but frequent (I've hit it a few times in the last 24 hours)

@IreneGi
Contributor Author

IreneGi commented Dec 6, 2019

[Screenshot of the error attached]

@rmgogogo
Contributor

rmgogogo commented Dec 7, 2019

As replied in b/145802412, I am closing this one.

@rmgogogo rmgogogo closed this as completed Dec 7, 2019
@mvk

mvk commented Jan 26, 2020

Hi @rmgogogo, what does "as replied in b/145802412" mean?

@numerology

Hi @rmgogogo, what does "as replied in b/145802412" mean?

Hi @mvk, sorry, it's a link to an internal doc. Anyway, this specific issue is expected to be transient and GKE-related.

@hsezhiyan

@rmgogogo can you please post that reply in this thread so people can see it directly?

@Bobgy
Contributor

Bobgy commented Aug 27, 2020

The error message is expected: a workflow's corresponding pods are garbage collected after 1 day by default, and then you'll see this error message.

If you are on Google Cloud, you can still see the logs in Stackdriver.
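
For reference, a minimal sketch of pulling those logs out of Cloud Logging (Stackdriver) with gcloud, assuming you know the pod name from the run details page (my-pod-name is a placeholder):

# Read the container logs of a garbage-collected pod from Cloud Logging
gcloud logging read 'resource.type="k8s_container" AND resource.labels.pod_name="my-pod-name"' --limit=100 --format='value(textPayload)'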

@hsezhiyan

@Bobgy makes sense. However, sometimes I see the error Failed to retrieve pod logs. immediately after I create a run, for only some pods in the pipeline (others are unaffected). Do you know what could cause this?

@numerology

@hsezhiyan

One thing that might be helpful to confirm is the node status in the GKE node pool. If the pod is scheduled on an unhealthy node (e.g. one undergoing maintenance or upgrading), its logs will be temporarily unavailable.

Do you see this symptom persist, or is it transient?
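
For example, a quick way to check node health from the command line (the node name is a placeholder):

# List nodes and their readiness
kubectl get nodes
# Inspect conditions and recent events on a suspect node
kubectl describe node my-node-name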

@hsezhiyan

@numerology

Thanks for your response! Actually, we don't use GKE, but we notice this symptom persists. For some reason, the affected pods are garbage collected but the healthy pods (which show healthy logs) are not. Any possible explanation for this behavior?

@numerology

@hsezhiyan

Ummmm... I think this is more of a Kubernetes behavior, which can be tricky to diagnose from the pipeline's perspective. @Bobgy do you have some suggestions? Or did you configure the GC policy of your k8s cluster? https://kubernetes.io/docs/concepts/cluster-administration/kubelet-garbage-collection/
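
If the pods really are being deleted, the events recorded for them can sometimes tell you by what, e.g. (the pod name and namespace are placeholders):

# Show events associated with the missing pod, if they have not expired yet
kubectl get events -n kubeflow --field-selector involvedObject.name=my-pod-name
# Confirm whether the pod object still exists at all
kubectl get pod my-pod-name -n kubeflow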

@hsezhiyan

@neuromage

Thanks again for the response. What I can say is that the workflow is not garbage collected, but certain pods within that workflow are... I'll check our GC policy now

@hsezhiyan

hsezhiyan commented Aug 27, 2020

@neuromage

I've noticed that in the workflow, the pods with failed logging are of type: Retry whereas the functioning pods are of type: Pod. It seems like the KFP UI is not getting the logs from the retried Pods. Can you point me to where the UI pulls the logs from the pods?

@Bobgy
Contributor

Bobgy commented Aug 28, 2020

selectedNodeDetails.phase !== NodePhase.SKIPPED && (

Yes, it might not be getting the correct retried pod.
PR welcomed

/Reopen

@k8s-ci-robot k8s-ci-robot reopened this Aug 28, 2020
@k8s-ci-robot
Contributor

@Bobgy: Reopened this issue.

In response to this:

selectedNodeDetails.phase !== NodePhase.SKIPPED && (

Yes, it might not be getting the correct retried pod.
PR welcomed

/Reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@eterna2
Contributor

eterna2 commented Sep 3, 2020

@Bobgy

I have investigated this and reported it in #4425

This is a UI bug. Retry nodes are not actual physical pods. Any op with a retry limit will always have a Retry node plus one or more Pod nodes.

Hence Retry nodes will not have pod info or pod logs. They may or may not have additional metadata (like artifacts, inputs, and outputs), but these will be exactly the same as those of the actual Pod node.
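
A minimal way to see this on a live workflow is to list the node types in the Workflow CR status, e.g. (the workflow name and namespace are placeholders; jq is assumed to be installed):

# Retry nodes have no corresponding pod; their child Pod nodes do
kubectl get workflow my-workflow-name -n kubeflow -o json | jq '.status.nodes[] | {displayName, type}'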

eterna2 added a commit to e2forks/pipelines that referenced this issue Sep 8, 2020
…e virtual nodes with no physical counterpart - e.g. pod logs.
k8s-ci-robot pushed a commit that referenced this issue Sep 12, 2020
…l nodes. Fixes #4425 #2705 (#4474)

* Fix #4425 #2705: do not render Retry nodes as they are virtual nodes with no physical counterpart - e.g. pod logs.

* Add unit test for filtering our virtual retry node
@stale

stale bot commented Dec 4, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Dec 4, 2020
Jeffwan pushed a commit to Jeffwan/pipelines that referenced this issue Dec 9, 2020
…l nodes. Fixes kubeflow#4425 kubeflow#2705 (kubeflow#4474)

* Fix kubeflow#4425 kubeflow#2705: do not render Retry nodes as they are virtual nodes with no physical counterpart - e.g. pod logs.

* Add unit test for filtering our virtual retry node
@lucinvitae

Hello all,

Something like this issue (error: Failed to retrieve pod logs.) is frequently occurring for our cluster. We don't see Possible reasons include cluster autoscaling or pod preemption in the UI, but this typically happens for pipelines that depend on GPU nodes which are added to the cluster by our autoscaler.

We use S3 as a log backend, and can confirm that the logs are correctly populated in S3 despite the KFP UI error. So users of our KFP instance who don't know how to look up the log path directly in S3 will retry the run, since they can't see the logs in the UI. Screenshots:

[Screenshot: Failed to retrieve pod logs error shown in the KFP UI]

This type of error occurs intermittently, and eventually affects all runs after the pods are cleaned up. It appears that the logs are correctly archived to S3 by Argo and are accessible from the KFP UI before the Kubernetes pod is cleaned up, but not after the Kubernetes pod is removed.

We're using KFP (standalone) Version: 1.5.0

@Bobgy @eterna2 Do either of you know of a workaround so that we can have our KFP UI look up the logs in S3 even after the pod has been removed? Or, @Bobgy if you are looking for PRs to fix this, can you point me to the correct part of the codebase where we might be able to edit this logic and allow the KFP UI to use the S3 path that Argo archived the logs to?
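
For example, would something along these lines be the right direction, assuming the UI server honors the ARGO_ARCHIVE_* settings described in frontend/server/configs.ts (variable names unverified against 1.5.0; the bucket name is a placeholder)?

# Hypothetical: point the UI at Argo's archived logs in the artifact store
kubectl -n kubeflow set env deployment/ml-pipeline-ui ARGO_ARCHIVE_LOGS=true ARGO_ARCHIVE_BUCKET_NAME=my-kfp-logs-bucket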

cc @Jeffwan @PatrickXYS, who @Bobgy mentioned work for AWS and may be interested, since this affects the S3 integration.

Thanks in advance, all. 🙏

@stale stale bot removed the lifecycle/stale label Jun 17, 2021
@stale

stale bot commented Oct 2, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Oct 2, 2021
@speaknowpotato

Hi @lucinvitae,
our team also has the same issue; did you get it fixed?
Thanks!

@stale stale bot removed the lifecycle/stale label Dec 26, 2021
@lucinvitae

Hi @lucinvitae, our team also has the same issue; did you get it fixed? Thanks!

@speaknowpotato I'm not sure, but I haven't seen this issue in a while.

@stale

stale bot commented Apr 17, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Apr 17, 2022
@stale stale bot removed the lifecycle/stale label Jan 17, 2023
@AlexandreBrown

Hello, I got the same issue today running Kubeflow 1.4.1 on AWS.
I fixed this issue by restarting the ml-pipeline-ui deployment in the kubeflow namespace.

kubectl rollout restart deployment/ml-pipeline-ui -n kubeflow

Cheers!

@MinhManPham

Hello, I got the same issue today running Kubeflow 1.4.1 on AWS. I fixed this issue by restarting the ml-pipeline-ui deployment in the kubeflow namespace.

kubectl rollout restart deployment/ml-pipeline-ui -n kubeflow

Cheers!

Worked perfectly, thanks!

@odovad

odovad commented Aug 30, 2023

Hello, I got the same issue today running Kubeflow 1.4.1 on AWS. I fixed this issue by restarting the ml-pipeline-ui deployment in the kubeflow namespace.

kubectl rollout restart deployment/ml-pipeline-ui -n kubeflow

Cheers!

Same for me; worked perfectly! :)

@rimolive
Member

Closing this issue. No updates for a year.

/close

@k8s-ci-robot

@rimolive: Closing this issue.

In response to this:

Closing this issue. No updates for a year.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
