ml-pipeline-persistenceagent restarts forever #741
Looks like the ml-pipeline API server is failing. Maybe it has something to do with the fact that it's on-prem? Can you please report back what the API server logs say?
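For reference, the API server logs can be tailed with something like the following sketch (the `kubeflow` namespace and `app=ml-pipeline` label selector are assumptions; adjust them to your deployment):

```shell
# Locate the ml-pipeline API server pod and tail its logs.
# Namespace and label selector are assumptions, not taken from this thread.
kubectl -n kubeflow get pods -l app=ml-pipeline
kubectl -n kubeflow logs -l app=ml-pipeline --tail=100
```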
Sorry for the late response. I see an empty response when I tail the ml-pipeline API server logs.
Thanks for helping out.
@nareshganesan is the issue still happening? Can you see the error logs of the persistence agent? /assign @nareshganesan
Yeah, the issue is still happening.
persistence agent logs:
Please let me know. Thanks for helping out!
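The persistence agent logs and restart count mentioned above can be pulled with something like this sketch (namespace and label selector are assumptions; adjust to your deployment):

```shell
# Tail the persistence agent's logs and check its restart count.
# Namespace and label selector are assumptions, not taken from this thread.
kubectl -n kubeflow logs -l app=ml-pipeline-persistenceagent --tail=100
kubectl -n kubeflow get pod -l app=ml-pipeline-persistenceagent
```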
@nareshganesan If you verified the API server actually starts up and runs, it might be caused by a DNS resolution failure. Some links might be helpful for debugging.
Thanks for your inputs. The DNS service was the issue; it was not able to resolve my ml-pipeline-persistenceagent pod. Our current cluster was spun up using kubeadm (Kubernetes v1.11.6) without the CoreDNS feature gate flag. To validate:
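A common way to validate in-cluster DNS resolution, following the standard Kubernetes DNS-debugging approach (the `ml-pipeline` service name and `kubeflow` namespace are assumptions for this cluster), is:

```shell
# Launch a throwaway busybox pod and resolve the ml-pipeline service name.
# A failure here points at CoreDNS/kube-dns, not the pipeline components.
kubectl run dns-test --rm -it --image=busybox:1.28 --restart=Never -- \
  nslookup ml-pipeline.kubeflow.svc.cluster.local
```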
Thanks @neuromage @paveldournov @IronPan 👍 I'll close the issue.
Thanks for the update @nareshganesan! That'll be useful for us when debugging issues like this in the future as well. |
Kubeflow v0.4.1
On Prem
All other components work well, but the ml-pipeline-persistenceagent keeps restarting forever.
Steps:
Logs from the ml-pipeline-persistenceagent pod:
It has restarted a couple of times on a fresh cluster already.
I manually hit the health check URL from one of my pods, and it works!
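A sketch of that manual health check, assuming the KFP API server service name and its default port 8888 (both assumptions for this cluster; `<your-pod>` is a placeholder for any running pod):

```shell
# From inside an existing pod, hit the API server's health endpoint.
# Service name, port, and path are assumptions; adjust to your deployment.
kubectl -n kubeflow exec <your-pod> -- \
  wget -qO- http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/healthz
```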
Please let me know.