ml-pipeline-persistenceagent fails a few times. #624

Closed
TimZaman opened this issue Jan 4, 2019 · 13 comments · Fixed by #633

@TimZaman (Contributor) commented Jan 4, 2019

Kubeflow v0.4.0-rc.3
On GKE (with official doc instructions for CLI setup)

Everything sets up well, but the ml-pipeline-persistenceagent pod has failed 4 times; on the fifth retry the pod is up and running. Here's a log from one of its failures.

If this is nothing to worry about, feel free to just close this issue.

$ kubectl logs ml-pipeline-persistenceagent-9ff99498c-jbq44 -p
W0104 17:36:19.094203       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x1072ecd]

goroutine 1 [running]:
github.com/kubeflow/pipelines/backend/src/common/util.WaitForAPIAvailable.func1(0xc0002ff560, 0xc000302060)
	/go/src/github.com/kubeflow/pipelines/backend/src/common/util/service.go:38 +0x12d
github.com/kubeflow/pipelines/vendor/github.com/cenkalti/backoff.RetryNotify(0xc000405020, 0x14d9e60, 0xc0002ff560, 0x0, 0xc000269500, 0xc0002ff560)
	/go/src/github.com/kubeflow/pipelines/vendor/github.com/cenkalti/backoff/retry.go:37 +0xa2
github.com/kubeflow/pipelines/vendor/github.com/cenkalti/backoff.Retry(0xc000405020, 0x14d9e60, 0xc0002ff560, 0x2b, 0x70)
	/go/src/github.com/kubeflow/pipelines/vendor/github.com/cenkalti/backoff/retry.go:25 +0x48
github.com/kubeflow/pipelines/backend/src/common/util.WaitForAPIAvailable(0x1bf08eb000, 0x1371890, 0xd, 0xc000450ff0, 0x2b, 0xc000451020, 0x2b)
	/go/src/github.com/kubeflow/pipelines/backend/src/common/util/service.go:47 +0xb7
github.com/kubeflow/pipelines/backend/src/agent/persistence/client.NewPipelineClient(0xc00004440e, 0x8, 0x1bf08eb000, 0xdf8475800, 0x1371890, 0xd, 0x1370308, 0xb, 0x136817f, 0x4, ...)
	/go/src/github.com/kubeflow/pipelines/backend/src/agent/persistence/client/pipeline_client.go:60 +0x349
main.main()
	/go/src/github.com/kubeflow/pipelines/backend/src/agent/persistence/main.go:83 +0x287
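
For reference, the trace points at the health-check retry loop in util.WaitForAPIAvailable, which runs under cenkalti/backoff (fixed by #633). The snippet below is only a minimal sketch of how this kind of nil pointer dereference typically arises in such a closure; everything except the backoff API is an assumption for illustration, not the actual Kubeflow code.

// Sketch only (assumed code, not the real implementation): a health check
// retried with cenkalti/backoff. If the closure read resp.StatusCode before
// checking err, a failed request (resp == nil, e.g. while the API server is
// still starting) would panic with a nil pointer dereference like the one
// above instead of returning the error to be retried.
package main

import (
	"fmt"
	"net/http"
	"time"

	"github.com/cenkalti/backoff"
)

func waitForAPIAvailable(healthzURL string, timeout time.Duration) error {
	operation := func() error {
		resp, err := http.Get(healthzURL)
		if err != nil {
			return err // returning the error lets backoff retry
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("healthz returned %d", resp.StatusCode)
		}
		return nil
	}

	b := backoff.NewExponentialBackOff()
	b.MaxElapsedTime = timeout
	notify := func(err error, d time.Duration) {
		fmt.Printf("API not ready yet (%v), retrying in %v\n", err, d)
	}
	return backoff.RetryNotify(operation, b, notify)
}

func main() {
	url := "http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/healthz"
	if err := waitForAPIAvailable(url, time.Minute); err != nil {
		fmt.Println("ML pipeline API server never became available:", err)
	}
}
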
@yebrahim (Contributor) commented Jan 4, 2019

I saw that too: 4 crashes, then it works fine.
/cc @vicaire
/cc @IronPan.
/area back-end

@neuromage (Contributor)

Let me take a look...
/assign @neuromage

@hamedhsn (Contributor)

@neuromage I get the same error when deploying v0.4.0, and it does not recover.
I am deploying on my k8s cluster on EKS.

@nareshganesan

Hi there,

I'm also facing an issue with the ml-pipeline-persistenceagent pod.

kubeflow: v0.4.1
ml-pipeline-persistenceagent: gcr.io/ml-pipeline/persistenceagent:0.1.7
Cluster: On-Prem

Here is the log,

1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.

Any pointers / suggestions will be really helpful.

Thanks
Naresh Ganesan

@neuromage (Contributor)

That log message is harmless; it just means we're running in-cluster (we should probably reword it). Are you seeing crashes/restarts for this process? What's the output when you run:

$ kubectl -n kubeflow get pods --selector=app=ml-pipeline-persistenceagent

You should see something like this:

NAME                                           READY     STATUS    RESTARTS   AGE
ml-pipeline-persistenceagent-1231ewqe21-dwq2   1/1       Running   0          19h
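
For what it's worth, that warning is just the standard client-go fallback: with neither --master nor --kubeconfig set, the client switches to the pod's in-cluster service-account config. A minimal sketch of that pattern (an assumption about the startup path, not the agent's actual code):

package main

import (
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// With both masterURL and kubeconfig path empty, client-go logs
	// "Neither --kubeconfig nor --master was specified. Using the
	// inClusterConfig. This might not work." and then falls back to the
	// in-cluster config, which is the expected path inside a pod.
	config, err := clientcmd.BuildConfigFromFlags("", "")
	if err != nil {
		log.Fatalf("building Kubernetes client config: %v", err)
	}

	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatalf("creating clientset: %v", err)
	}
	_ = clientset // ready to talk to the cluster API
}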

@gaoning777 (Contributor)

Related issue: #676

@nareshganesan

@neuromage,

$ kubectl -n kubeflow get pods --selector=app=ml-pipeline-persistenceagent
NAME                                            READY     STATUS             RESTARTS   AGE
ml-pipeline-persistenceagent-5669f69cdd-gq29l   0/1       CrashLoopBackOff   204        1d

Thanks for helping out.

@neuromage (Contributor)

Thanks, this looks like #676. I will update that issue when I get to the bottom of the problem.

@neuromage (Contributor)

@nareshganesan did you try Kubeflow 0.4.1? I can't reproduce the ml-pipeline-persistenceagent crashes with that version. I think it's been fixed (0.4.0 may have had the issue you're seeing, though).

@TimZaman (Contributor, Author)

They fixed this in 0.4.1, AFAIK.

@nareshganesan commented Jan 27, 2019

@neuromage, @TimZaman

Thanks for your time.

I'm still facing the issue. Here are my steps.

This is an on-prem cluster.

mkdir ${KUBEFLOW_SRC}
cd ${KUBEFLOW_SRC}
export KUBEFLOW_TAG=v0.4.1

curl https://raw.githubusercontent.com/kubeflow/kubeflow/${KUBEFLOW_TAG}/scripts/download.sh | bash
${KUBEFLOW_SRC}/scripts/kfctl.sh init ${KFAPP} --platform none
cd ${KFAPP}
${KUBEFLOW_SRC}/scripts/kfctl.sh generate k8s
${KUBEFLOW_SRC}/scripts/kfctl.sh apply k8s

Logs from the ml-pipeline-persistenceagent pod:

W0127 08:20:37.093010       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2019-01-27T08:23:01Z" level=fatal msg="Error creating ML pipeline API Server client: Failed to initialize pipeline client. Error: Waiting for ml pipeline API server failed after all attempts.: Get http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/healthz: dial tcp: lookup ml-pipeline.kubeflow.svc.cluster.local: Try again: Waiting for ml pipeline API server failed after all attempts.: Get http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/healthz: dial tcp: lookup ml-pipeline.kubeflow.svc.cluster.local: Try again"

It has already restarted a couple of times on a fresh cluster.

$ kubectl -n kubeflow get pods --selector=app=ml-pipeline-persistenceagent
NAME                                            READY     STATUS             RESTARTS   AGE
ml-pipeline-persistenceagent-5669f69cdd-h62lq   0/1       CrashLoopBackOff   6          26m

I manually hit the health check URL from one of my pods, and it works:

$ kubectl -n kubeflow exec -it mysql-xyxsdf /bin/bash
$ curl http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/healthz
{"commit_sha":"d9a1313b88d9a0db52792016f8faab91f9cb4bae"}

Please let me know.

@TimZaman (Contributor, Author) commented Jan 27, 2019

From your error, it looks like this is unrelated to the current issue. Please create a new, separate issue for your problem: https://github.com/kubeflow/pipelines/issues/new

@nareshganesan

@TimZaman - Thanks, I've created a new issue. #741
