ml-pipeline-persistenceagent fails a few times. #624

Closed
TimZaman opened this issue Jan 4, 2019 · 13 comments · Fixed by #633

@TimZaman (Contributor) commented Jan 4, 2019

Kubeflow v0.4.0-rc.3
On GKE (with official doc instructions for CLI setup)

Everything sets up well, but the ml-pipeline-persistenceagent pod has failed 4 times; on the fifth retry the pod is up and running. Here's a log from one of its failures.

If this is nothing to worry about, feel free to just close this issue.

$ kubectl logs ml-pipeline-persistenceagent-9ff99498c-jbq44 -p
W0104 17:36:19.094203       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x1072ecd]

goroutine 1 [running]:
github.com/kubeflow/pipelines/backend/src/common/util.WaitForAPIAvailable.func1(0xc0002ff560, 0xc000302060)
	/go/src/github.com/kubeflow/pipelines/backend/src/common/util/service.go:38 +0x12d
github.com/kubeflow/pipelines/vendor/github.com/cenkalti/backoff.RetryNotify(0xc000405020, 0x14d9e60, 0xc0002ff560, 0x0, 0xc000269500, 0xc0002ff560)
	/go/src/github.com/kubeflow/pipelines/vendor/github.com/cenkalti/backoff/retry.go:37 +0xa2
github.com/kubeflow/pipelines/vendor/github.com/cenkalti/backoff.Retry(0xc000405020, 0x14d9e60, 0xc0002ff560, 0x2b, 0x70)
	/go/src/github.com/kubeflow/pipelines/vendor/github.com/cenkalti/backoff/retry.go:25 +0x48
github.com/kubeflow/pipelines/backend/src/common/util.WaitForAPIAvailable(0x1bf08eb000, 0x1371890, 0xd, 0xc000450ff0, 0x2b, 0xc000451020, 0x2b)
	/go/src/github.com/kubeflow/pipelines/backend/src/common/util/service.go:47 +0xb7
github.com/kubeflow/pipelines/backend/src/agent/persistence/client.NewPipelineClient(0xc00004440e, 0x8, 0x1bf08eb000, 0xdf8475800, 0x1371890, 0xd, 0x1370308, 0xb, 0x136817f, 0x4, ...)
	/go/src/github.com/kubeflow/pipelines/backend/src/agent/persistence/client/pipeline_client.go:60 +0x349
main.main()
	/go/src/github.com/kubeflow/pipelines/backend/src/agent/persistence/main.go:83 +0x287
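
For reference, the trace points at the health-check retry loop in util.WaitForAPIAvailable, which runs under cenkalti/backoff (fixed by #633). The snippet below is only a minimal sketch of how this kind of nil pointer dereference typically arises in such a closure; everything except the backoff API is an assumption for illustration, not the actual Kubeflow code.

// Sketch only (assumed code, not the real implementation): a health check
// retried with cenkalti/backoff. If the closure read resp.StatusCode before
// checking err, a failed request (resp == nil, e.g. while the API server is
// still starting) would panic with a nil pointer dereference like the one
// above instead of returning the error to be retried.
package main

import (
	"fmt"
	"net/http"
	"time"

	"github.com/cenkalti/backoff"
)

func waitForAPIAvailable(healthzURL string, timeout time.Duration) error {
	operation := func() error {
		resp, err := http.Get(healthzURL)
		if err != nil {
			return err // returning the error lets backoff retry
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("healthz returned %d", resp.StatusCode)
		}
		return nil
	}

	b := backoff.NewExponentialBackOff()
	b.MaxElapsedTime = timeout
	notify := func(err error, d time.Duration) {
		fmt.Printf("API not ready yet (%v), retrying in %v\n", err, d)
	}
	return backoff.RetryNotify(operation, b, notify)
}

func main() {
	url := "http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/healthz"
	if err := waitForAPIAvailable(url, time.Minute); err != nil {
		fmt.Println("ML pipeline API server never became available:", err)
	}
}
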
@yebrahim (Contributor) commented Jan 4, 2019

I saw that too: 4 crashes, then it works fine.
/cc @vicaire
/cc @IronPan.
/area back-end

@neuromage (Contributor)

Let me take a look...
/assign @neuromage

@hamedhsn (Contributor)

@neuromage I get the same error when deploying v0.4.0, and it does not recover.
I am deploying on my k8s cluster on EKS.

@nareshganesan

Hi there,

I'm also facing an issue with the ml-pipeline-persistenceagent pod.

kubeflow: v0.4.1
ml-pipeline-persistenceagent: gcr.io/ml-pipeline/persistenceagent:0.1.7
Cluster: On-Prem

Here is the log,

1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.

Any pointers / suggestions will be really helpful.

Thanks
Naresh Ganesan

@neuromage (Contributor)

That log message is harmless; it just means we're running in-cluster (we should probably reword it). Are you seeing crashes/restarts for this process? What's the output when you run:

$ kubectl -n kubeflow get pods --selector=app=ml-pipeline-persistenceagent

You should see something like this:

NAME                                           READY     STATUS    RESTARTS   AGE
ml-pipeline-persistenceagent-1231ewqe21-dwq2   1/1       Running   0          19h
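
For what it's worth, that warning is just the standard client-go fallback: with neither --master nor --kubeconfig set, the client switches to the pod's in-cluster service-account config. A minimal sketch of that pattern (an assumption about the startup path, not the agent's actual code):

package main

import (
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// With both masterURL and kubeconfig path empty, client-go logs
	// "Neither --kubeconfig nor --master was specified. Using the
	// inClusterConfig. This might not work." and then falls back to the
	// in-cluster config, which is the expected path inside a pod.
	config, err := clientcmd.BuildConfigFromFlags("", "")
	if err != nil {
		log.Fatalf("building Kubernetes client config: %v", err)
	}

	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatalf("creating clientset: %v", err)
	}
	_ = clientset // ready to talk to the cluster API
}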

@gaoning777 (Contributor)

Related issue: #676

@nareshganesan

@neuromage,

$ kubectl -n kubeflow get pods --selector=app=ml-pipeline-persistenceagent
NAME                                            READY     STATUS             RESTARTS   AGE
ml-pipeline-persistenceagent-5669f69cdd-gq29l   0/1       CrashLoopBackOff   204        1d

Thanks for helping out.

@neuromage (Contributor)

Thanks, this looks like #676. I will update that issue when I get to the bottom of the problem.

@neuromage (Contributor)

@nareshganesan did you try Kubeflow 0.4.1? I can't reproduce the ml-pipeline-persistenceagent crashes with that version. I think it's been fixed (0.4.0 may have had the issue you're seeing, though).

@TimZaman (Contributor, Author)

They fixed this in 0.4.1, AFAIK.

@nareshganesan commented Jan 27, 2019

@neuromage, @TimZaman

Thanks for your time.

I'm still facing the issue. Here are my steps.

This is an on-prem cluster.

mkdir ${KUBEFLOW_SRC}
cd ${KUBEFLOW_SRC}
export KUBEFLOW_TAG=v0.4.1

curl https://raw.githubusercontent.com/kubeflow/kubeflow/${KUBEFLOW_TAG}/scripts/download.sh | bash
${KUBEFLOW_SRC}/scripts/kfctl.sh init ${KFAPP} --platform none
cd ${KFAPP}
${KUBEFLOW_SRC}/scripts/kfctl.sh generate k8s
${KUBEFLOW_SRC}/scripts/kfctl.sh apply k8s

Logs from the ml-pipeline-persistenceagent pod:

W0127 08:20:37.093010       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2019-01-27T08:23:01Z" level=fatal msg="Error creating ML pipeline API Server client: Failed to initialize pipeline client. Error: Waiting for ml pipeline API server failed after all attempts.: Get http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/healthz: dial tcp: lookup ml-pipeline.kubeflow.svc.cluster.local: Try again: Waiting for ml pipeline API server failed after all attempts.: Get http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/healthz: dial tcp: lookup ml-pipeline.kubeflow.svc.cluster.local: Try again"

It has already restarted a couple of times on a fresh cluster.

$ kubectl -n kubeflow get pods --selector=app=ml-pipeline-persistenceagent
NAME                                            READY     STATUS             RESTARTS   AGE
ml-pipeline-persistenceagent-5669f69cdd-h62lq   0/1       CrashLoopBackOff   6          26m

I manually hit the health check URL from one of my pods, and it works:

$ kubectl -n kubeflow exec -it mysql-xyxsdf /bin/bash
$ curl http://ml-pipeline.kubeflow.svc.cluster.local:8888/apis/v1beta1/healthz
{"commit_sha":"d9a1313b88d9a0db52792016f8faab91f9cb4bae"}

Please let me know.

@TimZaman (Contributor, Author) commented Jan 27, 2019

From your error, it looks like this is unrelated to the current issue. Please create a new, separate issue for your problem: https://github.com/kubeflow/pipelines/issues/new

@nareshganesan

@TimZaman - Thanks, I've created a new issue. #741
