
Release 1.7 - TFX taxi cab example failing the deploy step #692

Closed
SinaChavoshi opened this issue Jan 16, 2019 · 9 comments

@SinaChavoshi
Contributor

A fresh deployment of the taxi cab TFX example is failing the deploy step with:

```
++ kubectl get po taxi-cab-classification-model-tfx-taxi-cab-classification-5lxrp taxi-cab-classification-model-tfx-taxi-cab-classification-67xc9 --namespace kubeflow -o 'jsonpath={.status.containerStatuses[0].state.running}'
Error from server (NotFound): pods "taxi-cab-classification-model-tfx-taxi-cab-classification-67xc9" not found

+ '[' -z '' ']'
++ date +%s
+ current_time=1547611953
++ expr 1547611953 + 1 - 1547610952
+ elapsed_time=1002
+ [[ 1002 -gt 1000 ]]
+ echo timeout
timeout
+ exit 1
```
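For context, the trace above looks like output from the deployer's wait loop: it polls the serving pod's running state and gives up after 1000 seconds. A rough reconstruction (the 1000-second limit, the expr arithmetic, and the jsonpath query come from the log; the pod variable, poll interval, and exact loop structure are assumptions):

```bash
start_time=$(date +%s)
while true; do
  # Poll the serving pod's running state (POD_NAME is a placeholder).
  running=$(kubectl get po "$POD_NAME" --namespace kubeflow \
    -o 'jsonpath={.status.containerStatuses[0].state.running}') || true
  if [ -n "$running" ]; then
    break
  fi
  current_time=$(date +%s)
  elapsed_time=$(expr $current_time + 1 - $start_time)
  if [[ $elapsed_time -gt 1000 ]]; then
    echo timeout
    exit 1
  fi
  sleep 5   # poll interval is an assumption
done
```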
@SinaChavoshi SinaChavoshi changed the title Release 1.7 - taxi cab example failing the deploy step Release 1.7 - TFX taxi cab example failing the deploy step Jan 16, 2019
@gaoning777
Contributor

Is this a recurring error, or does it happen only rarely? I think I've come across this error in the past.

@SinaChavoshi
Contributor Author

3 runs have consistently failed. Kicking off the 4th run.

@SinaChavoshi
Contributor Author

Quick update: the 4th run also failed.

@swiftdiaries
Member

I think this is what's happening: when you have multiple runs of the deployer step, you don't get unique deployments. You have only one deployment (model-server-v1). So when you try to fetch the TFServing pod corresponding to your run, the script lists all pods created from previous runs, picks the alphanumerically first one, and since that isn't the pod for the current run, the step fails.
This can be worked around by deleting all model-server pods that get created before starting a fresh run, but that makes recurring / parallel / multiple runs hard.
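For reference, the shared pods can be seen by listing everything under the model-server name prefix from the log above (a quick check, not part of the fix):

```bash
# Pods from earlier runs share the same deployment, so they all show up here;
# the pod the deploy script waits on may not be among them anymore.
kubectl get pods -n kubeflow -o name | grep taxi-cab-classification-model-
```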

@gaoning777
Contributor

A temporary fix is to delete the deployment.
run "kubectl get deployment -n kubeflow" and look for the taxi-cab-, then
run "kubectl delete deployment taxi-cab-
-n kubeflow".
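Spelled out as a script (a sketch; it assumes the stale deployment's name starts with taxi-cab- as in this example):

```bash
# Delete any leftover taxi-cab-* model deployments before starting a new run.
NS=kubeflow
for d in $(kubectl get deployment -n "$NS" -o name | grep taxi-cab-); do
  kubectl delete -n "$NS" "$d"
done
```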

@gaoning777
Contributor

In fact, the deployer has been using configurable names for the deployment, and the examples are generating the name using {{workflow.name}}.

@swiftdiaries
Member

Tried this: passed the workflow name as a parameter using the --server-name flag, and the failure doesn't happen anymore.
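To make that concrete, this is roughly what you end up with once --server-name is derived from {{workflow.name}}; the exact deployment naming below is an assumption, only the per-run uniqueness is the point:

```bash
# With --server-name set from the workflow name, each run owns its own
# TF Serving deployment, so lookups no longer match pods from older runs.
WORKFLOW_NAME="tfx-taxi-cab-classification-abc12"   # illustrative value
kubectl get deployment -n kubeflow | grep "$WORKFLOW_NAME"
kubectl get pods -n kubeflow -o name | grep "$WORKFLOW_NAME"
```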

@gaoning777
Contributor

gaoning777 commented Jan 18, 2019

Found the bug: the deployer component truncates the deploy name to 64 bytes, which removes the distinct part of the workflow name and causes the naming collision.
Will send a PR to randomize the deployer name.
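As an illustration of that truncation (the names below are made up, but follow the pattern in the log above):

```bash
# Two runs with distinct workflow-name suffixes collapse to the same
# deployment name once the deploy name is cut to 64 bytes.
name_run1="taxi-cab-classification-model-tfx-taxi-cab-classification-pipeline-abcde"
name_run2="taxi-cab-classification-model-tfx-taxi-cab-classification-pipeline-fghij"
echo "${name_run1:0:64}"   # taxi-cab-classification-model-tfx-taxi-cab-classification-pipeli
echo "${name_run2:0:64}"   # same 64-byte prefix -> the unique suffix is gone
```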

@gaoning777
Contributor

solved in #704

@gaoning777 gaoning777 self-assigned this Jan 22, 2019
Linchin pushed a commit to Linchin/pipelines that referenced this issue Apr 11, 2023
magdalenakuhn17 pushed a commit to magdalenakuhn17/pipelines that referenced this issue Oct 22, 2023
HumairAK pushed a commit to red-hat-data-services/data-science-pipelines that referenced this issue Mar 11, 2024