
Release 1.7 - TFX taxi cab example failing the deploy step #692

Closed
SinaChavoshi opened this issue Jan 16, 2019 · 9 comments

@SinaChavoshi
Contributor

A fresh deployment of the taxi cab TFX example is failing the deploy step with:

```
++ kubectl get po taxi-cab-classification-model-tfx-taxi-cab-classification-5lxrp taxi-cab-classification-model-tfx-taxi-cab-classification-67xc9 --namespace kubeflow -o 'jsonpath={.status.containerStatuses[0].state.running}'
Error from server (NotFound): pods "taxi-cab-classification-model-tfx-taxi-cab-classification-67xc9" not found

+ '[' -z '' ']'
++ date +%s
+ current_time=1547611953
++ expr 1547611953 + 1 - 1547610952
+ elapsed_time=1002
+ [[ 1002 -gt 1000 ]]
+ echo timeout
timeout
+ exit 1
```
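For context, the trace above looks like output from the deployer's wait loop: it polls the serving pod's running state and gives up after 1000 seconds. A rough reconstruction (the 1000-second limit, the expr arithmetic, and the jsonpath query come from the log; the pod variable, poll interval, and exact loop structure are assumptions):

```bash
start_time=$(date +%s)
while true; do
  # Poll the serving pod's running state (POD_NAME is a placeholder).
  running=$(kubectl get po "$POD_NAME" --namespace kubeflow \
    -o 'jsonpath={.status.containerStatuses[0].state.running}') || true
  if [ -n "$running" ]; then
    break
  fi
  current_time=$(date +%s)
  elapsed_time=$(expr $current_time + 1 - $start_time)
  if [[ $elapsed_time -gt 1000 ]]; then
    echo timeout
    exit 1
  fi
  sleep 5   # poll interval is an assumption
done
```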
@SinaChavoshi SinaChavoshi changed the title Release 1.7 - taxi cab example failing the deploy step Release 1.7 - TFX taxi cab example failing the deploy step Jan 16, 2019
@gaoning777
Contributor

Is this a recurring error, or does it happen only rarely? I think I've come across this error in the past.

@SinaChavoshi
Contributor Author

3 runs have consistently failed. Kicking off the 4th run.

@SinaChavoshi
Contributor Author

Quick update: the 4th run also failed.

@swiftdiaries
Member

I think this is what's happening: when you have multiple runs of the deployer step, you don't get unique deployments. You have only one deployment (model-server-v1). So when you try to fetch the TFServing pod corresponding to your run, the script lists all pods created from previous runs, picks the alphanumerically first one, and since that isn't the pod for the current run, the step fails.
This can be worked around by deleting all model-server pods that get created before starting a fresh run, but that makes recurring / parallel / multiple runs hard.
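For reference, the shared pods can be seen by listing everything under the model-server name prefix from the log above (a quick check, not part of the fix):

```bash
# Pods from earlier runs share the same deployment, so they all show up here;
# the pod the deploy script waits on may not be among them anymore.
kubectl get pods -n kubeflow -o name | grep taxi-cab-classification-model-
```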

@gaoning777
Contributor

A temporary fix is to delete the deployment.
run "kubectl get deployment -n kubeflow" and look for the taxi-cab-, then
run "kubectl delete deployment taxi-cab-
-n kubeflow".
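Spelled out as a script (a sketch; it assumes the stale deployment's name starts with taxi-cab- as in this example):

```bash
# Delete any leftover taxi-cab-* model deployments before starting a new run.
NS=kubeflow
for d in $(kubectl get deployment -n "$NS" -o name | grep taxi-cab-); do
  kubectl delete -n "$NS" "$d"
done
```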

@gaoning777
Contributor

In fact, the deployer has been using configurable names for the deployment, and the examples are generating the name using {{workflow.name}}.

@swiftdiaries
Member

Tried this: passed the workflow name as a parameter using the --server-name flag, and the failure doesn't happen anymore.
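To make that concrete, this is roughly what you end up with once --server-name is derived from {{workflow.name}}; the exact deployment naming below is an assumption, only the per-run uniqueness is the point:

```bash
# With --server-name set from the workflow name, each run owns its own
# TF Serving deployment, so lookups no longer match pods from older runs.
WORKFLOW_NAME="tfx-taxi-cab-classification-abc12"   # illustrative value
kubectl get deployment -n kubeflow | grep "$WORKFLOW_NAME"
kubectl get pods -n kubeflow -o name | grep "$WORKFLOW_NAME"
```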

@gaoning777
Contributor

gaoning777 commented Jan 18, 2019

Found the bug: the deployer component truncates the deploy name to 64 bytes, which removes the distinct part of the workflow name and causes the naming collision.
Will send a PR to randomize the deployer name.
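As an illustration of that truncation (the names below are made up, but follow the pattern in the log above):

```bash
# Two runs with distinct workflow-name suffixes collapse to the same
# deployment name once the deploy name is cut to 64 bytes.
name_run1="taxi-cab-classification-model-tfx-taxi-cab-classification-pipeline-abcde"
name_run2="taxi-cab-classification-model-tfx-taxi-cab-classification-pipeline-fghij"
echo "${name_run1:0:64}"   # taxi-cab-classification-model-tfx-taxi-cab-classification-pipeli
echo "${name_run2:0:64}"   # same 64-byte prefix -> the unique suffix is gone
```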

@gaoning777
Contributor

solved in #704

@gaoning777 gaoning777 self-assigned this Jan 22, 2019
Linchin pushed a commit to Linchin/pipelines that referenced this issue Apr 11, 2023
magdalenakuhn17 pushed a commit to magdalenakuhn17/pipelines that referenced this issue Oct 22, 2023
HumairAK pushed a commit to red-hat-data-services/data-science-pipelines that referenced this issue Mar 11, 2024