
Enhance pipeline TFX taxi sample to support on-prem cluster #749

Merged

Conversation

jinchihe
Member

@jinchihe jinchihe commented Jan 29, 2019

fixes: #721

Enhance pipeline TFX taxi sample to support on-prem cluster



@jinchihe
Member Author

/assign @IronPan

@IronPan
Member

IronPan commented Jan 29, 2019

/assign @gaoning777 @Ark-kun

Review threads on samples/tfx/README.md, backend/src/apiserver/config/sample_config.json, and test/sample-test/run_test.sh (outdated; all resolved).
@jinchihe jinchihe force-pushed the update_tfx_taxi_sample_to_add_onprem branch from abf32ac to 6b33f16 Compare January 31, 2019 09:59
@jinchihe
Member Author

jinchihe commented Feb 2, 2019

/test kubeflow-pipeline-e2e-test

@neuromage
Contributor

Thanks for the PR!

Overall, the direction to support on-prem is great. However, I'm not in favour of having two versions of the pipeline, one for GCP and one using NFS, as it increases maintenance cost. Ideally, we should have one pipeline, with a user-configurable option of whether to store artifacts on GCP or NFS. Would it be feasible to explore this approach instead?
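The single-pipeline-with-an-option idea can be illustrated without the KFP SDK. In this stdlib-only sketch, `Op`, `use_gcp_secret`, and `mount_pvc` are stand-ins for the real DSL names, and the step and PVC names are hypothetical:

```python
# Sketch of one pipeline whose storage backend is a user-settable
# option. "Op", "use_gcp_secret", and "mount_pvc" are stand-ins,
# not the real KFP DSL classes/helpers.

class Op:
    def __init__(self, name):
        self.name = name
        self.secrets = []
        self.volumes = {}

def use_gcp_secret(op, secret="user-gcp-sa"):
    op.secrets.append(secret)
    return op

def mount_pvc(op, pvc_name, mount_path="/mnt"):
    op.volumes[mount_path] = pvc_name
    return op

def taxi_pipeline(platform="GCP"):
    steps = [Op("validation"), Op("transform"), Op("train"), Op("deploy")]
    for op in steps:
        if platform == "GCP":
            use_gcp_secret(op)         # GCP: artifacts live in GCS
        else:
            mount_pvc(op, "taxi-pvc")  # on-prem: shared PVC instead
    return steps

gcp = taxi_pipeline("GCP")
onprem = taxi_pipeline("onprem")
```

One pipeline body, one switch — which is the maintenance win being argued for here.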

@gaoning777
Contributor

echoed @neuromage

@gyliu513
Member

@neuromage good point, let me check with @jinchihe for how we can move forward.

@jinchihe
Member Author

Overall, the direction to support on-prem is great. However, I'm not in favour of having two versions of the pipeline, one for GCP and one using NFS, as it increases maintenance cost. Ideally, we should have one pipeline, with a user-configurable option of whether to store artifacts on GCP or NFS. Would it be feasible to explore this approach instead?

@neuromage @gaoning777 @gyliu513

In fact, we discussed the solution in the original issue #721. As @swiftdiaries mentioned there, it's better to have separate example code: the on-prem version wouldn't use GCP secrets, so having a separate tfx_taxi_cab_onprem.py makes sense. We'd also need volumes, so the code would look very different.

@vicaire
Contributor

vicaire commented Feb 13, 2019

@jinchihe, what do you think of the DSL changes proposed at #801 to support volumes in the DSL?

@jinchihe
Member Author

jinchihe commented Feb 14, 2019

@jinchihe, what do you think of the DSL changes proposed at #801 to support volumes in the DSL?

@vicaire Great — that would make supporting local storage easy. So I think we can put this PR on hold for now. Thanks.

@vicaire
Contributor

vicaire commented Feb 14, 2019

OK, thanks. Let's conclude the discussion on #801 and put this PR on hold until then.

@cvenets

cvenets commented Feb 15, 2019

Hello @jinchihe, I was testing your PR today, trying to run the whole example completely on-prem.

Although everything looks good, I'm getting an error in the last step of the pipeline, the kubeflow_deploy_op. The log states:

+ KUBERNETES_NAMESPACE=kubeflow
+ SERVER_NAME=model-server
+ (( 8 ))
+ case $1 in
+ shift
+ CLUSTER_NAME=tfx-taxi-pipeline-on-prem
+ shift
+ (( 6 ))
+ case $1 in
+ shift
+ MODEL_PATH=/mnt/tfx-taxi-cab-classification-pipeline-example-z85z5/train
+ shift
+ (( 4 ))
+ case $1 in
+ shift
+ SERVER_NAME=taxi-cab-classification-model-tfx-taxi-cab-classification-pipeline-example-z85z5
+ shift
+ (( 2 ))
+ case $1 in
+ echo 'Unknown argument: '\''--pvc-name'\'''
Unknown argument: '--pvc-name'
+ exit 1

After digging in a bit, looking into the container image, I understand that the container image for the deploy part, which in your code is image='gcr.io/ml-pipeline/ml-pipeline-kubeflow-deployer:be19cbc2591a48d2ef5ca715c34ecae8223cf454', contains a deploy.sh script that doesn't support either the --model-storage-type or --pvc-name arguments. Thus the error.
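For context, the trace above is the classic shift-based `case` argument parser. A Python sketch of the same pattern (the flag set here is illustrative, not the real deploy.sh) shows why one unrecognized flag aborts the whole step:

```python
# Mirrors the shift/case argument loop visible in the deploy.sh trace:
# each recognized flag consumes itself plus its value; any other flag
# hits the default case and aborts. Flag names here are illustrative.

def parse_deploy_args(argv):
    known = {"--cluster-name", "--model-path", "--server-name"}
    opts, i = {}, 0
    while i < len(argv):
        flag = argv[i]
        if flag in known:
            opts[flag] = argv[i + 1]
            i += 2  # the two `shift`s in the bash version
        else:
            # An older image with no --pvc-name case lands here
            # and exits with a nonzero status.
            raise SystemExit("Unknown argument: '%s'" % flag)
    return opts

parsed = parse_deploy_args(["--cluster-name", "tfx-taxi-pipeline-on-prem"])
```

So the fix has to land in the deployer image itself: a script compiled against the old flag set cannot simply ignore the new arguments.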

Can you point me to the image you used to run the sample successfully?

Thank you,
Constantinos

@jinchihe jinchihe force-pushed the update_tfx_taxi_sample_to_add_onprem branch from 6b33f16 to e159250 Compare February 15, 2019 01:56
@jinchihe
Member Author

Hello @jinchihe, I was testing your PR today, trying to run the whole example completely on-prem. [...] Can you point me to the image you used to run the sample successfully?

Thanks for the tests @cvenets. The problem was fixed in #755, but because this PR was put on hold, the image version was not updated.
I just rebased and updated to the new image. It should be OK now. :-)

@jinchihe jinchihe force-pushed the update_tfx_taxi_sample_to_add_onprem branch from e159250 to 81d5c93 Compare February 15, 2019 02:17
Review threads on test/sample-test/run_test.sh, samples/tfx/taxi-cab-classification-pipeline-on-prem.py, and samples/tfx/README.md (outdated; all resolved).
@jinchihe jinchihe force-pushed the update_tfx_taxi_sample_to_add_onprem branch from 81d5c93 to 7a47e60 Compare February 15, 2019 03:16
@cvenets

cvenets commented Feb 15, 2019

Hello @jinchihe, thank you for updating the pipeline code and image.

Unfortunately, when trying to run the updated pipeline, the first step (validation) now breaks with the following log. This didn't happen with the previous pipeline code; all steps finished successfully except the deployment part. Any thoughts?

I see that I'm running the updated image now: image: gcr.io/ml-pipeline/ml-pipeline-dataflow-tfdv:5df2cdc1ed145320204e8bc73b59cdbd7b3da28f

tfx-taxi-cab-classification-pipeline-example-for-on-prem-cqg2mn-3535214929
This step is in Failed state with this message: failed with exit code 1
    from tensorflow_transform.coders.csv_coder import CsvCoder
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_transform/coders/csv_coder.py", line 26, in <module>
    import tensorflow as tf
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/__init__.py", line 22, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/__init__.py", line 81, in <module>
    from tensorflow.python import keras
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/__init__.py", line 24, in <module>
    from tensorflow.python.keras import activations
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/activations/__init__.py", line 22, in <module>
    from tensorflow.python.keras._impl.keras.activations import elu
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/__init__.py", line 21, in <module>
    from tensorflow.python.keras._impl.keras import activations
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/activations.py", line 23, in <module>
    from tensorflow.python.keras._impl.keras import backend as K
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/backend.py", line 37, in <module>
    from tensorflow.python.layers import base as tf_base_layers
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/base.py", line 25, in <module>
    from tensorflow.python.keras.engine import base_layer
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/__init__.py", line 21, in <module>
    from tensorflow.python.keras.engine.base_layer import InputSpec
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 33, in <module>
    from tensorflow.python.keras import backend
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/backend/__init__.py", line 22, in <module>
    from tensorflow.python.keras._impl.keras.backend import abs
ImportError: cannot import name abs

Thank you,
Constantinos

@jinchihe
Member Author

Seems samples/tfx/taxi-cab-classification-pipeline.py has conflicts again.

Now I think it's better to merge taxi-cab-classification-pipeline-on-prem.py and taxi-cab-classification-pipeline-gcp.py into one file in this PR; that makes maintenance easier and avoids code duplication. @Ark-kun @gaoning777 WDYT?

Personally, I strongly suggest merging this PR, since many people are following it and asking questions via Slack and email about running the sample on-prem (local storage). :-)

@Ark-kun
Contributor

Ark-kun commented May 15, 2019

has conflicts again

Sorry for the interruptions. There was a release yesterday and releases currently change the SHA hashes in the component paths.

Now I think it's better to merge the taxi-cab-classification-pipeline-on-prem.py and taxi-cab-classification-pipeline-gcp.py into one file in the PR, if so, easy maintenance and avoid code duplication. @Ark-kun @gaoning777 WDYT?

It would be great.
#1209 will make it much easier.
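The "apply a function to all ops" idea in #1209 is essentially an op-transformer hook. A stdlib-only sketch of the concept, where `Pipeline` and `Op` are stand-ins for the real KFP classes and the PVC name is hypothetical:

```python
# Sketch of the #1209 idea: register one transformer and have it
# applied to every op when the pipeline is finalized. Pipeline and
# Op are stand-ins, not the actual KFP SDK classes.

class Op:
    def __init__(self, name):
        self.name = name
        self.mounts = {}

class Pipeline:
    def __init__(self, ops):
        self.ops = list(ops)
        self._transformers = []

    def add_op_transformer(self, fn):
        self._transformers.append(fn)

    def finalize(self):
        for op in self.ops:
            for fn in self._transformers:
                fn(op)
        return self.ops

def mount_shared_pvc(op):
    op.mounts["/mnt"] = "taxi-pvc"   # same PVC on every step

p = Pipeline([Op("validate"), Op("transform"), Op("train")])
p.add_op_transformer(mount_shared_pvc)
ops = p.finalize()
```

With a hook like this, the GCP/on-prem difference collapses to registering either a secret-applying or a PVC-mounting transformer once, instead of editing every step.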

@jinchihe
Member Author

@Ark-kun Thanks. here's what I'm thinking. Please review and comment. Thanks a lot!

There are some differences between GCP and on-prem clusters in the parameters and in ContainerOp. For example, I'd like to use the feature in #801 to create a PVC and download the taxi-cab-classification source to the PVC automatically, then attach the PVC to each step via the pvolumes argument of ContainerOp. That means each component will have different parameters between GCP and on-prem, since on-prem will add pvolumes:

    cop = dsl.ContainerOp(
        ...
        pvolumes={"/mnt": vop.volume}
    )

So here's what I'm thinking: add a new function in taxi-cab-classification-pipeline.py, so that after the change one function is for GCP and another is for on-prem, as follows:

...

@dsl.pipeline(
  name='TFX Taxi Cab Classification Pipeline Example for GCP',
  description='Example pipeline that does classification with model analysis based on a public BigQuery dataset for GCP.'
)
def taxi_cab_classification_gcp(
    output,
    project,
    column_names='gs://ml-pipeline-playground/tfx/taxi-cab-classification/column-names.json',

...

@dsl.pipeline(
  name='TFX Taxi Cab Classification Pipeline Example for on-prem cluster',
  description='Example pipeline that does classification with model analysis based on a public BigQuery dataset for on-prem cluster.'
)
def taxi_cab_classification_onprem(
    project='taxi-cab-classification-pipeline-on-prem',
    relative_column_names='taxi-cab-classification/column-names.json',

...

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(taxi_cab_classification_gcp, __file__ + '_gcp.zip')
    kfp.compiler.Compiler().compile(taxi_cab_classification_onprem, __file__ + '_onprem.zip')

And compile each with --function, as below:

  • GCP:
    dsl_compile --py taxi-cab-classification-pipeline.py --output taxi-cab-classification-pipeline_gcp.tar.gz --function taxi_cab_classification_gcp
  • On-prem cluster:
    dsl_compile --py taxi-cab-classification-pipeline.py --output taxi-cab-classification-pipeline_onprem.tar.gz --function taxi_cab_classification_onprem

@elikatsis
Member

Hi @jinchihe,

[...] such as I'd like to use the feature in #801 to create a pvc and download source taxi-cab-classification to the pvc automatically, and then attach the pvc to each steps by pvolumes of ContainerOp, that means each component will has different parameter between gcp and on-prem, as on-prem will add pvolumes: [...]

Thank you for integrating with the work merged at #926; I'm glad you find pvolumes convenient.
Have you bumped into anything while using pvolumes? Please let me know.

I have not used components extensively; what is the easiest way to use a pvolume instance with a component? Looking at code examples, I think you'd find an .add_pvolumes() method very helpful, so you could do, for example:

vop = dsl.VolumeOp(...)
validation = dataflow_tf_data_validation_op(...).add_pvolumes({pvc_mount_path: vop.volume})
preprocess = dataflow_tf_transform_op(...).add_pvolumes({pvc_mount_path: validation.pvolume})
# or preprocess = dataflow_tf_transform_op(...).add_pvolumes({pvc_mount_path: vop.volume})
# since in this sample all the dependencies are resolved using parameters
...

I'm preparing a WIP PR so you can test.

@Ark-kun
Contributor

Ark-kun commented May 18, 2019

I'd like to use the feature in #801 to create a pvc and download source taxi-cab-classification to the pvc automatically, and then attach the pvc to each steps by pvolumes of ContainerOp

What about creating the volume, but then using onprem.mount_pvc to mount it to the steps?

@jinchihe
Member Author

jinchihe commented May 20, 2019

What about creating the volume, but then using onprem.mount_pvc to mount it to the steps?

@Ark-kun Thanks. The onprem.mount_pvc method is what's currently used. :-) If using dsl.VolumeOp to create the PVC, it may be better to use the .add_pvolumes method mentioned by @elikatsis (#1353 is on the way); add_pvolumes seems to have similar functionality to onprem.mount_pvc. I will try that once I get the chance. Thanks.
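The two styles being compared — onprem.mount_pvc as a transform applied to an op, versus .add_pvolumes as a chainable method — can be sketched side by side with stand-in classes (not the real KFP signatures; the PVC name is hypothetical):

```python
# Two mounting styles, sketched with stand-in classes:
# - mount_pvc returns a transform that an op applies to itself
#   (the kfp.onprem.mount_pvc style),
# - add_pvolumes mutates the op and returns it, so it chains
#   (the dsl.VolumeOp / pvolumes style).

class Volume:
    def __init__(self, pvc_name):
        self.pvc_name = pvc_name

class Op:
    def __init__(self, name):
        self.name = name
        self.pvolumes = {}

    def add_pvolumes(self, mapping):   # chainable style
        self.pvolumes.update(mapping)
        return self

    def apply(self, transform):        # transform style
        return transform(self)

def mount_pvc(pvc_name, mount_path="/mnt"):
    def _transform(op):
        op.pvolumes[mount_path] = Volume(pvc_name)
        return op
    return _transform

a = Op("train").apply(mount_pvc("taxi-pvc"))
b = Op("train").add_pvolumes({"/mnt": Volume("taxi-pvc")})
```

Both end in the same mounted state, which is why the choice in this thread is mostly about ergonomics rather than capability.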

@jinchihe jinchihe force-pushed the update_tfx_taxi_sample_to_add_onprem branch 2 times, most recently from 78c5c65 to e86d246 Compare May 21, 2019 07:27
@jinchihe
Member Author

jinchihe commented May 21, 2019

/assign @Ark-kun

I just updated the PR as follows:

  1. For easy maintenance and to avoid code duplication, I merged the pipeline definition files for GCP and on-prem clusters.
  2. For on-prem clusters, I added two steps that create the PVC and download the code automatically.
  3. For sample_config.json, I removed the changes and kept only one sample, which is for GCP.
  4. Defined a platform variable to switch between GCP and on-prem. By default it targets GCP; users who want to try the pipeline on an on-prem cluster only need a simple update, as described in the README.
  5. For the code that mounts the PVC and GCP's use_gcp_secret, we can enhance it later once PR #1209 (SDK/Compiler - Added the ability to apply a function to all ops in a pipeline) is merged.

Any more comments please let me know. Thanks a lot!

@jinchihe jinchihe force-pushed the update_tfx_taxi_sample_to_add_onprem branch from e86d246 to 69dac3f Compare May 21, 2019 13:40
@jinchihe
Member Author

@Ark-kun Any more comments? I noticed #1209 has been merged, but I'd like to keep the existing implementation, since some steps don't need the PVC; I used a for loop to mount it and I think the code is clear. :-)

@Ark-kun
Contributor

Ark-kun commented May 31, 2019

/lgtm
/approve

Sorry for the delays!

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Ark-kun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jinchihe
Member Author

jinchihe commented Jun 1, 2019

/retest

@k8s-ci-robot k8s-ci-robot merged commit 895645c into kubeflow:master Jun 1, 2019
@cdyangzhenyu

@jinchihe Hi Jinchi, I was testing the TFX demo on-prem today. The pipeline workflow worked fine, but the web UI has an error: I can't open the artifacts UI. I found a JS error: 'Uncaught (in promise) Error: Unsupported storage path: /mnt/a07b8215-8c1a-11e9-a2ff-525400ed33aa/tfx-taxi-cab-classification-pipeline-example-q7rtq-2449399348/data/roc.csv'.
How can I solve this problem? Thanks.

@jinchihe jinchihe deleted the update_tfx_taxi_sample_to_add_onprem branch June 12, 2019 09:33
@jinchihe
Member Author

@cdyangzhenyu I think the Pipeline web UI does not support fetching artifacts from a local path, only from gs://, s3://, etc. This may be supported later. Thanks.


Successfully merging this pull request may close these issues.

example TFX taxi support on-prem cluster