
Enhance pipeline TFX taxi sample to support on-prem cluster #749

Merged

Conversation

jinchihe
Member

@jinchihe jinchihe commented Jan 29, 2019

fixes: #721

Enhance pipeline TFX taxi sample to support on-prem cluster



@jinchihe
Member Author

/assign @IronPan

@IronPan
Member

IronPan commented Jan 29, 2019

/assign @gaoning777 @Ark-kun

Review threads on samples/tfx/README.md, backend/src/apiserver/config/sample_config.json, and test/sample-test/run_test.sh (outdated; all resolved).
@jinchihe jinchihe force-pushed the update_tfx_taxi_sample_to_add_onprem branch from abf32ac to 6b33f16 Compare January 31, 2019 09:59
@jinchihe
Member Author

jinchihe commented Feb 2, 2019

/test kubeflow-pipeline-e2e-test

@neuromage
Contributor

Thanks for the PR!

Overall, the direction to support on-prem is great. However, I'm not in favour of having two versions of the pipeline, one for GCP and one using NFS, as it increases maintenance cost. Ideally, we should have one pipeline, with a user-configurable option of whether to store artifacts on GCP or NFS. Would it be feasible to explore this approach instead?
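The single-pipeline-with-an-option idea can be illustrated without the KFP SDK. In this stdlib-only sketch, `Op`, `use_gcp_secret`, and `mount_pvc` are stand-ins for the real DSL names, and the step and PVC names are hypothetical:

```python
# Sketch of one pipeline whose storage backend is a user-settable
# option. "Op", "use_gcp_secret", and "mount_pvc" are stand-ins,
# not the real KFP DSL classes/helpers.

class Op:
    def __init__(self, name):
        self.name = name
        self.secrets = []
        self.volumes = {}

def use_gcp_secret(op, secret="user-gcp-sa"):
    op.secrets.append(secret)
    return op

def mount_pvc(op, pvc_name, mount_path="/mnt"):
    op.volumes[mount_path] = pvc_name
    return op

def taxi_pipeline(platform="GCP"):
    steps = [Op("validation"), Op("transform"), Op("train"), Op("deploy")]
    for op in steps:
        if platform == "GCP":
            use_gcp_secret(op)         # GCP: artifacts live in GCS
        else:
            mount_pvc(op, "taxi-pvc")  # on-prem: shared PVC instead
    return steps

gcp = taxi_pipeline("GCP")
onprem = taxi_pipeline("onprem")
```

One pipeline body, one switch — which is the maintenance win being argued for here.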

@gaoning777
Contributor

echoed @neuromage

@gyliu513
Member

@neuromage good point, let me check with @jinchihe for how we can move forward.

@jinchihe
Member Author

Overall, the direction to support on-prem is great. However, I'm not in favour of having two versions of the pipeline, one for GCP and one using NFS, as it increases maintenance cost. Ideally, we should have one pipeline, with a user-configurable option of whether to store artifacts on GCP or NFS. Would it be feasible to explore this approach instead?

@neuromage @gaoning777 @gyliu513

In fact, we discussed the solution in the original issue #721. As @swiftdiaries mentioned there, it's better to have separate example code: the on-prem version wouldn't use GCP secrets, so having a separate tfx_taxi_cab_onprem.py makes sense. We'd also need volumes, so the code would look very different.

@vicaire
Contributor

vicaire commented Feb 13, 2019

@jinchihe, what do you think of the DSL changes proposed at #801 to support volumes in the DSL?

@jinchihe
Member Author

jinchihe commented Feb 14, 2019

@jinchihe, what do you think of the DSL changes proposed at #801 to support volumes in the DSL?

@vicaire Great — that would make supporting local storage easy. So I think we can put this PR on hold for now. Thanks.

@vicaire
Contributor

vicaire commented Feb 14, 2019

OK, thanks. Let's conclude the discussion on #801 and put this PR on hold until then.

@cvenets

cvenets commented Feb 15, 2019

Hello @jinchihe, I was testing your PR today, trying to run the whole example completely on-prem.

Although everything looks good, I'm getting an error in the last step of the pipeline, the kubeflow_deploy_op. The log states:

+ KUBERNETES_NAMESPACE=kubeflow
+ SERVER_NAME=model-server
+ (( 8 ))
+ case $1 in
+ shift
+ CLUSTER_NAME=tfx-taxi-pipeline-on-prem
+ shift
+ (( 6 ))
+ case $1 in
+ shift
+ MODEL_PATH=/mnt/tfx-taxi-cab-classification-pipeline-example-z85z5/train
+ shift
+ (( 4 ))
+ case $1 in
+ shift
+ SERVER_NAME=taxi-cab-classification-model-tfx-taxi-cab-classification-pipeline-example-z85z5
+ shift
+ (( 2 ))
+ case $1 in
+ echo 'Unknown argument: '\''--pvc-name'\'''
Unknown argument: '--pvc-name'
+ exit 1

After digging in a bit, looking into the container image, I understand that the container image for the deploy part, which in your code is image='gcr.io/ml-pipeline/ml-pipeline-kubeflow-deployer:be19cbc2591a48d2ef5ca715c34ecae8223cf454', contains a deploy.sh script that doesn't support either the --model-storage-type or --pvc-name arguments. Thus the error.
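For context, the trace above is the classic shift-based `case` argument parser. A Python sketch of the same pattern (the flag set here is illustrative, not the real deploy.sh) shows why one unrecognized flag aborts the whole step:

```python
# Mirrors the shift/case argument loop visible in the deploy.sh trace:
# each recognized flag consumes itself plus its value; any other flag
# hits the default case and aborts. Flag names here are illustrative.

def parse_deploy_args(argv):
    known = {"--cluster-name", "--model-path", "--server-name"}
    opts, i = {}, 0
    while i < len(argv):
        flag = argv[i]
        if flag in known:
            opts[flag] = argv[i + 1]
            i += 2  # the two `shift`s in the bash version
        else:
            # An older image with no --pvc-name case lands here
            # and exits with a nonzero status.
            raise SystemExit("Unknown argument: '%s'" % flag)
    return opts

parsed = parse_deploy_args(["--cluster-name", "tfx-taxi-pipeline-on-prem"])
```

So the fix has to land in the deployer image itself: a script compiled against the old flag set cannot simply ignore the new arguments.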

Can you point me to the image you used to run the sample successfully?

Thank you,
Constantinos

@jinchihe jinchihe force-pushed the update_tfx_taxi_sample_to_add_onprem branch from 6b33f16 to e159250 Compare February 15, 2019 01:56
@jinchihe
Member Author

Hello @jinchihe, I was testing your PR today, trying to run the whole example completely on-prem. [...] Can you point me to the image you used to run the sample successfully?

Thanks for the tests @cvenets. The problem was fixed in #755, but because this PR was put on hold, the image version was not updated.
I just rebased and updated to the new image. It should be OK now. :-)

@jinchihe jinchihe force-pushed the update_tfx_taxi_sample_to_add_onprem branch from e159250 to 81d5c93 Compare February 15, 2019 02:17
Review threads on test/sample-test/run_test.sh, samples/tfx/taxi-cab-classification-pipeline-on-prem.py, and samples/tfx/README.md (outdated; all resolved).
@jinchihe jinchihe force-pushed the update_tfx_taxi_sample_to_add_onprem branch from 81d5c93 to 7a47e60 Compare February 15, 2019 03:16
@cvenets

cvenets commented Feb 15, 2019

Hello @jinchihe, thank you for updating the pipeline code and image.

Unfortunately, when trying to run the updated pipeline, the first step (validation) now breaks with the following log. This didn't happen with the previous pipeline code; all steps finished successfully except the deployment part. Any thoughts?

I see that I'm running the updated image now: image: gcr.io/ml-pipeline/ml-pipeline-dataflow-tfdv:5df2cdc1ed145320204e8bc73b59cdbd7b3da28f

tfx-taxi-cab-classification-pipeline-example-for-on-prem-cqg2mn-3535214929
This step is in Failed state with this message: failed with exit code 1
    from tensorflow_transform.coders.csv_coder import CsvCoder
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_transform/coders/csv_coder.py", line 26, in <module>
    import tensorflow as tf
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/__init__.py", line 22, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/__init__.py", line 81, in <module>
    from tensorflow.python import keras
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/__init__.py", line 24, in <module>
    from tensorflow.python.keras import activations
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/activations/__init__.py", line 22, in <module>
    from tensorflow.python.keras._impl.keras.activations import elu
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/__init__.py", line 21, in <module>
    from tensorflow.python.keras._impl.keras import activations
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/activations.py", line 23, in <module>
    from tensorflow.python.keras._impl.keras import backend as K
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/_impl/keras/backend.py", line 37, in <module>
    from tensorflow.python.layers import base as tf_base_layers
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/base.py", line 25, in <module>
    from tensorflow.python.keras.engine import base_layer
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/__init__.py", line 21, in <module>
    from tensorflow.python.keras.engine.base_layer import InputSpec
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 33, in <module>
    from tensorflow.python.keras import backend
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/backend/__init__.py", line 22, in <module>
    from tensorflow.python.keras._impl.keras.backend import abs
ImportError: cannot import name abs

Thank you,
Constantinos

@jinchihe
Member Author

Seems samples/tfx/taxi-cab-classification-pipeline.py has conflicts again.

Now I think it's better to merge taxi-cab-classification-pipeline-on-prem.py and taxi-cab-classification-pipeline-gcp.py into one file in this PR; that makes maintenance easier and avoids code duplication. @Ark-kun @gaoning777 WDYT?

Personally, I strongly suggest merging this PR, since many people are following it and asking questions via Slack and email about running the sample on-prem (local storage). :-)

@Ark-kun
Contributor

Ark-kun commented May 15, 2019

has conflicts again

Sorry for the interruptions. There was a release yesterday and releases currently change the SHA hashes in the component paths.

Now I think it's better to merge the taxi-cab-classification-pipeline-on-prem.py and taxi-cab-classification-pipeline-gcp.py into one file in the PR, if so, easy maintenance and avoid code duplication. @Ark-kun @gaoning777 WDYT?

It would be great.
#1209 will make it much easier.
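The "apply a function to all ops" idea in #1209 is essentially an op-transformer hook. A stdlib-only sketch of the concept, where `Pipeline` and `Op` are stand-ins for the real KFP classes and the PVC name is hypothetical:

```python
# Sketch of the #1209 idea: register one transformer and have it
# applied to every op when the pipeline is finalized. Pipeline and
# Op are stand-ins, not the actual KFP SDK classes.

class Op:
    def __init__(self, name):
        self.name = name
        self.mounts = {}

class Pipeline:
    def __init__(self, ops):
        self.ops = list(ops)
        self._transformers = []

    def add_op_transformer(self, fn):
        self._transformers.append(fn)

    def finalize(self):
        for op in self.ops:
            for fn in self._transformers:
                fn(op)
        return self.ops

def mount_shared_pvc(op):
    op.mounts["/mnt"] = "taxi-pvc"   # same PVC on every step

p = Pipeline([Op("validate"), Op("transform"), Op("train")])
p.add_op_transformer(mount_shared_pvc)
ops = p.finalize()
```

With a hook like this, the GCP/on-prem difference collapses to registering either a secret-applying or a PVC-mounting transformer once, instead of editing every step.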

@jinchihe
Member Author

@Ark-kun Thanks. here's what I'm thinking. Please review and comment. Thanks a lot!

There are some differences between GCP and on-prem clusters in the parameters and in ContainerOp. For example, I'd like to use the feature in #801 to create a PVC and download the taxi-cab-classification source to the PVC automatically, then attach the PVC to each step via the pvolumes argument of ContainerOp. That means each component will have different parameters between GCP and on-prem, since on-prem will add pvolumes:

    cop = dsl.ContainerOp(
        ...
        pvolumes={"/mnt": vop.volume}
    )

So here's what I'm thinking: add a new function in taxi-cab-classification-pipeline.py, so that after the change one function is for GCP and another is for on-prem, as follows:

...

@dsl.pipeline(
  name='TFX Taxi Cab Classification Pipeline Example for GCP',
  description='Example pipeline that does classification with model analysis based on a public BigQuery dataset for GCP.'
)
def taxi_cab_classification_gcp(
    output,
    project,
    column_names='gs://ml-pipeline-playground/tfx/taxi-cab-classification/column-names.json',

...

@dsl.pipeline(
  name='TFX Taxi Cab Classification Pipeline Example for on-prem cluster',
  description='Example pipeline that does classification with model analysis based on a public BigQuery dataset for on-prem cluster.'
)
def taxi_cab_classification_onprem(
    project='taxi-cab-classification-pipeline-on-prem',
    relative_column_names='taxi-cab-classification/column-names.json',

...

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(taxi_cab_classification_gcp, __file__ + '_gcp.zip')
    kfp.compiler.Compiler().compile(taxi_cab_classification_onprem, __file__ + '_onprem.zip')

And compile each with --function, as below:

  • GCP:
    dsl_compile --py taxi-cab-classification-pipeline.py --output taxi-cab-classification-pipeline_gcp.tar.gz --function taxi_cab_classification_gcp
  • On-prem cluster:
    dsl_compile --py taxi-cab-classification-pipeline.py --output taxi-cab-classification-pipeline_onprem.tar.gz --function taxi_cab_classification_onprem

@elikatsis
Member

Hi @jinchihe,

[...] such as I'd like to use the feature in #801 to create a pvc and download source taxi-cab-classification to the pvc automatically, and then attach the pvc to each steps by pvolumes of ContainerOp, that means each component will has different parameter between gcp and on-prem, as on-prem will add pvolumes: [...]

Thank you for integrating with the work merged at #926; I'm glad you find pvolumes convenient.
Have you bumped into anything while using pvolumes? Please let me know.

I have not used components extensively; what is the easiest way to use a pvolume instance with a component? Looking at code examples, I think you'd find an .add_pvolumes() method very helpful, so you could do, for example:

vop = dsl.VolumeOp(...)
validation = dataflow_tf_data_validation_op(...).add_pvolumes({pvc_mount_path: vop.volume})
preprocess = dataflow_tf_transform_op(...).add_pvolumes({pvc_mount_path: validation.pvolume})
# or preprocess = dataflow_tf_transform_op(...).add_pvolumes({pvc_mount_path: vop.volume})
# since in this sample all the dependencies are resolved using parameters
...

I'm preparing a WIP PR so you can test.

@Ark-kun
Contributor

Ark-kun commented May 18, 2019

I'd like to use the feature in #801 to create a pvc and download source taxi-cab-classification to the pvc automatically, and then attach the pvc to each steps by pvolumes of ContainerOp

What about creating the volume, but then using onprem.mount_pvc to mount it to the steps?

@jinchihe
Member Author

jinchihe commented May 20, 2019

What about creating the volume, but then using onprem.mount_pvc to mount it to the steps?

@Ark-kun Thanks. The onprem.mount_pvc method is what's currently used. :-) If using dsl.VolumeOp to create the PVC, it may be better to use the .add_pvolumes method mentioned by @elikatsis (#1353 is on the way); add_pvolumes seems to have similar functionality to onprem.mount_pvc. I will try that once I get the chance. Thanks.
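The two styles being compared — onprem.mount_pvc as a transform applied to an op, versus .add_pvolumes as a chainable method — can be sketched side by side with stand-in classes (not the real KFP signatures; the PVC name is hypothetical):

```python
# Two mounting styles, sketched with stand-in classes:
# - mount_pvc returns a transform that an op applies to itself
#   (the kfp.onprem.mount_pvc style),
# - add_pvolumes mutates the op and returns it, so it chains
#   (the dsl.VolumeOp / pvolumes style).

class Volume:
    def __init__(self, pvc_name):
        self.pvc_name = pvc_name

class Op:
    def __init__(self, name):
        self.name = name
        self.pvolumes = {}

    def add_pvolumes(self, mapping):   # chainable style
        self.pvolumes.update(mapping)
        return self

    def apply(self, transform):        # transform style
        return transform(self)

def mount_pvc(pvc_name, mount_path="/mnt"):
    def _transform(op):
        op.pvolumes[mount_path] = Volume(pvc_name)
        return op
    return _transform

a = Op("train").apply(mount_pvc("taxi-pvc"))
b = Op("train").add_pvolumes({"/mnt": Volume("taxi-pvc")})
```

Both end in the same mounted state, which is why the choice in this thread is mostly about ergonomics rather than capability.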

@jinchihe jinchihe force-pushed the update_tfx_taxi_sample_to_add_onprem branch 2 times, most recently from 78c5c65 to e86d246 Compare May 21, 2019 07:27
@jinchihe
Member Author

jinchihe commented May 21, 2019

/assign @Ark-kun

I just updated the PR as follows:

  1. For easy maintenance and to avoid code duplication, I merged the pipeline definition files for GCP and on-prem clusters.
  2. For on-prem clusters, I added two steps that create the PVC and download the code automatically.
  3. For sample_config.json, I removed the changes and kept only one sample, which is for GCP.
  4. Defined a platform variable to switch between GCP and on-prem. By default it targets GCP; users who want to try the pipeline on an on-prem cluster only need a simple update, as described in the README.
  5. For the code that mounts the PVC and GCP's use_gcp_secret, we can enhance it later once PR #1209 (SDK/Compiler - Added the ability to apply a function to all ops in a pipeline) is merged.

Any more comments please let me know. Thanks a lot!

@jinchihe jinchihe force-pushed the update_tfx_taxi_sample_to_add_onprem branch from e86d246 to 69dac3f Compare May 21, 2019 13:40
@jinchihe
Member Author

@Ark-kun Any more comments? I noticed #1209 has been merged, but I'd like to keep the existing implementation, since some steps don't need the PVC; I used a for loop to mount it and I think the code is clear. :-)

@Ark-kun
Contributor

Ark-kun commented May 31, 2019

/lgtm
/approve

Sorry for the delays!

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Ark-kun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jinchihe
Member Author

jinchihe commented Jun 1, 2019

/retest

@k8s-ci-robot k8s-ci-robot merged commit 895645c into kubeflow:master Jun 1, 2019
@cdyangzhenyu

@jinchihe Hi Jinchi, I was testing the TFX demo on-prem today. The pipeline workflow worked fine, but the web UI has an error: I can't open the artifacts UI. I found a JS error: 'Uncaught (in promise) Error: Unsupported storage path: /mnt/a07b8215-8c1a-11e9-a2ff-525400ed33aa/tfx-taxi-cab-classification-pipeline-example-q7rtq-2449399348/data/roc.csv'.
How can I solve this problem? Thanks.

@jinchihe jinchihe deleted the update_tfx_taxi_sample_to_add_onprem branch June 12, 2019 09:33
@jinchihe
Member Author

@cdyangzhenyu I think the Pipeline web UI does not support fetching artifacts from a local path, only from gs://, s3://, etc. This may be supported later. Thanks.


Successfully merging this pull request may close these issues.

example TFX taxi support on-prem cluster