[Sample] Update the Doc for TFX sample #2798

Merged (2 commits) on Jan 16, 2020
50 changes: 38 additions & 12 deletions samples/core/parameterized_tfx_oss/README.md
@@ -1,34 +1,59 @@
# Overview

[TensorFlow Extended (TFX)](https://github.com/tensorflow/tfx) is a Google-production-scale machine
learning platform based on TensorFlow. It provides a configuration framework to express ML pipelines
consisting of TFX components. Kubeflow Pipelines can be used as the orchestrator supporting the
execution of a TFX pipeline.

This sample demonstrates how to author an ML pipeline in TFX and run it on a KFP deployment.
Please refer to inline comments for the purpose of each step.
This directory contains two samples that demonstrate how to author an ML pipeline in TFX and run it
on a KFP deployment.
* `parameterized_tfx_oss.py` is a Python script that outputs a compiled KFP workflow, which you can
submit to a KFP deployment to run;
* `parameterized_tfx_oss.ipynb` is a notebook version of `parameterized_tfx_oss.py`, and it also
includes guidance on setting up its dependencies.

Please refer to inline comments for the purpose of each step in both samples.

In order to successfully compile this sample, you'll need to have a TFX installation at version 0.15.0
by running `python3 -m pip install tfx==0.15.0`
After that, run
# Compilation
* `parameterized_tfx_oss.py`:
To compile the Python sample, you'll need TFX version 0.15.0, which you can install by running
`python3 -m pip install tfx==0.15.0`. After that, from the sample directory, run
`python3 parameterized_tfx_oss.py` to compile the TFX pipeline into a KFP pipeline package.
The compilation is done by invoking `kfp_runner.run(pipeline)` in the script.

* `parameterized_tfx_oss.ipynb`:
The notebook sample installs its dependencies as its first step. In particular,
it depends on the latest KFP release and a nightly TFX build in order to use `TFX::RuntimeParameter`.

# Permission
This pipeline requires Google Cloud Storage permission to run.
If KFP was deployed through K8S marketplace, please follow instructions in
[the guideline](https://github.com/kubeflow/pipelines/blob/master/manifests/gcp_marketplace/guide.md#gcp-service-account-credentials)
to make sure the service account has the `storage.admin` role.
If KFP was deployed through a
[standalone deployment](https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize),
please refer to [Authenticating Pipelines to GCP](https://www.kubeflow.org/docs/gke/authentication-pipelines/)
to provide the `storage.admin` permission.

# Execution
* `parameterized_tfx_oss.py`:
You can submit the compiled package to a KFP deployment and run it from the UI.

## Caveats
* `parameterized_tfx_oss.ipynb`:
In the last step of the notebook, the pipeline is executed via the KFP SDK client. You also
have the option to submit and run it manually from the UI.
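
As a rough sketch of what submitting a compiled package through the KFP SDK client might look like: the parameter names `data-root` and `module-file`, the package file name, and both helper functions are illustrative assumptions rather than code taken from the sample. `kfp` is imported lazily so that `build_arguments` can be tried without the SDK installed.

```python
from typing import Dict, Optional


def build_arguments(data_root: str, module_file: str) -> Dict[str, str]:
    """Collect run-time values for the pipeline parameters.

    The keys are assumed parameter names; use whatever names your
    compiled pipeline actually exposes.
    """
    return {"data-root": data_root, "module-file": module_file}


def submit(package_path: str, arguments: Dict[str, str], host: Optional[str] = None):
    """Submit a compiled pipeline package to a KFP deployment."""
    # Lazy import: only needed when a run is really submitted.
    import kfp

    client = kfp.Client(host=host)
    return client.create_run_from_pipeline_package(package_path, arguments=arguments)


args = build_arguments(
    data_root="gs://my-bucket/data",               # hypothetical bucket
    module_file="gs://my-bucket/taxi_utils.py",    # hypothetical module file
)
# Example call (requires a reachable KFP deployment, so it is left commented out):
# submit("parameterized_tfx_oss.tar.gz", args)
```

Keeping the argument dictionary separate from the submission call makes it easy to inspect the values before a run is created.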

This sample uses pipeline parameters in a TFX pipeline, which is not yet fully supported.
## Caveats in `parameterized_tfx_oss.py`
This sample uses pipeline parameters in a TFX pipeline, which was not fully supported in TFX 0.15.0.
See [here](https://github.com/tensorflow/tfx/issues/362) for more details. In this sample, however,
the paths to the module file and to the data are parameterized. This is achieved by specifying those
objects as `dsl.PipelineParam` and appending them to `KubeflowDagRunner._params`. Then,
`KubeflowDagRunner` can identify those pipeline parameters and interpret them as Argo
placeholders during compilation. However, this parameterization approach is a hack and
we do not plan to support it long-term. Instead, we're working with the TFX team to support
pipeline parameterization using their [RuntimeParameter](https://github.com/tensorflow/tfx/blob/46bb4f975c36ea1defde4b3c33553e088b3dc5b8/tfx/orchestration/data_types.py#L108).
pipeline parameterization using their
[RuntimeParameter](https://github.com/tensorflow/tfx/blob/592e05ea544d05f28d108ab74ebca70540854917/tfx/orchestration/data_types.py#L158).
You can check out the usage of `RuntimeParameter` in the notebook sample.
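
The workaround described above can be pictured with a small, self-contained sketch. It does not import TFX or KFP: `FakePipelineParam` and `FakeDagRunner` are hypothetical stand-ins, the bucket paths are made up, and the placeholder format is an assumption modeled on how a `dsl.PipelineParam` renders as a string.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class FakePipelineParam:
    """Hypothetical stand-in for kfp.dsl.PipelineParam (not the real class)."""
    name: str
    value: str  # default value used when the run does not override it

    def __str__(self) -> str:
        # Assumed placeholder format; a compiler would later map it to an
        # Argo workflow parameter reference.
        return "{{pipelineparam:op=;name=%s}}" % self.name


@dataclass
class FakeDagRunner:
    """Hypothetical stand-in for KubeflowDagRunner; it only tracks parameters."""
    _params: List[FakePipelineParam] = field(default_factory=list)

    def exposed_parameters(self) -> List[str]:
        return [p.name for p in self._params]


data_root = FakePipelineParam(name="data-root", value="gs://my-bucket/data")
module_file = FakePipelineParam(name="module-file", value="gs://my-bucket/taxi_utils.py")

# Register the parameters with the runner, mirroring the append to `_params`.
runner = FakeDagRunner()
runner._params.extend([data_root, module_file])

# A component receives the placeholder string, not the default value.
component_input = str(data_root)
print(component_input)
print(runner.exposed_parameters())
```

The point of the sketch is that the component configuration only ever sees a string placeholder, which is why the approach is limited to string-typed values.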

### Known issues
* This approach only works for string-typed quantities. For example, you cannot parameterize
`num_steps` of `Trainer` in this way.
@@ -37,5 +62,6 @@ pipeline parameterization using their [RuntimeParameter](https://github.com/tens
* If the parameter is referenced in multiple places, the user should
make sure that it is correctly converted to the string-formatted placeholder by
calling `str(your_param)`.
* The best practice is to point the TFX pipeline root at an empty directory. In this sample, Argo
does that automatically by plugging the workflow unique ID (represented by
`kfp.dsl.RUN_ID_PLACEHOLDER`) into the pipeline root path.
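
To illustrate the pipeline-root point, here is a minimal sketch in plain Python. The placeholder value `{{workflow.uid}}` is an assumption about what `kfp.dsl.RUN_ID_PLACEHOLDER` holds, the bucket and pipeline names are hypothetical, and `resolve` merely imitates the substitution Argo would perform at run time.

```python
import posixpath

# Assumed value of kfp.dsl.RUN_ID_PLACEHOLDER.
RUN_ID_PLACEHOLDER = "{{workflow.uid}}"


def pipeline_root(bucket: str, pipeline_name: str) -> str:
    """Build a pipeline root template that ends in the run ID placeholder."""
    return posixpath.join(bucket, pipeline_name, RUN_ID_PLACEHOLDER)


def resolve(template: str, workflow_uid: str) -> str:
    """Imitate the run-time substitution; for illustration only."""
    return template.replace(RUN_ID_PLACEHOLDER, workflow_uid)


root = pipeline_root("gs://my-bucket", "parameterized_tfx_oss")
run_a = resolve(root, "uid-aaa")
run_b = resolve(root, "uid-bbb")

print(root)
print(run_a)
print(run_b)
```

Because each run resolves to its own directory, a new run starts from an empty pipeline root instead of reusing the artifacts of a previous run.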