# Simple pipeline demo #322

**Merged** - 8 commits merged on Nov 16, 2018. Showing changes from 5 commits.
**File: demos/simple_pipeline/README.md** (new file, 118 additions)
# Kubeflow demo - Simple pipeline

## Hyperparameter tuning and autoprovisioning GPU nodes

This repository contains a demonstration of Kubeflow capabilities, suitable for
presentation to public audiences.

The base demo includes the following steps:

1. [Setup your environment](#1-setup-your-environment)
1. [Run a simple pipeline](#2-run-a-simple-pipeline)
1. [Perform hyperparameter tuning](#3-perform-hyperparameter-tuning)
1. [Run a better pipeline](#4-run-a-better-pipeline)

## 1. Setup your environment

Follow the instructions in
[demo_setup/README.md](https://github.com/kubeflow/examples/blob/master/demos/simple_pipeline/demo_setup/README.md)
to set up your environment and install Kubeflow with pipelines on an
autoprovisioning GKE cluster.

View the installed components in the GCP Console.
* In the
[Kubernetes Engine](https://console.cloud.google.com/kubernetes)
section, you will see a new cluster ${CLUSTER} with 3 `n1-standard-1` nodes
* Under
[Workloads](https://console.cloud.google.com/kubernetes/workload),
you will see all the default Kubeflow and pipeline components.

Source the environment file and activate the conda environment for pipelines:

```
source kubeflow-demo-simple-pipeline.env
source activate kfp
```

## 2. Run a simple pipeline

Show the file `gpu-example-pipeline.py` as an example of a simple pipeline.

Compile it to create a .tar.gz file:

```
./gpu-example-pipeline.py
```
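
If you want to confirm the compile step produced the archive, standard tar commands work (a quick sanity check, not part of the original demo steps):

```
ls -lh gpu-example-pipeline.py.tar.gz
tar -tzf gpu-example-pipeline.py.tar.gz
```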

View the pipelines UI locally by forwarding a port to the ml-pipeline-ui pod:

```
PIPELINES_POD=$(kubectl get po -l app=ml-pipeline-ui | \
grep ml-pipeline-ui | \
head -n 1 | \
cut -d " " -f 1 )
kubectl port-forward ${PIPELINES_POD} 8080:3000
```

In the browser, navigate to `localhost:8080` and create a new pipeline by
uploading `gpu-example-pipeline.py.tar.gz`. Select the pipeline and click
"Create experiment." Use all suggested defaults.

View the effects of autoscaling by watching the number of nodes.
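
One way to do that from a terminal (plain kubectl; assumes your context points at the demo cluster):

```
kubectl get nodes --watch
```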

Select "Experiments" from the left-hand side, then "Runs". Click on the job to view
the graph and watch it run.

Notice the low accuracy.

## 3. Perform hyperparameter tuning
> **Contributor:** It feels like the docs skip a step at this point - are people expected to have written the katib job spec manually (i.e. where is it coming from)? Would it be appropriate to have a kubeflow/pipelines op for launching a katib studyjob? I would find this really convenient. But perhaps beyond the current scope.
>
> **Author:** Added a link to the source of the gpu example file. I'm not sure how to apply a katib manifest using a pipeline step - @vicaire @qimingj do you know if that is supported?
>
> **Comment:** It is not supported in pipeline DSL, but it is supported in argo yaml since argo supports any K8s template, not just container spec. We can add Katib to DSL support (such as a KatibOp). In order to do that, ideally Katib's CRD should return some output in its job status (available via `kubectl get`) so argo can pick it up as output (with a JSON path to the field), and then the job output can be passed to downstream components. We should discuss this.

Create a study by applying an example file to the cluster:

```
kubectl apply -f gpu-example-katib.yaml
```

This creates a studyjob object. To view it:

```
kubectl get studyjob
kubectl describe studyjobs gpu-example
```
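
If the UI is not handy, the same information is available from the raw object; status field names vary across Katib versions, so dumping the full object is the safest option:

```
kubectl get studyjob gpu-example -o yaml
```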

To view the Katib UI, connect to the modeldb-frontend pod:

```
KATIB_POD=$(kubectl get po -l app=modeldb,component=frontend | \
grep modeldb-frontend | \
head -n 1 | \
cut -d " " -f 1 )
kubectl port-forward ${KATIB_POD} 8081:3000
```

In the browser, navigate to `localhost:8081/katib` and click on the
gpu-example project. In the Explore Visualizations section, select
_Optimizer_ in the _Group By_ dropdown, then click _Compare_.

While you're waiting, watch autoscaling in action: trial worker pods will sit in Pending status until nodes with the requested resources become available.
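
For example, to list just the Pending pods (field selectors are standard kubectl; this is one convenient filter):

```
kubectl get pods --field-selector=status.phase=Pending
```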

View the creation of a new GPU node pool:

```
gcloud container node-pools list --cluster ${CLUSTER}
```

View the creation of new nodes:

```
kubectl get nodes
```
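
To confirm the new nodes expose GPUs, inspect their allocatable resources; the backslash-escaped dots are required in custom-columns paths (a convenience check, not an original demo step):

```
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```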

Determine which combination of hyperparameters results in the highest accuracy.

## 4. Run a better pipeline
> **Contributor:** Maybe something like "next steps" for the section title and a little lead-in in the prose, e.g. "now that we've found some good hyperparameters we're ready to ..."
>
> **Author:** Added a bit of text to clarify the point of the transition.

In the pipelines UI, clone the previous job and update the arguments. Run the
pipeline and watch for the resulting accuracy.


**File: demos/simple_pipeline/demo_setup/README.md** (new file, 183 additions)
# Kubeflow demo - Simple pipeline

This repository contains a demonstration of Kubeflow capabilities, suitable for
presentation to public audiences.

The base demo includes the following steps:

1. [Setup your environment](#1-setup-your-environment)
1. [Create a GKE cluster and install Kubeflow](#2-create-a-gke-cluster-and-install-kubeflow)
1. [Install pipelines on GKE](#3-install-pipelines-on-gke)

## 1. Setup your environment

Set environment variables by sourcing the env file:

```
. kubeflow-demo-simple-pipeline.env
```
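
The later steps reference ${CLUSTER}, ${ZONE}, ${DEMO_PROJECT}, ${PROJECT}, ${NAMESPACE}, ${DEMO_REPO}, and ${PIPELINES_REPO}. If you are adapting this demo rather than using the shipped env file, it needs to define roughly the following (illustrative placeholders, not the file's actual contents):

```
export DEMO_PROJECT=<your-gcp-project>
export PROJECT=${DEMO_PROJECT}
export CLUSTER=<cluster-name>
export ZONE=<gcp-zone>
export NAMESPACE=kubeflow
export DEMO_REPO=<path-to-this-repo-checkout>
export PIPELINES_REPO=<path-to-pipelines-checkout>
```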

Create a clean python environment for installing Kubeflow Pipelines:

```
conda create --name kfp python=3.6
source activate kfp
```

Install the Kubeflow Pipelines SDK:

```
pip install https://storage.googleapis.com/ml-pipeline/release/0.0.26/kfp-0.0.26.tar.gz --upgrade
```
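
A quick import check confirms the SDK landed in the active conda environment:

```
python -c "import kfp.dsl; import kfp.compiler; print('kfp SDK OK')"
```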

## 2. Create a GKE cluster and install Kubeflow
> **Contributor:** I guess there's a convenience-to-the-user vs. maintainability tradeoff here. It's more convenient for the docs for launching Kubeflow to be right here, but it presents a maintainability challenge to have that documentation replicated in numerous places instead of being centralized. Thoughts?
>
> **Author:** I wrestled with this question and finally settled on adding this here for now. Early intentions were to have a single demo_setup directory in the root dir of demos, but the problem is that it can grow large and is hard to maintain. I prefer having a smaller number of setup steps that exactly matches each demo, but that comes with maintenance challenges. It's not a straightforward call and I'm open to other approaches. Good unit test coverage is our best defense here.
>
> **Contributor:** Sure, it's your preference on that then.


Creating a cluster with click-to-deploy does not yet support the installation of
pipelines, so it cannot be used for the pipeline portions of this demo, but it is
still worth showing.

### Click-to-deploy

Generate a web app Client ID and Client Secret by following the instructions
[here](https://www.kubeflow.org/docs/started/getting-started-gke/#create-oauth-client-credentials).
Save these as environment variables for easy access.

In the browser, navigate to the
[Click-to-deploy app](https://deploy.kubeflow.cloud/). Enter the project name,
along with the Client ID and Client Secret previously generated. Select the
desired ${ZONE} and latest version of Kubeflow, then click _Create Deployment_.

In the [GCP Console](https://console.cloud.google.com/kubernetes), navigate to the
Kubernetes Engine panel to watch the cluster creation process. This results in a
full cluster with Kubeflow installed.

### kfctl

While node autoprovisioning is in beta, it must be enabled manually. To create
a cluster with autoprovisioning, run the following commands, which will take
around 30 minutes:

```
gcloud container clusters create ${CLUSTER} \
--project ${DEMO_PROJECT} \
--zone ${ZONE} \
--cluster-version 1.11.2-gke.9 \
--num-nodes=8 \
--scopes cloud-platform,compute-rw,storage-rw \
--verbosity error

# scale down cluster to 3 (initial 8 is just to prevent master restarts due to upscaling)
# we cannot use 0 because then cluster autoscaler treats the cluster as unhealthy.
# Also having a few small non-gpu nodes is needed to handle system pods
gcloud container clusters resize ${CLUSTER} \
--project ${DEMO_PROJECT} \
--zone ${ZONE} \
--size=3 \
--node-pool=default-pool

# enable node auto provisioning
gcloud beta container clusters update ${CLUSTER} \
--project ${DEMO_PROJECT} \
--zone ${ZONE} \
--enable-autoprovisioning \
--max-cpu 20 \
--max-memory 200 \
--max-accelerator=type=nvidia-tesla-k80,count=8
```
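
To double-check that autoprovisioning took effect, describe the cluster and look at its autoscaling block; the exact output shape may differ across gcloud versions, so treat this as a sketch:

```
gcloud beta container clusters describe ${CLUSTER} \
  --project ${DEMO_PROJECT} \
  --zone ${ZONE} \
  --format="yaml(autoscaling)"
```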

Once the cluster has been created, install GPU drivers:

```
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/daemonset.yaml
```
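
The manifest creates a driver-installer daemonset in kube-system (named nvidia-driver-installer at the time of writing; verify against the manifest if it has changed):

```
kubectl get daemonsets -n kube-system | grep -i nvidia
```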

Add an RBAC binding that allows your user to install Kubeflow components on
the cluster:

```
kubectl create clusterrolebinding cluster-admin-binding-${USER} \
--clusterrole cluster-admin \
--user $(gcloud config get-value account)
```

Set up kubectl access:

```
kubectl create namespace kubeflow
./create_context.sh gke ${NAMESPACE}
```

Set up OAuth environment variables ${CLIENT_ID} and ${CLIENT_SECRET} using the
instructions
[here](https://www.kubeflow.org/docs/started/getting-started-gke/#create-oauth-client-credentials).

```
kubectl create secret generic kubeflow-oauth --from-literal=client_id=${CLIENT_ID} --from-literal=client_secret=${CLIENT_SECRET}
```

Create service accounts, add permissions, download credentials, and create secrets:

```
ADMIN_EMAIL=${CLUSTER}-admin@${PROJECT}.iam.gserviceaccount.com
USER_EMAIL=${CLUSTER}-user@${PROJECT}.iam.gserviceaccount.com
ADMIN_FILE=${HOME}/.ssh/${ADMIN_EMAIL}.json
USER_FILE=${HOME}/.ssh/${USER_EMAIL}.json

gcloud iam service-accounts create ${CLUSTER}-admin --display-name=${CLUSTER}-admin
gcloud iam service-accounts create ${CLUSTER}-user --display-name=${CLUSTER}-user

gcloud projects add-iam-policy-binding ${PROJECT} \
--member=serviceAccount:${ADMIN_EMAIL} \
--role=roles/storage.admin
gcloud projects add-iam-policy-binding ${PROJECT} \
--member=serviceAccount:${USER_EMAIL} \
--role=roles/storage.admin

gcloud iam service-accounts keys create ${ADMIN_FILE} \
--project ${PROJECT} \
--iam-account ${ADMIN_EMAIL}
gcloud iam service-accounts keys create ${USER_FILE} \
--project ${PROJECT} \
--iam-account ${USER_EMAIL}

kubectl create secret generic admin-gcp-sa \
--from-file=admin-gcp-sa.json=${ADMIN_FILE}
kubectl create secret generic user-gcp-sa \
--from-file=user-gcp-sa.json=${USER_FILE}
```
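
Before continuing, confirm that both secrets exist, since Kubeflow components mount them:

```
kubectl get secrets admin-gcp-sa user-gcp-sa
```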

Install kubeflow with the following commands:

```
kfctl init ${CLUSTER} --platform gcp
cd ${CLUSTER}
kfctl generate k8s
kfctl apply k8s
```

> **Contributor:** Just curious - why not just use kfctl to create the GKE cluster?
>
> **Author:** This demo highlights autoprovisioning, which is a beta feature not included in kfctl or click-to-deploy. It also includes pipelines, which needs a bit of work on access permissions in order to be included as part of kfctl.
>
> **Contributor:** The plan for 0.4.0 is to include pipelines by default in kubeflow deployments, so hopefully this will simplify the process.
>
> **Author:** can't wait for that day 💃
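
After `kfctl apply` finishes, it can take a few minutes for everything to schedule. A quick way to check progress (standard kubectl, nothing demo-specific):

```
kubectl get pods -n kubeflow
```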

Patch some outdated katib artifacts:
```
cd ${DEMO_REPO}
kubectl delete configmap worker-template
kubectl apply -f workerConfigMap.yaml
```

> **Contributor:** We should probably fix this, instead of telling users to patch their clusters.
>
> **Author:** Agreed! I'll add updates to PR #1904 & Issue #1903.
>
> **Author:** Let's not block this PR waiting for a fix.
>
> **Contributor:** The PR is merged, do we still need this?
>
> **Author:** Can we include those changes in an 0.3 patch? I would like to be able to specify a version.
>
> **Author:** Amazing 💯 Thanks!!

## 3. Install pipelines on GKE

```
kubectl create clusterrolebinding sa-admin --clusterrole=cluster-admin --serviceaccount=kubeflow:pipeline-runner
cd ks_app
ks registry add ml-pipeline "${PIPELINES_REPO}/ml-pipeline"
ks pkg install ml-pipeline/ml-pipeline
ks generate ml-pipeline ml-pipeline
ks param set ml-pipeline namespace kubeflow
ks apply default -c ml-pipeline
```
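
To confirm the deployment succeeded, check that the pipeline pods reach Running status (pod names vary by release):

```
kubectl get pods -n kubeflow | grep ml-pipeline
```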

View the installed components in the GCP Console. In the
[Kubernetes Engine](https://console.cloud.google.com/kubernetes)
section, you will see a new cluster ${CLUSTER}. Under
[Workloads](https://console.cloud.google.com/kubernetes/workload),
you will see all the default Kubeflow and pipeline components.


**File: demos/simple_pipeline/gpu-example-katib.yaml** (new file, 39 additions)
```
apiVersion: "kubeflow.org/v1alpha1"
kind: StudyJob
metadata:
  namespace: kubeflow
  labels:
    controller-tools.k8s.io: "1.0"
  name: gpu-example
spec:
  studyName: gpu-example
  owner: crd
  optimizationtype: maximize
  objectivevaluename: Validation-accuracy
  optimizationgoal: 0.99
  metricsnames:
    - accuracy
  parameterconfigs:
    - name: --lr
      parametertype: double
      feasible:
        min: "0.01"
        max: "0.03"
    - name: --num-layers
      parametertype: int
      feasible:
        min: "2"
        max: "3"
    - name: --optimizer
      parametertype: categorical
      feasible:
        list:
          - sgd
          - adam
          - ftrl
  workerSpec:
    goTemplate:
      templatePath: "/worker-template/gpuWorkerTemplate.yaml"
  suggestionSpec:
    suggestionAlgorithm: "random"
    requestNumber: 3
```

> **Comment:** So how does this Katib job know which training job to run? Is it somehow referencing the pipeline job?
>
> **Author:** The katib component includes a configmap.
>
> **Comment:** I see. It's not obvious from the file name (gpuWorkerTemplate.yaml) that the template references a mnist mxnet example.
**File: demos/simple_pipeline/gpu-example-pipeline.py** (new file, 46 additions)
```
#!/usr/bin/env python3

import kfp.dsl as kfp


# Training step: runs the MXNet MNIST example image with tunable hyperparameters.
def training_op(learning_rate: float,
                num_layers: int,
                optimizer='ftrl',
                step_name='training'):
    return kfp.ContainerOp(
        name=step_name,
        image='katib/mxnet-mnist-example',
        command=['python', '/mxnet/example/image-classification/train_mnist.py'],
        arguments=[
            '--batch-size', '64',
            '--lr', learning_rate,
            '--num-layers', num_layers,
            '--optimizer', optimizer
        ],
        # Passes the contents of /etc/timezone downstream as this step's output.
        file_outputs={'output': '/etc/timezone'}
    )


# Postprocessing step: echoes the output of the training step.
def postprocessing_op(output,
                      step_name='postprocessing'):
    return kfp.ContainerOp(
        name=step_name,
        image='library/bash:4.4.23',
        command=['sh', '-c'],
        arguments=['echo "%s"' % output]
    )


@kfp.pipeline(
    name='Pipeline GPU Example',
    description='Demonstrate the Kubeflow pipelines SDK with GPUs'
)
def kubeflow_training(
        learning_rate: kfp.PipelineParam = kfp.PipelineParam(name='learningrate', value=0.1),
        num_layers: kfp.PipelineParam = kfp.PipelineParam(name='numlayers', value='2'),
        optimizer: kfp.PipelineParam = kfp.PipelineParam(name='optimizer', value='ftrl')):
    training = training_op(learning_rate, num_layers, optimizer)
    postprocessing = postprocessing_op(training.output)  # pylint: disable=unused-variable


if __name__ == '__main__':
    import kfp.compiler as compiler
    compiler.Compiler().compile(kubeflow_training, __file__ + '.tar.gz')
```

> **Contributor** (on the compiled output): This looks like a super convenient way to build pipelines!! It looks like this generates a tgz people upload to the Argo UI. Does this also generate the pipeline YAML in this directory? If not, what is the relevance of the YAML that's included (perhaps as a comparison of the harder way of specifying a pipeline)? Guessing it's the former.
>
> **Author:** You can upload the .tar.gz file directly, but in this case I included a yaml with resource requests for GPUs. Support for this via python is in the works by @qimingj.
>
> **Comment:** Yep. Support for GPU is coming soon.

> **Comment** (on the postprocessing step): Is there a more interesting thing we can do in postprocessing rather than just echo? For example, push the model for serving, copy the model somewhere, run a batch prediction, or convert the model to TF? Of course we can expand the pipeline later.
>
> **Author:** I don't want to invest more effort in this pipeline since it's not really what we want to be showing. I would rather use one of the better examples, but to do that we need katib support for tf-job, which @richardsliu is looking into. Pipeline DSL support for katib would round things out to turn this into a much smoother demo.
>
> **Comment:** SG.

> **Contributor** (on testing): The fact that this pipeline is specified in python would make this especially easy to unit test. Up to you whether that's part of this PR. But can the means of triggering the pipeline run given the output of this script be programmatic? Can we consume a status code for the resulting pipeline run?
>
> **Author:** APIs for running pipelines are included - a good example is here, which @vicaire showed in this morning's community meeting.
>
> **Contributor:** Very nice.
>
> **Contributor:** @texasmichelle So would it be reasonable to use this mechanism to test the pipeline/example or should that be left for the future?