Simple pipeline demo #322

**`demos/simple_pipeline/README.md`**

@@ -0,0 +1,118 @@
# Kubeflow demo - Simple pipeline

## Hyperparameter tuning and autoprovisioning GPU nodes
This repository contains a demonstration of Kubeflow capabilities, suitable for
presentation to public audiences.

The base demo includes the following steps:

1. [Setup your environment](#1-setup-your-environment)
1. [Run a simple pipeline](#2-run-a-simple-pipeline)
1. [Perform hyperparameter tuning](#3-perform-hyperparameter-tuning)
1. [Run a better pipeline](#4-run-a-better-pipeline)
## 1. Setup your environment

Follow the instructions in
[demo_setup/README.md](https://github.com/kubeflow/examples/blob/master/demos/simple_pipeline/demo_setup/README.md)
to set up your environment and install Kubeflow with pipelines on an
autoprovisioning GKE cluster.

View the installed components in the GCP Console:

* In the
  [Kubernetes Engine](https://console.cloud.google.com/kubernetes)
  section, you will see a new cluster ${CLUSTER} with 3 `n1-standard-1` nodes.
* Under
  [Workloads](https://console.cloud.google.com/kubernetes/workload),
  you will see all the default Kubeflow and pipeline components.

Source the environment file and activate the conda environment for pipelines:

```
source kubeflow-demo-simple-pipeline.env
source activate kfp
```
## 2. Run a simple pipeline

Show the file `gpu-example-pipeline.py` as an example of a simple pipeline.

Compile it to create a `.tar.gz` file:

```
./gpu-example-pipeline.py
```

View the pipelines UI locally by forwarding a port to the ml-pipeline-ui pod:

```
PIPELINES_POD=$(kubectl get po -l app=ml-pipeline-ui | \
  grep ml-pipeline-ui | \
  head -n 1 | \
  cut -d " " -f 1 )
kubectl port-forward ${PIPELINES_POD} 8080:3000
```
In the browser, navigate to `localhost:8080` and create a new pipeline by
uploading `gpu-example-pipeline.py.tar.gz`. Select the pipeline and click
"Create experiment." Use all suggested defaults.

View the effects of autoscaling by watching the number of nodes.
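
For example, a simple way to watch nodes being added and removed from a second
terminal (standard kubectl, nothing specific to this demo):

```
kubectl get nodes --watch
```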
Select "Experiments" from the left-hand side, then "Runs". Click on the job to view
the graph and watch it run.

Notice the low accuracy.
## 3. Perform hyperparameter tuning

Create a study by applying an example file to the cluster:

```
kubectl apply -f gpu-example-katib.yaml
```

This creates a StudyJob object. To view it:

```
kubectl get studyjob
kubectl describe studyjobs gpu-example
```
To view the Katib UI, connect to the modeldb-frontend pod:

```
KATIB_POD=$(kubectl get po -l app=modeldb,component=frontend | \
  grep modeldb-frontend | \
  head -n 1 | \
  cut -d " " -f 1 )
kubectl port-forward ${KATIB_POD} 8081:3000
```

In the browser, navigate to `localhost:8081/katib` and click on the
gpu-example project. In the Explore Visualizations section, select
_Optimizer_ in the _Group By_ dropdown, then click _Compare_.

While you're waiting, watch autoscaling in action. View the pods in Pending status.
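
A quick way to list only the pending pods (using the same grep style as the
commands above):

```
kubectl get pods --all-namespaces | grep Pending
```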
View the creation of a new GPU node pool:

```
gcloud container node-pools list --cluster ${CLUSTER}
```

View the creation of new nodes:

```
kubectl get nodes
```

Determine which combination of hyperparameters results in the highest accuracy.
## 4. Run a better pipeline

> - Maybe something like "next steps" for the section title and a little lead-in in the prose, e.g. "now that we've found some good hyperparameters we're ready to ..."
> - Added a bit of text to clarify the point of the transition.

In the pipelines UI, clone the previous job and update the arguments with the best
hyperparameter values found by the study. Run the pipeline and watch for the
resulting accuracy.

**`demos/simple_pipeline/demo_setup/README.md`**

@@ -0,0 +1,183 @@
# Kubeflow demo - Simple pipeline

This repository contains a demonstration of Kubeflow capabilities, suitable for
presentation to public audiences.

The base demo includes the following steps:

1. [Setup your environment](#1-setup-your-environment)
1. [Create a GKE cluster and install Kubeflow](#2-create-a-gke-cluster-and-install-kubeflow)
1. [Install pipelines on GKE](#3-install-pipelines-on-gke)
## 1. Setup your environment

Set environment variables by sourcing the env file:

```
. kubeflow-demo-simple-pipeline.env
```
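The commands in the rest of this guide reference several variables that this file
is expected to define. A hypothetical sketch (variable names are taken from the
later steps; the values are placeholders you supply):

```
export CLUSTER=<your-cluster-name>
export ZONE=<your-gcp-zone>
export PROJECT=<your-gcp-project-id>
export DEMO_PROJECT=<project-to-hold-the-demo-cluster>
export NAMESPACE=kubeflow
export DEMO_REPO=<local-path-to-this-repository>
export PIPELINES_REPO=<local-path-to-a-clone-of-the-pipelines-repository>
```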
Create a clean python environment for installing Kubeflow Pipelines:

```
conda create --name kfp python=3.6
source activate kfp
```

Install the Kubeflow Pipelines SDK:

```
pip install https://storage.googleapis.com/ml-pipeline/release/0.0.26/kfp-0.0.26.tar.gz --upgrade
```
## 2. Create a GKE cluster and install Kubeflow

> - I guess there's a convenience to the user vs. maintainability tradeoff here. It's more convenient for the docs for launching kubeflow to be right here, but it presents a maintainability challenge to have that documentation replicated in numerous places instead of being centralized. Thoughts?
> - I wrestled with this question and finally settled on adding this here for now. Early intentions were to have a single demo_setup directory in the root dir of demos, but the problem is that it can grow large and is hard to maintain. I prefer having a smaller number of setup steps that exactly matches each demo, but that comes with maintenance challenges. It's not a straightforward call and I'm open to other approaches. Good unit test coverage is our best defense here.
> - Sure, it's your preference on that then.

Click-to-deploy does not yet support installing pipelines, so a cluster created
this way is not useful for demonstrating pipelines, but it is still worth showing.
### Click-to-deploy

Generate a web app Client ID and Client Secret by following the instructions
[here](https://www.kubeflow.org/docs/started/getting-started-gke/#create-oauth-client-credentials).
Save these as environment variables for easy access.
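
For example (placeholder values; these variables are consumed later when the
kubeflow-oauth secret is created):

```
export CLIENT_ID=<oauth-client-id>
export CLIENT_SECRET=<oauth-client-secret>
```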
In the browser, navigate to the
[Click-to-deploy app](https://deploy.kubeflow.cloud/). Enter the project name,
along with the Client ID and Client Secret previously generated. Select the
desired ${ZONE} and latest version of Kubeflow, then click _Create Deployment_.

In the [GCP Console](https://console.cloud.google.com/kubernetes), navigate to the
Kubernetes Engine panel to watch the cluster creation process. This results in a
full cluster with Kubeflow installed.
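
Cluster status can also be checked from the command line (assuming ${PROJECT}
matches the project name entered in the app):

```
gcloud container clusters list --project ${PROJECT}
```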
### kfctl

While node autoprovisioning is in beta, it must be enabled manually. To create
a cluster with autoprovisioning, run the following commands, which will take
around 30 minutes:
```
gcloud container clusters create ${CLUSTER} \
  --project ${DEMO_PROJECT} \
  --zone ${ZONE} \
  --cluster-version 1.11.2-gke.9 \
  --num-nodes=8 \
  --scopes cloud-platform,compute-rw,storage-rw \
  --verbosity error

# Scale the cluster down to 3 nodes; the initial 8 is just to prevent master
# restarts due to upscaling. We cannot use 0 because the cluster autoscaler
# would then treat the cluster as unhealthy, and a few small non-GPU nodes
# are needed to handle system pods.
gcloud container clusters resize ${CLUSTER} \
  --project ${DEMO_PROJECT} \
  --zone ${ZONE} \
  --size=3 \
  --node-pool=default-pool

# Enable node autoprovisioning.
gcloud beta container clusters update ${CLUSTER} \
  --project ${DEMO_PROJECT} \
  --zone ${ZONE} \
  --enable-autoprovisioning \
  --max-cpu 20 \
  --max-memory 200 \
  --max-accelerator=type=nvidia-tesla-k80,count=8
```
Once the cluster has been created, install GPU drivers:

```
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/daemonset.yaml
```
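To confirm the driver installer is running, check for its daemonset (it is
expected to land in the kube-system namespace; the exact name depends on the
manifest version):

```
kubectl get daemonset -n kube-system | grep nvidia
```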
Add RBAC permissions, which allow your user to install Kubeflow components on
the cluster:

```
kubectl create clusterrolebinding cluster-admin-binding-${USER} \
  --clusterrole cluster-admin \
  --user $(gcloud config get-value account)
```
Set up kubectl access:

```
kubectl create namespace kubeflow
./create_context.sh gke ${NAMESPACE}
```
Set up OAuth environment variables ${CLIENT_ID} and ${CLIENT_SECRET} using the
instructions
[here](https://www.kubeflow.org/docs/started/getting-started-gke/#create-oauth-client-credentials).

```
kubectl create secret generic kubeflow-oauth --from-literal=client_id=${CLIENT_ID} --from-literal=client_secret=${CLIENT_SECRET}
```
Create service accounts, add permissions, download credentials, and create secrets:

```
ADMIN_EMAIL=${CLUSTER}-admin@${PROJECT}.iam.gserviceaccount.com
USER_EMAIL=${CLUSTER}-user@${PROJECT}.iam.gserviceaccount.com
ADMIN_FILE=${HOME}/.ssh/${ADMIN_EMAIL}.json
USER_FILE=${HOME}/.ssh/${USER_EMAIL}.json

gcloud iam service-accounts create ${CLUSTER}-admin --display-name=${CLUSTER}-admin
gcloud iam service-accounts create ${CLUSTER}-user --display-name=${CLUSTER}-user

gcloud projects add-iam-policy-binding ${PROJECT} \
  --member=serviceAccount:${ADMIN_EMAIL} \
  --role=roles/storage.admin
gcloud projects add-iam-policy-binding ${PROJECT} \
  --member=serviceAccount:${USER_EMAIL} \
  --role=roles/storage.admin

gcloud iam service-accounts keys create ${ADMIN_FILE} \
  --project ${PROJECT} \
  --iam-account ${ADMIN_EMAIL}
gcloud iam service-accounts keys create ${USER_FILE} \
  --project ${PROJECT} \
  --iam-account ${USER_EMAIL}

kubectl create secret generic admin-gcp-sa \
  --from-file=admin-gcp-sa.json=${ADMIN_FILE}
kubectl create secret generic user-gcp-sa \
  --from-file=user-gcp-sa.json=${USER_FILE}
```
Install Kubeflow with the following commands:

```
kfctl init ${CLUSTER} --platform gcp
cd ${CLUSTER}
kfctl generate k8s
kfctl apply k8s
```

> - Just curious - why not just use kfctl to create the GKE cluster?
> - This demo highlights autoprovisioning, which is a beta feature not included in kfctl or click-to-deploy. It also includes pipelines, which needs a bit of work on access permissions in order to be included as part of kfctl.
> - The plan for 0.4.0 is to include pipelines by default in kubeflow deployments, so hopefully this will simplify the process.
> - can't wait for that day 💃

Patch some outdated katib artifacts:

> - We should probably fix this, instead of telling users to patch their clusters.
> - Agreed! I'll add updates to PR #1904 & Issue #1903
> - Let's not block this PR waiting for a fix
> - The PR is merged, do we still need this?
> - Can we include those changes in an 0.3 patch? I would like to be able to specify a version.
> - Amazing 💯 Thanks!!

```
cd ${DEMO_REPO}
kubectl delete configmap worker-template
kubectl apply -f workerConfigMap.yaml
```
## 3. Install pipelines on GKE

```
kubectl create clusterrolebinding sa-admin --clusterrole=cluster-admin --serviceaccount=kubeflow:pipeline-runner
cd ks_app
ks registry add ml-pipeline "${PIPELINES_REPO}/ml-pipeline"
ks pkg install ml-pipeline/ml-pipeline
ks generate ml-pipeline ml-pipeline
ks param set ml-pipeline namespace kubeflow
ks apply default -c ml-pipeline
```
View the installed components in the GCP Console. In the
[Kubernetes Engine](https://console.cloud.google.com/kubernetes)
section, you will see a new cluster ${CLUSTER}. Under
[Workloads](https://console.cloud.google.com/kubernetes/workload),
you will see all the default Kubeflow and pipeline components.
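
You can also verify from the command line that the components are up (pod names
vary between releases):

```
kubectl get pods -n kubeflow
```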

**`gpu-example-katib.yaml`**

@@ -0,0 +1,39 @@
apiVersion: "kubeflow.org/v1alpha1"
kind: StudyJob
metadata:
  namespace: kubeflow
  labels:
    controller-tools.k8s.io: "1.0"
  name: gpu-example
spec:
  studyName: gpu-example
  owner: crd
  optimizationtype: maximize
  objectivevaluename: Validation-accuracy
  optimizationgoal: 0.99
  metricsnames:
    - accuracy
  parameterconfigs:
    - name: --lr
      parametertype: double
      feasible:
        min: "0.01"
        max: "0.03"
    - name: --num-layers
      parametertype: int
      feasible:
        min: "2"
        max: "3"
    - name: --optimizer
      parametertype: categorical
      feasible:
        list:
          - sgd
          - adam
          - ftrl
  workerSpec:
    goTemplate:
      templatePath: "/worker-template/gpuWorkerTemplate.yaml"

> - So how does this Katib job know which training job to run? Is it somehow referencing the pipeline job?
> - The katib component includes a configmap.
> - I see. It's not obvious from the file name (gpuWorkerTemplate.yaml) that the template references a mnist mxnet example.

  suggestionSpec:
    suggestionAlgorithm: "random"
    requestNumber: 3
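
As the thread above notes, the training template referenced by `templatePath`
comes from a configmap rather than from this file; one way to inspect it
(assuming the worker-template configmap applied during setup lives in the
kubeflow namespace):

```
kubectl -n kubeflow get configmap worker-template -o yaml
```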

**`gpu-example-pipeline.py`**

@@ -0,0 +1,46 @@
#!/usr/bin/env python3

> - This looks like a super convenient way to build pipelines!! It looks like this generates a tgz people upload to the Argo UI. Does this also generate the pipeline YAML in this directory? If not, what is the relevance of the YAML that's included (perhaps as a comparison of the harder way of specifying a pipeline)? Guessing it's the former.
> - You can upload the .tar.gz file directly, but in this case I included a yaml with resource requests for GPUs. Support for this via python is in the works by @qimingj.
> - Yep. Supporting GPU is coming soon.

import kfp.dsl as kfp


def training_op(learning_rate: float,
                num_layers: int,
                optimizer='ftrl',
                step_name='training'):
    return kfp.ContainerOp(
        name=step_name,
        image='katib/mxnet-mnist-example',
        command=['python', '/mxnet/example/image-classification/train_mnist.py'],
        arguments=[
            '--batch-size', '64',
            '--lr', learning_rate,
            '--num-layers', num_layers,
            '--optimizer', optimizer
        ],
        file_outputs={'output': '/etc/timezone'}
    )

> - Is there a more interesting thing we can do in postprocessing rather than just echo? For example, push the model for serving? Copy the model somewhere? Run a batch prediction? Convert the model to tf? Of course we can expand the pipeline later.
> - I don't want to invest more effort in this pipeline since it's not really what we want to be showing. I would rather use one of the better examples, but to do that we need katib support for tf-job, which @richardsliu is looking into. Pipeline DSL support for katib would round things out to turn this into a much smoother demo.
> - SG.

def postprocessing_op(output,
                      step_name='postprocessing'):
    return kfp.ContainerOp(
        name=step_name,
        image='library/bash:4.4.23',
        command=['sh', '-c'],
        arguments=['echo "%s"' % output]
    )


@kfp.pipeline(
    name='Pipeline GPU Example',
    description='Demonstrate the Kubeflow pipelines SDK with GPUs'
)
def kubeflow_training(
        learning_rate: kfp.PipelineParam = kfp.PipelineParam(name='learningrate', value=0.1),
        num_layers: kfp.PipelineParam = kfp.PipelineParam(name='numlayers', value='2'),
        optimizer: kfp.PipelineParam = kfp.PipelineParam(name='optimizer', value='ftrl')):

    training = training_op(learning_rate, num_layers, optimizer)
    postprocessing = postprocessing_op(training.output)  # pylint: disable=unused-variable

> - The fact that this pipeline is specified in python would make this especially easy to unit test. Up to you whether that's part of this PR. But can the means of triggering the pipeline run given the output of this script be programmatic? Can we consume a status code for the resulting pipeline run?
> - Very nice.
> - @texasmichelle So would it be reasonable to use this mechanism to test the pipeline/example, or should that be left for the future?

if __name__ == '__main__':
    import kfp.compiler as compiler
    compiler.Compiler().compile(kubeflow_training, __file__ + '.tar.gz')

> - It feels like the docs skip a step at this point - are people expected to have written the katib job spec manually (i.e. where is it coming from)? Would it be appropriate to have a kubeflow/pipelines op for launching a katib studyjob? I would find this really convenient. But perhaps beyond the current scope.
> - Added a link to the source of the gpu example file. I'm not sure how to apply a katib manifest using a pipeline step - @vicaire @qimingj do you know if that is supported?
> - It is not supported in pipeline DSL, but it is supported in argo yaml since argo supports any K8s template, not just container spec. We can add Katib to DSL support (such as a KatibOp). In order to do that, ideally Katib's CRD should return some output in its job status (available via kubectl get) so argo can pick it up as output (with a JSON path to the field), and then the job output can be passed to downstream components. We should discuss this.