[Sample] CI Sample: Kaggle (kubeflow#3021)
* kaggle sample

* code path

* fix typo

* visualize table component

* visualize html

* train model step

* submit result

* real image

* fix typo

* push before use

* sed to replace image in component.yaml

* general instructions

* typos; more robust; better code style

* notice about gcp sa and workload identity choice
dldaisy authored and Jeffwan committed Dec 9, 2020
1 parent 3cb183e commit 73f20d5
Showing 19 changed files with 745 additions and 0 deletions.
@@ -0,0 +1,24 @@
# Kaggle Competition Pipeline Sample

## Pipeline Overview

This is a pipeline for [house price prediction](https://www.kaggle.com/c/house-prices-advanced-regression-techniques), an entry-level Kaggle competition. It demonstrates how to complete a Kaggle competition with a pipeline whose steps download the data, preprocess and visualize it, train a model, and submit the results to the Kaggle website.

* For the model implementation and data visualization, we draw on [the notebook by Raj Kumar Gupta](https://www.kaggle.com/rajgupta5/house-price-prediction) and [the notebook by Sergei Neviadomski](https://www.kaggle.com/neviadomski/how-to-get-to-top-25-with-simple-model-sklearn).

* We use the [Kaggle Python API](https://github.com/Kaggle/kaggle-api) to interact with the Kaggle site, for example to download data and submit results. See its documentation for more usage.

* We use [Cloud Build](https://cloud.google.com/cloud-build/) for the CI process: a build and a pipeline run are triggered automatically as soon as code is pushed to the GitHub repo. To get this behavior, you need to set up a Cloud Build trigger for your GitHub repo branch.
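
The Kaggle interaction described above comes down to two CLI calls. As a minimal sketch (the helper names `download_command` and `submit_command` are hypothetical and not part of this sample, and the submit form is the kaggle CLI's usual `-c/-f/-m` shape rather than something taken from this sample's submit script), the commands could be assembled like this before being handed to `subprocess.run`:

```python
import shlex

# Competition slug used by this sample.
COMPETITION = "house-prices-advanced-regression-techniques"


def download_command(competition=COMPETITION):
    """Argument list for downloading a competition's data with the kaggle CLI."""
    return ["kaggle", "competitions", "download", "-c", competition]


def submit_command(result_file, message, competition=COMPETITION):
    """Argument list for submitting a prediction file with the kaggle CLI."""
    return ["kaggle", "competitions", "submit", "-c", competition,
            "-f", result_file, "-m", message]


if __name__ == "__main__":
    # Print the commands instead of running them, so no credentials are needed.
    for cmd in (download_command(), submit_command("submission.csv", "submit")):
        print(" ".join(shlex.quote(part) for part in cmd))
```

Building argument lists instead of a single shell string avoids quoting problems if a file name or message contains spaces.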

## Notice
* You can authenticate to GCP services in either of two ways: create a "user-gcp-sa" secret by following the troubleshooting section of the [Kubeflow Pipelines repo](https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize), or configure Workload Identity as described in [this guide](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity). This sample uses the first method, which will soon be deprecated. We recommend using the second method in place of the "user-gcp-sa" service account going forward.

## Usage

* Substitute the constants under "substitutions" in cloudbuild.yaml.
* Fill in KAGGLE_USERNAME and KAGGLE_KEY in the Dockerfiles (in the "download_dataset" and "submit_result" folders) to authenticate to Kaggle. You can get both values from an API token created on your Kaggle "My Account" page.
* Set up Cloud Build triggers on your GitHub repo for continuous integration.
* Replace CLOUDSDK_COMPUTE_ZONE and CLOUDSDK_CONTAINER_CLUSTER in cloudbuild.yaml with your own zone and cluster.
* Enable "Kubernetes Engine Developer" in the Cloud Build settings.
* Make your GCS bucket public, or grant Cloud Storage access to Cloud Build and Kubeflow Pipelines.
* Commit a change and push it to the GitHub repo.
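
Bucket access matters because the CI script turns the `gs://` bucket name and the commit SHA into an HTTPS URL from which Kubeflow Pipelines fetches the compiled package. A minimal sketch of that mapping, using a hypothetical helper name `package_url`; it removes the literal `gs://` prefix instead of calling `str.lstrip`, which strips a set of characters and can eat the start of a bucket name:

```python
def package_url(bucket_name, commit_sha):
    """Build the public URL of the pipeline package for one commit."""
    prefix = "gs://"
    if bucket_name.startswith(prefix):
        bucket = bucket_name[len(prefix):]
    else:
        bucket = bucket_name
    return "/".join(
        ["https://storage.googleapis.com", bucket.strip("/"), commit_sha, "pipeline.zip"]
    )


if __name__ == "__main__":
    print(package_url("gs://my-project-bucket", "73f20d5"))
    # The pitfall: lstrip removes leading 'g', 's', ':' and '/' characters,
    # so a bucket whose name starts with one of them loses letters.
    print("gs://storage-bucket".lstrip("gs://"))  # prints "torage-bucket"
```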
@@ -0,0 +1,183 @@
steps:
  - name: "gcr.io/cloud-builders/docker"
    args:
      [
        "build",
        "-t",
        "${_GCR_PATH}/kaggle_download:$COMMIT_SHA",
        "-t",
        "${_GCR_PATH}/kaggle_download:latest",
        "${_CODE_PATH}/download_dataset",
        "-f",
        "${_CODE_PATH}/download_dataset/Dockerfile",
      ]
    id: "BuildDownloadDataImage"

  - name: "gcr.io/cloud-builders/docker"
    args:
      [
        "push",
        "${_GCR_PATH}/kaggle_download:$COMMIT_SHA",
      ]
    id: "PushDownloadDataImage"
    waitFor: ["BuildDownloadDataImage"]

  - name: "gcr.io/cloud-builders/docker"
    args:
      [
        "build",
        "-t",
        "${_GCR_PATH}/kaggle_visualize_table:$COMMIT_SHA",
        "-t",
        "${_GCR_PATH}/kaggle_visualize_table:latest",
        "${_CODE_PATH}/visualize_table",
        "-f",
        "${_CODE_PATH}/visualize_table/Dockerfile",
      ]
    id: "BuildVisualizeTableImage"

  - name: "gcr.io/cloud-builders/docker"
    args:
      [
        "push",
        "${_GCR_PATH}/kaggle_visualize_table:$COMMIT_SHA",
      ]
    id: "PushVisualizeTableImage"
    waitFor: ["BuildVisualizeTableImage"]

  - name: "gcr.io/cloud-builders/docker"
    args:
      [
        "build",
        "-t",
        "${_GCR_PATH}/kaggle_visualize_html:$COMMIT_SHA",
        "-t",
        "${_GCR_PATH}/kaggle_visualize_html:latest",
        "${_CODE_PATH}/visualize_html",
        "-f",
        "${_CODE_PATH}/visualize_html/Dockerfile",
      ]
    id: "BuildVisualizeHTMLImage"

  - name: "gcr.io/cloud-builders/docker"
    args:
      [
        "push",
        "${_GCR_PATH}/kaggle_visualize_html:$COMMIT_SHA",
      ]
    id: "PushVisualizeHTMLImage"
    waitFor: ["BuildVisualizeHTMLImage"]

  - name: "gcr.io/cloud-builders/docker"
    args:
      [
        "build",
        "-t",
        "${_GCR_PATH}/kaggle_train:$COMMIT_SHA",
        "-t",
        "${_GCR_PATH}/kaggle_train:latest",
        "${_CODE_PATH}/train_model",
        "-f",
        "${_CODE_PATH}/train_model/Dockerfile",
      ]
    id: "BuildTrainImage"

  - name: "gcr.io/cloud-builders/docker"
    args:
      [
        "push",
        "${_GCR_PATH}/kaggle_train:$COMMIT_SHA",
      ]
    id: "PushTrainImage"
    waitFor: ["BuildTrainImage"]

  - name: "gcr.io/cloud-builders/docker"
    args:
      [
        "build",
        "-t",
        "${_GCR_PATH}/kaggle_submit:$COMMIT_SHA",
        "-t",
        "${_GCR_PATH}/kaggle_submit:latest",
        "${_CODE_PATH}/submit_result",
        "-f",
        "${_CODE_PATH}/submit_result/Dockerfile",
      ]
    id: "BuildSubmitImage"

  - name: "gcr.io/cloud-builders/docker"
    args:
      [
        "push",
        "${_GCR_PATH}/kaggle_submit:$COMMIT_SHA",
      ]
    id: "PushSubmitImage"
    waitFor: ["BuildSubmitImage"]

  - name: "python:3.7-slim"
    entrypoint: "/bin/sh"
    args: [
      "-c",
      "set -ex;
      cd ${_CODE_PATH};
      pip3 install cffi==1.12.3 --upgrade;
      pip3 install kfp==0.1.38;
      sed -i 's|image: download_image_location|image: ${_GCR_PATH}/kaggle_download:$COMMIT_SHA|g' ./download_dataset/component.yaml;
      sed -i 's|image: visualizetable_image_location|image: ${_GCR_PATH}/kaggle_visualize_table:$COMMIT_SHA|g' ./visualize_table/component.yaml;
      sed -i 's|image: visualizehtml_image_location|image: ${_GCR_PATH}/kaggle_visualize_html:$COMMIT_SHA|g' ./visualize_html/component.yaml;
      sed -i 's|image: train_image_location|image: ${_GCR_PATH}/kaggle_train:$COMMIT_SHA|g' ./train_model/component.yaml;
      sed -i 's|image: submit_image_location|image: ${_GCR_PATH}/kaggle_submit:$COMMIT_SHA|g' ./submit_result/component.yaml;
      python pipeline.py
      --gcr_address ${_GCR_PATH};
      cp pipeline.py.zip /workspace/pipeline.zip",
    ]
    id: "KagglePackagePipeline"

  - name: "gcr.io/cloud-builders/gsutil"
    args:
      [
        "cp",
        "/workspace/pipeline.zip",
        "${_GS_BUCKET}/$COMMIT_SHA/pipeline.zip"
      ]
    id: "KaggleUploadPipeline"
    waitFor: ["KagglePackagePipeline"]


  - name: "gcr.io/cloud-builders/kubectl"
    entrypoint: "/bin/sh"
    args: [
      "-c",
      "cd ${_CODE_PATH};
      apt-get update;
      apt-get install -y python3-pip;
      apt-get install -y libssl-dev libffi-dev;
      /builder/kubectl.bash;
      pip3 install kfp;
      pip3 install kubernetes;
      python3 create_pipeline_version_and_run.py
      --pipeline_id ${_PIPELINE_ID}
      --commit_sha $COMMIT_SHA
      --bucket_name ${_GS_BUCKET}
      --gcr_address ${_GCR_PATH}"
    ]
    env:
      - "CLOUDSDK_COMPUTE_ZONE=[Your cluster zone, for example: us-central1-a]"
      - "CLOUDSDK_CONTAINER_CLUSTER=[Your cluster name, for example: my-cluster]"
    id: "KaggleCreatePipelineVersionAndRun"

images:
  - "${_GCR_PATH}/kaggle_download:latest"
  - "${_GCR_PATH}/kaggle_visualize_table:latest"
  - "${_GCR_PATH}/kaggle_visualize_html:latest"
  - "${_GCR_PATH}/kaggle_train:latest"
  - "${_GCR_PATH}/kaggle_submit:latest"


substitutions:
  _CODE_PATH: /workspace/samples/contrib/versioned-pipeline-ci-samples/kaggle-ci-sample
  _NAMESPACE: kubeflow
  _GCR_PATH: [Your cloud registry path. For example, gcr.io/my-project-id]
  _GS_BUCKET: [Name of your cloud storage bucket. For example, gs://my-project-bucket]
  _PIPELINE_ID: [Your kubeflow pipeline id to create a version on. Get it from Kubeflow Pipeline UI.
    For example, f6f8558a-6eec-4ef4-b343-a650473ee613]
@@ -0,0 +1,47 @@
import kfp
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--commit_sha', help='Required. Commit SHA, used as the version name. Must be unique.', type=str, required=True)
parser.add_argument('--pipeline_id', help='Required. Pipeline ID.', type=str, required=True)
parser.add_argument('--bucket_name', help='Required. GCS bucket to store files.', type=str, required=True)
parser.add_argument('--gcr_address', help='Required. Cloud registry address. For example, gcr.io/my-project', type=str, required=True)
parser.add_argument('--host', help='Host address of kfp.Client. Detected from the cluster automatically if omitted.', type=str, default='')
parser.add_argument('--run_name', help='Name of the new run.', type=str, default='')
parser.add_argument('--experiment_id', help='Experiment ID.', type=str)
parser.add_argument('--code_source_url', help='URL of the source code.', type=str, default='')
args = parser.parse_args()

if args.host:
    client = kfp.Client(host=args.host)
else:
    client = kfp.Client()

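# Create a pipeline version from the package uploaded for this commit.
# str.lstrip('gs://') strips any leading 'g', 's', ':' or '/' characters,
# not the literal prefix, so remove the prefix explicitly.
bucket = args.bucket_name
if bucket.startswith('gs://'):
    bucket = bucket[len('gs://'):]
package_url = '/'.join(['https://storage.googleapis.com', bucket.strip('/'), args.commit_sha, 'pipeline.zip'])
version_name = args.commit_sha
version_body = {
    "name": version_name,
    "code_source_url": args.code_source_url,
    "package_url": {"pipeline_url": package_url},
    "resource_references": [{"key": {"id": args.pipeline_id, "type": 3}, "relationship": 1}],
}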

response = client.pipelines.create_pipeline_version(version_body)
version_id = response.id
# Create a run from the new pipeline version.
run_name = args.run_name if args.run_name else 'run' + version_id
resource_references = [{"key": {"id": version_id, "type": 4}, "relationship": 2}]
if args.experiment_id:
    resource_references.append({"key": {"id": args.experiment_id, "type": 1}, "relationship": 1})
run_body = {
    "name": run_name,
    "pipeline_spec": {"parameters": [{"name": "bucket_name", "value": args.bucket_name},
                                     {"name": "commit_sha", "value": args.commit_sha}]},
    "resource_references": resource_references,
}
try:
    client.runs.create_run(run_body)
except Exception as e:
    # Report the failure and re-raise so the CI step does not pass silently.
    print('Error creating run: %s' % e)
    raise




@@ -0,0 +1,7 @@
FROM python:3.7
ENV KAGGLE_USERNAME=[YOUR KAGGLE USERNAME] \
KAGGLE_KEY=[YOUR KAGGLE KEY]
RUN pip install kaggle
RUN pip install google-cloud-storage
COPY ./download_data.py .
CMD ["python", "download_data.py"]
@@ -0,0 +1,15 @@
name: download dataset
description: download the Kaggle competition data and upload it to a GCS bucket
inputs:
  - {name: bucket_name, type: GCSPath}
outputs:
  - {name: train_dataset, type: string}
  - {name: test_dataset, type: string}
implementation:
  container:
    image: download_image_location
    command: ['python', 'download_data.py']
    args: ['--bucket_name', {inputValue: bucket_name}]
    fileOutputs:
      train_dataset: /train.txt
      test_dataset: /test.txt
@@ -0,0 +1,31 @@
"""
Step #1: download data from the Kaggle website and push it to a GCS bucket.
"""

def process_and_upload(
    bucket_name
):
    from google.cloud import storage
    storage_client = storage.Client()
    # Remove the literal 'gs://' prefix; str.lstrip would strip a character set instead.
    bare_name = bucket_name[len('gs://'):] if bucket_name.startswith('gs://') else bucket_name
    bucket = storage_client.get_bucket(bare_name)
    train_blob = bucket.blob('train.csv')
    test_blob = bucket.blob('test.csv')
    train_blob.upload_from_filename('train.csv')
    test_blob.upload_from_filename('test.csv')

    # Write the GCS paths to files so later pipeline steps can read them as outputs.
    with open('train.txt', 'w') as f:
        f.write(bucket_name+'/train.csv')
    with open('test.txt', 'w') as f:
        f.write(bucket_name+'/test.csv')

if __name__ == '__main__':
    import os
    os.system("kaggle competitions download -c house-prices-advanced-regression-techniques")
    os.system("unzip house-prices-advanced-regression-techniques")
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('--bucket_name', type=str)
    args = parser.parse_args()

    process_and_upload(args.bucket_name)

@@ -0,0 +1,40 @@
import kfp.dsl as dsl
import kfp.components as components
from kfp.gcp import use_gcp_secret

@dsl.pipeline(
    name = "kaggle pipeline",
    description = "kaggle pipeline that goes from download data, analyse data, train model to submit result"
)
def kaggle_houseprice(
    bucket_name: str,
    commit_sha: str
):

    downloadDataOp = components.load_component_from_file('./download_dataset/component.yaml')
    downloadDataStep = downloadDataOp(bucket_name=bucket_name).apply(use_gcp_secret('user-gcp-sa'))

    visualizeTableOp = components.load_component_from_file('./visualize_table/component.yaml')
    visualizeTableStep = visualizeTableOp(train_file_path='%s' % downloadDataStep.outputs['train_dataset']).apply(use_gcp_secret('user-gcp-sa'))

    visualizeHTMLOp = components.load_component_from_file('./visualize_html/component.yaml')
    visualizeHTMLStep = visualizeHTMLOp(train_file_path='%s' % downloadDataStep.outputs['train_dataset'],
                                        commit_sha=commit_sha,
                                        bucket_name=bucket_name).apply(use_gcp_secret('user-gcp-sa'))

    trainModelOp = components.load_component_from_file('./train_model/component.yaml')
    trainModelStep = trainModelOp(train_file='%s' % downloadDataStep.outputs['train_dataset'],
                                  test_file='%s' % downloadDataStep.outputs['test_dataset'],
                                  bucket_name=bucket_name).apply(use_gcp_secret('user-gcp-sa'))

    submitResultOp = components.load_component_from_file('./submit_result/component.yaml')
    submitResultStep = submitResultOp(result_file='%s' % trainModelStep.outputs['result'],
                                      submit_message='submit').apply(use_gcp_secret('user-gcp-sa'))

if __name__ == '__main__':
    import kfp.compiler as compiler
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('--gcr_address', type = str)
    args = parser.parse_args()
    compiler.Compiler().compile(kaggle_houseprice, __file__ + '.zip')
@@ -0,0 +1,7 @@
FROM python:3.7
ENV KAGGLE_USERNAME=[YOUR KAGGLE USERNAME] \
KAGGLE_KEY=[YOUR KAGGLE KEY]
RUN pip install kaggle
RUN pip install gcsfs
COPY ./submit_result.py .
CMD ["python", "submit_result.py"]
@@ -0,0 +1,11 @@
name: submit result
description: submit prediction result to kaggle
inputs:
  - {name: result_file, type: string}
  - {name: submit_message, type: string}
implementation:
  container:
    image: submit_image_location
    command: ['python', 'submit_result.py']
    args: ['--result_file', {inputValue: result_file},
           '--submit_message', {inputValue: submit_message}]