[Sample] CI Sample: Kaggle (kubeflow#3021)
* kaggle sample
* code path
* fix typo
* visualize table component
* visualize html
* train model step
* submit result
* real image
* fix typo
* push before use
* sed to replace image in component.yaml
* general instructions
* typos; more robust; better code style
* notice about gcp sa and workload identity choice
Showing 19 changed files with 745 additions and 0 deletions.
24 changes: 24 additions & 0 deletions
samples/contrib/versioned-pipeline-ci-samples/kaggle-ci-sample/README.md
@@ -0,0 +1,24 @@
# Kaggle Competition Pipeline Sample

## Pipeline Overview

This is a pipeline for [house price prediction](https://www.kaggle.com/c/house-prices-advanced-regression-techniques), an entry-level Kaggle competition. It demonstrates how to complete a Kaggle competition with a pipeline whose steps download the data, preprocess and visualize it, train a model, and submit the results to the Kaggle website.

* The model implementation and data visualization follow [the notebook by Raj Kumar Gupta](https://www.kaggle.com/rajgupta5/house-price-prediction) and [the notebook by Sergei Neviadomski](https://www.kaggle.com/neviadomski/how-to-get-to-top-25-with-simple-model-sklearn).

* We use the [Kaggle Python API](https://github.com/Kaggle/kaggle-api) to interact with the Kaggle site, for example to download data and submit results. See its documentation for more usage; a short sketch follows this list.

* We use [Cloud Build](https://cloud.google.com/cloud-build/) for the CI process: a build is triggered and run automatically as soon as code is pushed to the GitHub repo. To enable this, set up a Cloud Build trigger on your GitHub repo branch.

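As a rough, hedged sketch (not part of the sample, which shells out to the `kaggle` CLI instead), the same operations look like this with the Kaggle Python API; the competition slug matches this sample, the file name is illustrative, and credentials are assumed to come from `KAGGLE_USERNAME`/`KAGGLE_KEY` environment variables as in the sample's Dockerfiles:

```python
# Sketch only: programmatic equivalents of the `kaggle` CLI calls this sample uses.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads KAGGLE_USERNAME / KAGGLE_KEY from the environment

competition = 'house-prices-advanced-regression-techniques'
api.competition_download_files(competition, path='.')              # like `kaggle competitions download -c ...`
api.competition_submit('submission.csv', 'submit', competition)    # like `kaggle competitions submit`
```
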
## Notice
* You can authenticate to GCP services in either of two ways: create a "user-gcp-sa" secret by following the troubleshooting section of the [Kubeflow Pipelines repo](https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize), or configure Workload Identity as instructed in [this guide](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity). This sample uses the first method, which will soon be deprecated; we recommend the second method as a replacement for the "user-gcp-sa" service account in the future.

## Usage

* Substitute the constants under "substitutions" in cloudbuild.yaml.
* Fill in your kaggle_username and kaggle_key in the Dockerfiles (in the "download_dataset" and "submit_result" folders) to authenticate to Kaggle. You can get them from an API token created on your Kaggle "My Account" page.
* Set up Cloud Build triggers for your GitHub repo to enable continuous integration.
* Replace CLOUDSDK_COMPUTE_ZONE and CLOUDSDK_CONTAINER_CLUSTER in cloudbuild.yaml with your own zone and cluster.
* Enable "Kubernetes Engine Developer" in the Cloud Build settings.
* Make your GCS bucket public, or grant Cloud Storage access to Cloud Build and Kubeflow Pipelines.
* Commit and push to the GitHub repo to trigger a build. To try the pipeline without the trigger, you can also compile and run it by hand, as in the sketch after this list.
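
A minimal sketch of such a manual test, assuming the component images already exist and the image placeholders in the component.yaml files have been substituted. The host, bucket, and experiment names are illustrative, and the kfp.Client calls follow the 0.1.x SDK pinned in cloudbuild.yaml (signatures may differ in newer releases):

```python
# Sketch only: compile pipeline.py and start a one-off run with the kfp SDK,
# bypassing the Cloud Build / versioned-pipeline flow used by the CI sample.
import kfp
import kfp.compiler as compiler
from pipeline import kaggle_houseprice

compiler.Compiler().compile(kaggle_houseprice, 'pipeline.py.zip')

client = kfp.Client(host='http://<your-ml-pipeline-endpoint>')  # illustrative host
experiment = client.create_experiment('kaggle-ci-sample-manual')
client.run_pipeline(
    experiment.id,
    job_name='kaggle-houseprice-manual-test',
    pipeline_package_path='pipeline.py.zip',
    params={'bucket_name': 'gs://my-project-bucket', 'commit_sha': 'manual-test'},
)
```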
183 changes: 183 additions & 0 deletions
samples/contrib/versioned-pipeline-ci-samples/kaggle-ci-sample/cloudbuild.yaml
@@ -0,0 +1,183 @@
steps:
- name: "gcr.io/cloud-builders/docker"
  args:
    [
      "build",
      "-t",
      "${_GCR_PATH}/kaggle_download:$COMMIT_SHA",
      "-t",
      "${_GCR_PATH}/kaggle_download:latest",
      "${_CODE_PATH}/download_dataset",
      "-f",
      "${_CODE_PATH}/download_dataset/Dockerfile",
    ]
  id: "BuildDownloadDataImage"

- name: "gcr.io/cloud-builders/docker"
  args:
    [
      "push",
      "${_GCR_PATH}/kaggle_download:$COMMIT_SHA",
    ]
  id: "PushDownloadDataImage"
  waitFor: ["BuildDownloadDataImage"]

- name: "gcr.io/cloud-builders/docker"
  args:
    [
      "build",
      "-t",
      "${_GCR_PATH}/kaggle_visualize_table:$COMMIT_SHA",
      "-t",
      "${_GCR_PATH}/kaggle_visualize_table:latest",
      "${_CODE_PATH}/visualize_table",
      "-f",
      "${_CODE_PATH}/visualize_table/Dockerfile",
    ]
  id: "BuildVisualizeTableImage"

- name: "gcr.io/cloud-builders/docker"
  args:
    [
      "push",
      "${_GCR_PATH}/kaggle_visualize_table:$COMMIT_SHA",
    ]
  id: "PushVisualizeTableImage"
  waitFor: ["BuildVisualizeTableImage"]

- name: "gcr.io/cloud-builders/docker"
  args:
    [
      "build",
      "-t",
      "${_GCR_PATH}/kaggle_visualize_html:$COMMIT_SHA",
      "-t",
      "${_GCR_PATH}/kaggle_visualize_html:latest",
      "${_CODE_PATH}/visualize_html",
      "-f",
      "${_CODE_PATH}/visualize_html/Dockerfile",
    ]
  id: "BuildVisualizeHTMLImage"

- name: "gcr.io/cloud-builders/docker"
  args:
    [
      "push",
      "${_GCR_PATH}/kaggle_visualize_html:$COMMIT_SHA",
    ]
  id: "PushVisualizeHTMLImage"
  waitFor: ["BuildVisualizeHTMLImage"]

- name: "gcr.io/cloud-builders/docker"
  args:
    [
      "build",
      "-t",
      "${_GCR_PATH}/kaggle_train:$COMMIT_SHA",
      "-t",
      "${_GCR_PATH}/kaggle_train:latest",
      "${_CODE_PATH}/train_model",
      "-f",
      "${_CODE_PATH}/train_model/Dockerfile",
    ]
  id: "BuildTrainImage"

- name: "gcr.io/cloud-builders/docker"
  args:
    [
      "push",
      "${_GCR_PATH}/kaggle_train:$COMMIT_SHA",
    ]
  id: "PushTrainImage"
  waitFor: ["BuildTrainImage"]

- name: "gcr.io/cloud-builders/docker"
  args:
    [
      "build",
      "-t",
      "${_GCR_PATH}/kaggle_submit:$COMMIT_SHA",
      "-t",
      "${_GCR_PATH}/kaggle_submit:latest",
      "${_CODE_PATH}/submit_result",
      "-f",
      "${_CODE_PATH}/submit_result/Dockerfile",
    ]
  id: "BuildSubmitImage"

- name: "gcr.io/cloud-builders/docker"
  args:
    [
      "push",
      "${_GCR_PATH}/kaggle_submit:$COMMIT_SHA",
    ]
  id: "PushSubmitImage"
  waitFor: ["BuildSubmitImage"]

- name: "python:3.7-slim"
  entrypoint: "/bin/sh"
  args: [
      "-c",
      "set -ex;
      cd ${_CODE_PATH};
      pip3 install cffi==1.12.3 --upgrade;
      pip3 install kfp==0.1.38;
      sed -i 's|image: download_image_location|image: ${_GCR_PATH}/kaggle_download:$COMMIT_SHA|g' ./download_dataset/component.yaml;
      sed -i 's|image: visualizetable_image_location|image: ${_GCR_PATH}/kaggle_visualize_table:$COMMIT_SHA|g' ./visualize_table/component.yaml;
      sed -i 's|image: visualizehtml_image_location|image: ${_GCR_PATH}/kaggle_visualize_html:$COMMIT_SHA|g' ./visualize_html/component.yaml;
      sed -i 's|image: train_image_location|image: ${_GCR_PATH}/kaggle_train:$COMMIT_SHA|g' ./train_model/component.yaml;
      sed -i 's|image: submit_image_location|image: ${_GCR_PATH}/kaggle_submit:$COMMIT_SHA|g' ./submit_result/component.yaml;
      python pipeline.py
      --gcr_address ${_GCR_PATH};
      cp pipeline.py.zip /workspace/pipeline.zip",
    ]
  id: "KagglePackagePipeline"

- name: "gcr.io/cloud-builders/gsutil"
  args:
    [
      "cp",
      "/workspace/pipeline.zip",
      "${_GS_BUCKET}/$COMMIT_SHA/pipeline.zip"
    ]
  id: "KaggleUploadPipeline"
  waitFor: ["KagglePackagePipeline"]

- name: "gcr.io/cloud-builders/kubectl"
  entrypoint: "/bin/sh"
  args: [
      "-c",
      "cd ${_CODE_PATH};
      apt-get update;
      apt-get install -y python3-pip;
      apt-get install -y libssl-dev libffi-dev;
      /builder/kubectl.bash;
      pip3 install kfp;
      pip3 install kubernetes;
      python3 create_pipeline_version_and_run.py
      --pipeline_id ${_PIPELINE_ID}
      --commit_sha $COMMIT_SHA
      --bucket_name ${_GS_BUCKET}
      --gcr_address ${_GCR_PATH}"
    ]
  env:
  - "CLOUDSDK_COMPUTE_ZONE=[Your cluster zone, for example: us-central1-a]"
  - "CLOUDSDK_CONTAINER_CLUSTER=[Your cluster name, for example: my-cluster]"
  id: "KaggleCreatePipelineVersionAndRun"

images:
- "${_GCR_PATH}/kaggle_download:latest"
- "${_GCR_PATH}/kaggle_visualize_table:latest"
- "${_GCR_PATH}/kaggle_visualize_html:latest"
- "${_GCR_PATH}/kaggle_train:latest"
- "${_GCR_PATH}/kaggle_submit:latest"

substitutions:
  _CODE_PATH: /workspace/samples/contrib/versioned-pipeline-ci-samples/kaggle-ci-sample
  _NAMESPACE: kubeflow
  _GCR_PATH: [Your cloud registry path. For example, gcr.io/my-project-id]
  _GS_BUCKET: [Name of your cloud storage bucket. For example, gs://my-project-bucket]
  _PIPELINE_ID: [Your kubeflow pipeline id to create a version on. Get it from the Kubeflow Pipelines UI. For example, f6f8558a-6eec-4ef4-b343-a650473ee613]
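
For readers less familiar with sed, the following is a rough Python sketch of what the KagglePackagePipeline substitution step does: it pins each component.yaml's image placeholder to the image just built for this commit. The registry path and commit SHA values are illustrative; this snippet is not part of the sample.

```python
# Sketch only: Python equivalent of the `sed -i 's|image: ...|...|g'` commands above.
from pathlib import Path

replacements = {
    './download_dataset/component.yaml': ('download_image_location', 'kaggle_download'),
    './visualize_table/component.yaml': ('visualizetable_image_location', 'kaggle_visualize_table'),
    './visualize_html/component.yaml': ('visualizehtml_image_location', 'kaggle_visualize_html'),
    './train_model/component.yaml': ('train_image_location', 'kaggle_train'),
    './submit_result/component.yaml': ('submit_image_location', 'kaggle_submit'),
}

gcr_path, commit_sha = 'gcr.io/my-project-id', 'abc1234'  # illustrative values
for path, (placeholder, image) in replacements.items():
    text = Path(path).read_text()
    text = text.replace('image: ' + placeholder, 'image: %s/%s:%s' % (gcr_path, image, commit_sha))
    Path(path).write_text(text)
```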
47 changes: 47 additions & 0 deletions
...contrib/versioned-pipeline-ci-samples/kaggle-ci-sample/create_pipeline_version_and_run.py
@@ -0,0 +1,47 @@
import argparse
import os

import kfp

parser = argparse.ArgumentParser()
parser.add_argument('--commit_sha', help='Required. Commit SHA, used as the version name. Must be unique.', type=str)
parser.add_argument('--pipeline_id', help='Required. Pipeline id to create the version on.', type=str)
parser.add_argument('--bucket_name', help='Required. GCS bucket that stores the pipeline package.', type=str)
parser.add_argument('--gcr_address', help='Required. Container registry address. For example, gcr.io/my-project', type=str)
parser.add_argument('--host', help='Host address for kfp.Client. If empty, it is resolved from the cluster automatically.', type=str, default='')
parser.add_argument('--run_name', help='Name of the new run.', type=str, default='')
parser.add_argument('--experiment_id', help='Experiment id to associate the run with.', type=str)
parser.add_argument('--code_source_url', help='URL of the source code.', type=str, default='')
args = parser.parse_args()

if args.host:
    client = kfp.Client(host=args.host)
else:
    client = kfp.Client()

# Create a pipeline version that points at the package uploaded by the build.
bucket = args.bucket_name[len('gs://'):] if args.bucket_name.startswith('gs://') else args.bucket_name
package_url = os.path.join('https://storage.googleapis.com', bucket, args.commit_sha, 'pipeline.zip')
version_name = args.commit_sha
version_body = {"name": version_name,
                "code_source_url": args.code_source_url,
                "package_url": {"pipeline_url": package_url},
                "resource_references": [{"key": {"id": args.pipeline_id, "type": 3}, "relationship": 1}]}

response = client.pipelines.create_pipeline_version(version_body)
version_id = response.id

# Create a run from the new version.
run_name = args.run_name if args.run_name else 'run' + version_id
resource_references = [{"key": {"id": version_id, "type": 4}, "relationship": 2}]
if args.experiment_id:
    resource_references.append({"key": {"id": args.experiment_id, "type": 1}, "relationship": 1})
run_body = {"name": run_name,
            "pipeline_spec": {"parameters": [{"name": "bucket_name", "value": args.bucket_name},
                                             {"name": "commit_sha", "value": args.commit_sha}]},
            "resource_references": resource_references}
try:
    client.runs.create_run(run_body)
except Exception as e:
    print('Error creating run: %s' % e)
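
The numeric `type` and `relationship` values in the resource references above correspond to the KFP API's ResourceType and Relationship enums. As a hedged reference (my reading of the API version this sample targets, not something stated in the sample itself), they map roughly as follows:

```python
# Illustrative name mapping for the magic numbers used above (not part of the sample).
EXPERIMENT, PIPELINE, PIPELINE_VERSION = 1, 3, 4   # ResourceType values
OWNER, CREATOR = 1, 2                              # Relationship values

# So the version body reads: the new version is OWNED by the given pipeline,
# and the run body reads: the run is CREATED from the new version and,
# if an experiment id is given, OWNED by that experiment.
```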
7 changes: 7 additions & 0 deletions
samples/contrib/versioned-pipeline-ci-samples/kaggle-ci-sample/download_dataset/Dockerfile
@@ -0,0 +1,7 @@
FROM python:3.7
ENV KAGGLE_USERNAME=[YOUR KAGGLE USERNAME] \
    KAGGLE_KEY=[YOUR KAGGLE KEY]
RUN pip install kaggle
RUN pip install google-cloud-storage
COPY ./download_data.py .
CMD ["python", "download_data.py"]
15 changes: 15 additions & 0 deletions
...es/contrib/versioned-pipeline-ci-samples/kaggle-ci-sample/download_dataset/component.yaml
@@ -0,0 +1,15 @@
name: download dataset
description: download the house price dataset from Kaggle and upload it to a GCS bucket
inputs:
- {name: bucket_name, type: GCSPath}
outputs:
- {name: train_dataset, type: string}
- {name: test_dataset, type: string}
implementation:
  container:
    image: download_image_location
    command: ['python', 'download_data.py']
    args: ['--bucket_name', {inputValue: bucket_name}]
    fileOutputs:
      train_dataset: /train.txt
      test_dataset: /test.txt
31 changes: 31 additions & 0 deletions
.../contrib/versioned-pipeline-ci-samples/kaggle-ci-sample/download_dataset/download_data.py
@@ -0,0 +1,31 @@
""" | ||
step #1: download data from kaggle website, and push it to gs bucket | ||
""" | ||
|
||
def process_and_upload( | ||
bucket_name | ||
): | ||
from google.cloud import storage | ||
storage_client = storage.Client() | ||
bucket = storage_client.get_bucket(bucket_name.lstrip('gs://')) | ||
train_blob = bucket.blob('train.csv') | ||
test_blob = bucket.blob('test.csv') | ||
train_blob.upload_from_filename('train.csv') | ||
test_blob.upload_from_filename('test.csv') | ||
|
||
with open('train.txt', 'w') as f: | ||
f.write(bucket_name+'/train.csv') | ||
with open('test.txt', 'w') as f: | ||
f.write(bucket_name+'/test.csv') | ||
|
||
if __name__ == '__main__': | ||
import os | ||
os.system("kaggle competitions download -c house-prices-advanced-regression-techniques") | ||
os.system("unzip house-prices-advanced-regression-techniques") | ||
import argparse | ||
parser = argparse.ArgumentParser() | ||
parser.add_argument('--bucket_name', type=str) | ||
args = parser.parse_args() | ||
|
||
process_and_upload(args.bucket_name) | ||
|
40 changes: 40 additions & 0 deletions
samples/contrib/versioned-pipeline-ci-samples/kaggle-ci-sample/pipeline.py
@@ -0,0 +1,40 @@
import kfp.dsl as dsl
import kfp.components as components
from kfp.gcp import use_gcp_secret


@dsl.pipeline(
    name="kaggle pipeline",
    description="Kaggle pipeline that downloads the data, visualizes and analyzes it, trains a model, and submits the result"
)
def kaggle_houseprice(
    bucket_name: str,
    commit_sha: str
):

    downloadDataOp = components.load_component_from_file('./download_dataset/component.yaml')
    downloadDataStep = downloadDataOp(bucket_name=bucket_name).apply(use_gcp_secret('user-gcp-sa'))

    visualizeTableOp = components.load_component_from_file('./visualize_table/component.yaml')
    visualizeTableStep = visualizeTableOp(train_file_path='%s' % downloadDataStep.outputs['train_dataset']).apply(use_gcp_secret('user-gcp-sa'))

    visualizeHTMLOp = components.load_component_from_file('./visualize_html/component.yaml')
    visualizeHTMLStep = visualizeHTMLOp(train_file_path='%s' % downloadDataStep.outputs['train_dataset'],
                                        commit_sha=commit_sha,
                                        bucket_name=bucket_name).apply(use_gcp_secret('user-gcp-sa'))

    trainModelOp = components.load_component_from_file('./train_model/component.yaml')
    trainModelStep = trainModelOp(train_file='%s' % downloadDataStep.outputs['train_dataset'],
                                  test_file='%s' % downloadDataStep.outputs['test_dataset'],
                                  bucket_name=bucket_name).apply(use_gcp_secret('user-gcp-sa'))

    submitResultOp = components.load_component_from_file('./submit_result/component.yaml')
    submitResultStep = submitResultOp(result_file='%s' % trainModelStep.outputs['result'],
                                      submit_message='submit').apply(use_gcp_secret('user-gcp-sa'))


if __name__ == '__main__':
    import argparse
    import kfp.compiler as compiler
    parser = argparse.ArgumentParser()
    parser.add_argument('--gcr_address', type=str)
    args = parser.parse_args()
    compiler.Compiler().compile(kaggle_houseprice, __file__ + '.zip')
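
As noted in the README, `use_gcp_secret('user-gcp-sa')` is the soon-to-be-deprecated authentication path. A hedged sketch of the Workload Identity alternative (not part of the sample): once the pipeline runner's Kubernetes service account is bound to a GCP service account per the linked guide, the `.apply(...)` calls are simply omitted, for example:

```python
# Sketch only, assuming Workload Identity is configured as in the README Notice.
# Steps then authenticate through the GCP service account bound to the pipeline
# runner's Kubernetes service account, so no secret is mounted. Shown for two
# steps for brevity; the others change the same way.
import kfp.dsl as dsl
import kfp.components as components


@dsl.pipeline(name='kaggle pipeline (workload identity)',
              description='Same pipeline without use_gcp_secret')
def kaggle_houseprice_wi(bucket_name: str, commit_sha: str):
    downloadDataOp = components.load_component_from_file('./download_dataset/component.yaml')
    downloadDataStep = downloadDataOp(bucket_name=bucket_name)  # no .apply(use_gcp_secret(...))

    trainModelOp = components.load_component_from_file('./train_model/component.yaml')
    trainModelStep = trainModelOp(train_file='%s' % downloadDataStep.outputs['train_dataset'],
                                  test_file='%s' % downloadDataStep.outputs['test_dataset'],
                                  bucket_name=bucket_name)
```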
7 changes: 7 additions & 0 deletions
samples/contrib/versioned-pipeline-ci-samples/kaggle-ci-sample/submit_result/Dockerfile
@@ -0,0 +1,7 @@
FROM python:3.7
ENV KAGGLE_USERNAME=[YOUR KAGGLE USERNAME] \
    KAGGLE_KEY=[YOUR KAGGLE KEY]
RUN pip install kaggle
RUN pip install gcsfs
COPY ./submit_result.py .
CMD ["python", "submit_result.py"]
11 changes: 11 additions & 0 deletions
samples/contrib/versioned-pipeline-ci-samples/kaggle-ci-sample/submit_result/component.yaml
@@ -0,0 +1,11 @@
name: submit result
description: submit prediction result to kaggle
inputs:
- {name: result_file, type: string}
- {name: submit_message, type: string}
implementation:
  container:
    image: submit_image_location
    command: ['python', 'submit_result.py']
    args: ['--result_file', {inputValue: result_file},
           '--submit_message', {inputValue: submit_message}]