Updated the xgboost-spark sample README (#132)
Fixed a link. Clarified YAML vs TAR format for workflow specification. Made other textual improvements.
sarahmaddox authored and k8s-ci-robot committed Nov 7, 2018
1 parent 8da65c5 commit 239295a
Showing 1 changed file with 18 additions and 12 deletions: samples/xgboost-spark/README.md
## Overview
The `xgboost-training-cm.py` pipeline creates XGBoost models on structured data in CSV format. Both classification and regression are supported.

The pipeline starts by creating a Google Cloud DataProc cluster, and then running analysis, transformation, distributed training, and prediction in the created cluster. A single-node confusion-matrix aggregator is then used (for the classification case) to provide the confusion matrix data to the front end. Finally, a delete cluster operation runs to destroy the cluster created at the beginning. The delete cluster operation is used as an exit handler, meaning it will run regardless of whether the pipeline fails or not.
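For reference, this is how the exit-handler pattern can be expressed with the Kubeflow Pipelines SDK. It is a minimal sketch only, assuming the `dsl.ContainerOp` and `dsl.ExitHandler` APIs from the KFP SDK; the images, names, and arguments below are placeholders, not the actual components used by this sample.

```python
import kfp.dsl as dsl


@dsl.pipeline(
    name='xgboost-spark-sketch',
    description='Sketch of the exit-handler pattern used by this sample.')
def xgboost_sketch(project='my-gcp-project', output='gs://my-bucket/output'):
    # Placeholder op standing in for the sample's delete-cluster component.
    delete_cluster = dsl.ContainerOp(
        name='delete-cluster',
        image='gcr.io/example/delete-cluster',  # placeholder image
        arguments=['--project', project])

    # Every step created inside the ExitHandler runs first; delete_cluster then
    # runs as the exit handler, whether those steps succeed or fail.
    with dsl.ExitHandler(delete_cluster):
        dsl.ContainerOp(
            name='create-cluster',
            image='gcr.io/example/create-cluster',  # placeholder image
            arguments=['--project', project, '--output', output])
```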

## Requirements
Preprocessing uses Google Cloud DataProc. Therefore, you must enable the [DataProc API](https://cloud.google.com/endpoints/docs/openapi/enable-api) for the given GCP project.

## Compile
Follow the guide to [building a pipeline](https://github.com/kubeflow/pipelines/wiki/Build-a-Pipeline) to install the Kubeflow Pipelines SDK and compile the sample Python code into a workflow specification. The specification takes the form of a YAML file compressed into a `.tar.gz` file.
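As a sketch, the compilation step can also be done from Python (assuming the SDK is installed with `pip install kfp`; the module and function names below are illustrative stand-ins for however you expose the pipeline function defined in `xgboost-training-cm.py`):

```python
import kfp.compiler as compiler

# Placeholder import: point this at the pipeline function from the sample.
from my_pipeline_module import my_pipeline_func

# Writes the workflow specification: a YAML file packaged into a .tar.gz.
compiler.Compiler().compile(my_pipeline_func, 'xgboost-training-cm.py.tar.gz')
```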

## Deploy
Open the Kubeflow Pipelines UI. Create a new pipeline, and then upload the compiled specification (the `.tar.gz` file) as a new pipeline template.
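If you prefer to script this step instead of using the UI, the SDK client can upload the package for you. This is a sketch, assuming a reachable Kubeflow Pipelines endpoint; the host value and pipeline name are placeholders.

```python
import kfp

# Connect to the Kubeflow Pipelines API; replace the host with your endpoint.
client = kfp.Client(host='http://localhost:8080')

# Register the compiled specification as a new pipeline template.
client.upload_pipeline('xgboost-training-cm.py.tar.gz',
                       pipeline_name='xgboost-spark')
```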

## Run
Most arguments come with default values. Only `output` and `project` always need to be filled in.

* `output` is a Google Cloud Storage path that holds the pipeline run results. Note that each pipeline run creates a unique directory under `output`, so it will not overwrite previous results.
* `project` is a GCP project.
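For reference, here is a sketch of launching a run programmatically with those two arguments supplied. The host, experiment name, bucket path, and project ID are placeholders.

```python
import kfp

client = kfp.Client(host='http://localhost:8080')  # placeholder endpoint
experiment = client.create_experiment('xgboost-spark-demo')

# 'output' and 'project' are the two arguments that must always be provided.
run = client.run_pipeline(
    experiment.id,
    job_name='xgboost-spark-run',
    pipeline_package_path='xgboost-training-cm.py.tar.gz',
    params={
        'output': 'gs://my-bucket/xgboost-output',  # placeholder GCS path
        'project': 'my-gcp-project',                # placeholder project ID
    })
```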

## Components source

Create Cluster:
[source code](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/xgboost/create_cluster)