tech writer edits #2301

Merged 1 commit on Oct 8, 2019.
80 changes: 46 additions & 34 deletions components/gcp/dataproc/submit_pyspark_job/README.md

# Name
Component: Data preparation using PySpark on Cloud Dataproc


# Labels
Cloud Dataproc, PySpark, Kubeflow


# Summary
A Kubeflow Pipeline component to prepare data by submitting a PySpark job to Cloud Dataproc.

# Facets
<!--Make sure the asset has data for the following facets:
Use case
Technique
Input data type
ML workflow

The data must map to the acceptable values for these facets, as documented on the “taxonomy” sheet of go/aihub-facets
https://gitlab.aihub-content-external.com/aihubbot/kfp-components/commit/fe387ab46181b5d4c7425dcb8032cb43e70411c1
--->
Use case:

Technique:

Input data type:

ML workflow:

# Details
## Intended use
Use this component to run an Apache PySpark job as one preprocessing step in a Kubeflow pipeline.


## Runtime arguments
| Argument | Description | Optional | Data type | Accepted values | Default |
|:----------------------|:------------|:----------|:--------------|:-----------------|:---------|
| project_id | The ID of the Google Cloud Platform (GCP) project that the cluster belongs to. | No | GCPProjectID | - | - |
| region | The Cloud Dataproc region to handle the request. | No | GCPRegion | - | - |
| cluster_name | The name of the cluster to run the job. | No | String | - | - |
| main_python_file_uri | The HCFS URI of the Python file to use as the driver. This must be a .py file. | No | GCSPath | - | - |
| args | The arguments to pass to the driver. Do not include arguments, such as --conf, that can be set as job properties, since a collision may occur that causes an incorrect job submission. | Yes | List | - | None |
| pyspark_job | The payload of a [PySparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PySparkJob). | Yes | Dict | - | None |
| job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Yes | Dict | - | None |
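
For orientation, the sketch below shows one plausible way these arguments map onto a call to the component op that is loaded in the steps further down. The literal values are placeholders for illustration and are not part of the component specification:

```python
# Illustrative placeholder values only; dataproc_submit_pyspark_job_op is the
# op loaded from component.yaml in the steps below, and the call normally
# lives inside a pipeline function.
dataproc_submit_pyspark_job_op(
    project_id='my-project-id',
    region='us-central1',
    cluster_name='my-cluster',
    main_python_file_uri='gs://my-bucket/pyspark/hello-world.py',
    args=['--input', 'gs://my-bucket/data.csv'],
)
```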

## Output
Name | Description | Type
This component creates a PySpark job from the [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit).

Follow these steps to use the component in a pipeline:

1. Install the Kubeflow Pipelines SDK:


    ```python
    %%capture --no-stderr

    KFP_PACKAGE = 'https://storage.googleapis.com/ml-pipeline/release/0.1.14/kfp.tar.gz'
    !pip3 install $KFP_PACKAGE --upgrade
    ```

2. Load the component using the Kubeflow Pipelines SDK:


    ```python
    import kfp.components as comp

    dataproc_submit_pyspark_job_op = comp.load_component_from_url(
        'https://raw.githubusercontent.com/kubeflow/pipelines/e598176c02f45371336ccaa819409e8ec83743df/components/gcp/dataproc/submit_pyspark_job/component.yaml')
    help(dataproc_submit_pyspark_job_op)
    ```

### Sample

The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template.


#### Set up a Dataproc cluster
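
The sample expects an existing Dataproc cluster. If you do not have one, the following is a hedged sketch of creating a cluster from a notebook cell, assuming the `gcloud` CLI is installed and authenticated; the cluster name, region, and project ID are placeholders:

```python
# Hypothetical setup step (not from the original README): create a Dataproc
# cluster with the gcloud CLI. Replace the placeholder cluster name, region,
# and project ID with your own values.
!gcloud dataproc clusters create my-cluster --region=us-central1 --project=my-project-id
```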

Upload your PySpark code file to a Cloud Storage bucket. For example, this is a publicly accessible `hello-world.py` in Cloud Storage:


```python
!gsutil cat gs://dataproc-examples-2f10d78d114f6aaec76462e3c310f31f/src/pyspark/hello-world/hello-world.py
```
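
The driver itself can be a very small script. The following is a hypothetical illustration of a minimal PySpark driver, not the actual contents of the `hello-world.py` file referenced above:

```python
# Hypothetical minimal PySpark driver (illustrative only; the actual
# hello-world.py in the public bucket may differ).
from pyspark.sql import SparkSession

# Start a Spark session, distribute a tiny dataset, and print the result.
spark = SparkSession.builder.appName('hello-world').getOrCreate()
words = spark.sparkContext.parallelize(['Hello,', 'world!'])
print(' '.join(words.collect()))
spark.stop()
```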

#### Set sample parameters


```python
PROJECT_ID = '<Put your project ID here>'
CLUSTER_NAME = '<Put your existing cluster name here>'
REGION = 'us-central1'
PYSPARK_FILE_URI = 'gs://dataproc-examples-2f10d78d114f6aaec76462e3c310f31f/src/pyspark/hello-world/hello-world.py'
ARGS = ''
EXPERIMENT_NAME = 'Dataproc - Submit PySpark Job'
```

#### Example pipeline that uses the component


```python
import kfp.dsl as dsl
import kfp.gcp as gcp
# ... (pipeline definition not shown in this diff) ...

compiler.Compiler().compile(pipeline_func, pipeline_filename)
```
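
Because the diff collapses the body of this example, the following is a hedged reconstruction of what the pipeline might look like. The pipeline name, description, and output filename are assumptions rather than the exact original code; the parameter names follow the runtime arguments documented above and the sample parameters set earlier:

```python
import kfp.dsl as dsl
import kfp.gcp as gcp
import kfp.compiler as compiler

# Hypothetical reconstruction for illustration only.
@dsl.pipeline(
    name='Dataproc submit PySpark job pipeline',
    description='Submits a PySpark job to an existing Cloud Dataproc cluster.'
)
def dataproc_submit_pyspark_job_pipeline(
    project_id=PROJECT_ID,
    region=REGION,
    cluster_name=CLUSTER_NAME,
    main_python_file_uri=PYSPARK_FILE_URI,
    args=ARGS,
):
    # Run the component as a single step, using the GCP service-account secret
    # so the step can call the Dataproc API.
    dataproc_submit_pyspark_job_op(
        project_id=project_id,
        region=region,
        cluster_name=cluster_name,
        main_python_file_uri=main_python_file_uri,
        args=args,
    ).apply(gcp.use_gcp_secret('user-gcp-sa'))

# Compile the pipeline into a package file for submission.
pipeline_func = dataproc_submit_pyspark_job_pipeline
pipeline_filename = pipeline_func.__name__ + '.zip'
compiler.Compiler().compile(pipeline_func, pipeline_filename)
```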

#### Submit the pipeline for execution


```python
#Specify values for the pipeline's arguments
arguments = {}

#Get or create an experiment
import kfp
client = kfp.Client()
experiment = client.create_experiment(EXPERIMENT_NAME)
# ... (pipeline run submission not shown in this diff) ...
```
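
The diff also hides the final submission call. The following is a sketch of that step, assuming the standard `kfp.Client.run_pipeline` API and the `pipeline_filename` produced by the compiler step above:

```python
# Hypothetical completion of the step above: submit the compiled pipeline
# package as a run in the experiment created with kfp.Client().
run_name = pipeline_func.__name__ + ' run'
run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments)
```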