tech writer edits #2301

Merged 1 commit on Oct 8, 2019.
80 changes: 46 additions & 34 deletions components/gcp/dataproc/submit_pyspark_job/README.md

# Name
Component: Data preparation using PySpark on Cloud Dataproc


# Labels
Cloud Dataproc, PySpark, Kubeflow


# Summary
A Kubeflow Pipeline component to prepare data by submitting a PySpark job to Cloud Dataproc.

# Facets
<!--Make sure the asset has data for the following facets:
Use case
Technique
Input data type
ML workflow

The data must map to the acceptable values for these facets, as documented on the “taxonomy” sheet of go/aihub-facets
https://gitlab.aihub-content-external.com/aihubbot/kfp-components/commit/fe387ab46181b5d4c7425dcb8032cb43e70411c1
--->
Use case:

Technique:

Input data type:

ML workflow:

# Details
## Intended use
Use this component to run an Apache PySpark job as one preprocessing step in a Kubeflow pipeline.


## Runtime arguments
| Argument | Description | Optional | Data type | Accepted values | Default |
|:----------------------|:------------|:----------|:--------------|:-----------------|:---------|
| project_id | The ID of the Google Cloud Platform (GCP) project that the cluster belongs to. | No | GCPProjectID | - | - |
| region | The Cloud Dataproc region to handle the request. | No | GCPRegion | - | - |
| cluster_name | The name of the cluster to run the job. | No | String | - | - |
| main_python_file_uri | The HCFS URI of the Python file to use as the driver. This must be a .py file. | No | GCSPath | - | - |
| args | The arguments to pass to the driver. Do not include arguments, such as --conf, that can be set as job properties, since a collision may occur that causes an incorrect job submission. | Yes | List | - | None |
| pyspark_job | The payload of a [PySparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PySparkJob). | Yes | Dict | - | None |
| job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Yes | Dict | - | None |
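
For orientation, the sketch below shows one plausible way these arguments map onto a call to the component op that is loaded in the steps further down. The literal values are placeholders for illustration and are not part of the component specification:

```python
# Illustrative placeholder values only; dataproc_submit_pyspark_job_op is the
# op loaded from component.yaml in the steps below, and the call normally
# lives inside a pipeline function.
dataproc_submit_pyspark_job_op(
    project_id='my-project-id',
    region='us-central1',
    cluster_name='my-cluster',
    main_python_file_uri='gs://my-bucket/pyspark/hello-world.py',
    args=['--input', 'gs://my-bucket/data.csv'],
)
```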

## Output
Name | Description | Type
This component creates a PySpark job from the [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit).

Follow these steps to use the component in a pipeline:

1. Install the Kubeflow Pipelines SDK:


    ```python
    %%capture --no-stderr

    KFP_PACKAGE = 'https://storage.googleapis.com/ml-pipeline/release/0.1.14/kfp.tar.gz'
    !pip3 install $KFP_PACKAGE --upgrade
    ```

2. Load the component using the Kubeflow Pipelines SDK:


    ```python
    import kfp.components as comp

    dataproc_submit_pyspark_job_op = comp.load_component_from_url(
        'https://raw.githubusercontent.com/kubeflow/pipelines/e598176c02f45371336ccaa819409e8ec83743df/components/gcp/dataproc/submit_pyspark_job/component.yaml')
    help(dataproc_submit_pyspark_job_op)
    ```

### Sample

The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template.


#### Set up a Dataproc cluster
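
The sample expects an existing Dataproc cluster. If you do not have one, the following is a hedged sketch of creating a cluster from a notebook cell, assuming the `gcloud` CLI is installed and authenticated; the cluster name, region, and project ID are placeholders:

```python
# Hypothetical setup step (not from the original README): create a Dataproc
# cluster with the gcloud CLI. Replace the placeholder cluster name, region,
# and project ID with your own values.
!gcloud dataproc clusters create my-cluster --region=us-central1 --project=my-project-id
```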

Upload your PySpark code file to a Cloud Storage bucket. For example, this is a publicly accessible `hello-world.py` in Cloud Storage:


```python
!gsutil cat gs://dataproc-examples-2f10d78d114f6aaec76462e3c310f31f/src/pyspark/hello-world/hello-world.py
```
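
The driver itself can be a very small script. The following is a hypothetical illustration of a minimal PySpark driver, not the actual contents of the `hello-world.py` file referenced above:

```python
# Hypothetical minimal PySpark driver (illustrative only; the actual
# hello-world.py in the public bucket may differ).
from pyspark.sql import SparkSession

# Start a Spark session, distribute a tiny dataset, and print the result.
spark = SparkSession.builder.appName('hello-world').getOrCreate()
words = spark.sparkContext.parallelize(['Hello,', 'world!'])
print(' '.join(words.collect()))
spark.stop()
```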

#### Set sample parameters


```python
PROJECT_ID = '<Put your project ID here>'
CLUSTER_NAME = '<Put your existing cluster name here>'
REGION = 'us-central1'
PYSPARK_FILE_URI = 'gs://dataproc-examples-2f10d78d114f6aaec76462e3c310f31f/src/pyspark/hello-world/hello-world.py'
ARGS = ''
EXPERIMENT_NAME = 'Dataproc - Submit PySpark Job'
```

#### Example pipeline that uses the component


```python
import kfp.dsl as dsl
import kfp.gcp as gcp
# ... (pipeline definition not shown in this diff) ...

compiler.Compiler().compile(pipeline_func, pipeline_filename)
```
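
Because the diff collapses the body of this example, the following is a hedged reconstruction of what the pipeline might look like. The pipeline name, description, and output filename are assumptions rather than the exact original code; the parameter names follow the runtime arguments documented above and the sample parameters set earlier:

```python
import kfp.dsl as dsl
import kfp.gcp as gcp
import kfp.compiler as compiler

# Hypothetical reconstruction for illustration only.
@dsl.pipeline(
    name='Dataproc submit PySpark job pipeline',
    description='Submits a PySpark job to an existing Cloud Dataproc cluster.'
)
def dataproc_submit_pyspark_job_pipeline(
    project_id=PROJECT_ID,
    region=REGION,
    cluster_name=CLUSTER_NAME,
    main_python_file_uri=PYSPARK_FILE_URI,
    args=ARGS,
):
    # Run the component as a single step, using the GCP service-account secret
    # so the step can call the Dataproc API.
    dataproc_submit_pyspark_job_op(
        project_id=project_id,
        region=region,
        cluster_name=cluster_name,
        main_python_file_uri=main_python_file_uri,
        args=args,
    ).apply(gcp.use_gcp_secret('user-gcp-sa'))

# Compile the pipeline into a package file for submission.
pipeline_func = dataproc_submit_pyspark_job_pipeline
pipeline_filename = pipeline_func.__name__ + '.zip'
compiler.Compiler().compile(pipeline_func, pipeline_filename)
```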

#### Submit the pipeline for execution


```python
#Specify values for the pipeline's arguments
arguments = {}

#Get or create an experiment
import kfp
client = kfp.Client()
experiment = client.create_experiment(EXPERIMENT_NAME)
# ... (pipeline run submission not shown in this diff) ...
```
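
The diff also hides the final submission call. The following is a sketch of that step, assuming the standard `kfp.Client.run_pipeline` API and the `pipeline_filename` produced by the compiler step above:

```python
# Hypothetical completion of the step above: submit the compiled pipeline
# package as a run in the experiment created with kfp.Client().
run_name = pipeline_func.__name__ + ' run'
run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments)
```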