diff --git a/dataproc/README.md b/dataproc/README.md
index 6bf819ccfa55..ab38fbac6f66 100644
--- a/dataproc/README.md
+++ b/dataproc/README.md
@@ -2,15 +2,23 @@
 Sample command-line programs for interacting with the Cloud Dataproc API.
+
+Please see [the tutorial on using the Dataproc API with the Python client
+library](https://cloud.google.com/dataproc/docs/tutorials/python-library-example)
+for more information.
+
 Note that while this sample demonstrates interacting with Dataproc via the API, the functionality
 demonstrated here could also be accomplished using the Cloud Console or the gcloud CLI.
 `list_clusters.py` is a simple command-line program to demonstrate connecting to the Dataproc API and listing the clusters in a region
-`create_cluster_and_submit_job.py` demonstrates how to create a cluster, submit the
+`create_cluster_and_submit_job.py` demonstrates how to create a cluster, submit the
 `pyspark_sort.py` job, download the output from Google Cloud Storage, and output the result.
+
+`pyspark_sort_gcs.py` is the same as `pyspark_sort.py` but demonstrates
+reading from a GCS bucket.
+
 ## Prerequisites to run locally:
 * [pip](https://pypi.python.org/pypi/pip)
@@ -19,50 +27,59 @@
 Go to the [Google Cloud Console](https://console.cloud.google.com).
 Under API Manager, search for the Google Cloud Dataproc API and enable it.
+## Set Up Your Local Dev Environment
-# Set Up Your Local Dev Environment
 To install, run the following commands. If you want to use [virtualenv](https://virtualenv.readthedocs.org/en/latest/) (recommended), run the commands within a virtualenv.
 * pip install -r requirements.txt
-Create local credentials by running the following command and following the oauth2 flow:
+## Authentication
+
+Please see the [Google Cloud authentication guide](https://cloud.google.com/docs/authentication/).
+The recommended approach for running these samples is to use a service account with a JSON key.
+
+## Environment Variables
-    gcloud beta auth application-default login
+Set the following environment variables:
+
+    GOOGLE_CLOUD_PROJECT=your-project-id
+    REGION=us-central1 # or your region
+    CLUSTER_NAME=your-cluster-name
+    ZONE=us-central1-b
+
+## Running the samples
 To run list_clusters.py:
-    python list_clusters.py --region=us-central1
+    python list_clusters.py $GOOGLE_CLOUD_PROJECT --region=$REGION
+`submit_job_to_cluster.py` can create the Dataproc cluster or use an existing one.
+If you'd like to create a cluster ahead of time, either use the
+[Cloud Console](https://console.cloud.google.com) or run:
-To run submit_job_to_cluster.py, first create a GCS bucket, from the Cloud Console or with
-gsutil:
+    gcloud dataproc clusters create your-cluster-name
-    gsutil mb gs://
-
-Then, if you want to rely on an existing cluster, run:
-
-    python submit_job_to_cluster.py --project_id= --zone=us-central1-b --cluster_name=testcluster --gcs_bucket=
-
-Otherwise, if you want the script to create a new cluster for you:
+To run submit_job_to_cluster.py, first create a GCS bucket (used by Dataproc to stage files) from the Cloud Console or with
+gsutil:
-    python submit_job_to_cluster.py --project_id= --zone=us-central1-b --cluster_name=testcluster --gcs_bucket= --create_new_cluster
+    gsutil mb gs://
+Then set the following environment variables:
-This will setup a cluster, upload the PySpark file, submit the job, print the result, then
-delete the cluster.
+    BUCKET=your-staging-bucket
+    CLUSTER=your-cluster-name
-You can optionally specify a `--pyspark_file` argument to change from the default
-`pyspark_sort.py` included in this script to a new script.
+Then, if you want to rely on an existing cluster, run:
-## Running on GCE, GAE, or other environments
+    python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=$ZONE --cluster_name=$CLUSTER --gcs_bucket=$BUCKET
-On Google App Engine, the credentials should be found automatically.
+Otherwise, if you want the script to create a new cluster for you:
-On Google Compute Engine, the credentials should be found automatically, but require that
-you create the instance with the correct scopes.
+    python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=$ZONE --cluster_name=$CLUSTER --gcs_bucket=$BUCKET --create_new_cluster
-    gcloud compute instances create --scopes="https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/compute,https://www.googleapis.com/auth/compute.readonly" test-instance
+This will set up a cluster, upload the PySpark file, submit the job, print the result, then
+delete the cluster.
-If you did not create the instance with the right scopes, you can still upload a JSON service
-account and set `GOOGLE_APPLICATION_CREDENTIALS`. See [Google Application Default Credentials](https://developers.google.com/identity/protocols/application-default-credentials) for more details.
+You can optionally specify a `--pyspark_file` argument to change from the default
+`pyspark_sort.py` included in this script to a new script.
diff --git a/dataproc/pyspark_sort.py b/dataproc/pyspark_sort.py
index 14b66995380e..518e906eeaf4 100644
--- a/dataproc/pyspark_sort.py
+++ b/dataproc/pyspark_sort.py
@@ -24,5 +24,5 @@
 sc = pyspark.SparkContext()
 rdd = sc.parallelize(['Hello,', 'world!', 'dog', 'elephant', 'panther'])
 words = sorted(rdd.collect())
-print words
+print(words)
 # [END pyspark]
diff --git a/dataproc/pyspark_sort_gcs.py b/dataproc/pyspark_sort_gcs.py
new file mode 100644
index 000000000000..70f77d8dc6c6
--- /dev/null
+++ b/dataproc/pyspark_sort_gcs.py
@@ -0,0 +1,30 @@
+#!/usr/bin/env python
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+""" Sample pyspark script to be uploaded to Cloud Storage and run on
+Cloud Dataproc.
+
+Note this file is not intended to be run directly, but run inside a PySpark
+environment.
+
+This file demonstrates how to read from a GCS bucket. See README.md for more
+information.
+"""
+
+# [START pyspark]
+import pyspark
+
+sc = pyspark.SparkContext()
+rdd = sc.textFile('gs://path-to-your-GCS-file')
+print(sorted(rdd.collect()))
+# [END pyspark]
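A minimal sketch of how the new `pyspark_sort_gcs.py` sample might be run, assuming the environment variables from the README are set, that the `gs://` path inside the script has been edited to point at a real text file, and that `--pyspark_file` accepts a local script path the same way the default `pyspark_sort.py` does (the file and bucket names below are placeholders):

    # Stage a text file for the job to read (placeholder names).
    gsutil cp your-input.txt gs://your-staging-bucket/your-input.txt

    # Submit the GCS-reading sample instead of the default pyspark_sort.py.
    python submit_job_to_cluster.py \
        --project_id=$GOOGLE_CLOUD_PROJECT --zone=$ZONE \
        --cluster_name=$CLUSTER --gcs_bucket=$BUCKET \
        --pyspark_file=pyspark_sort_gcs.py

Alternatively, with an existing cluster the script can be submitted directly, for example with `gcloud dataproc jobs submit pyspark pyspark_sort_gcs.py --cluster=$CLUSTER`.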