Module-02-Spark-IDE-on-GCP.md
About Module 2

This module covers how to use Vertex AI Workbench's "Managed Notebook Instance" to author Spark code interactively with Dataproc Serverless Spark interactive sessions. Understanding how to create serverless Spark interactive sessions, and the associated notebook nuances, is crucial for the next module, where you will run actual machine learning experiments.


1. About Dataproc Serverless Spark Interactive

Dataproc Serverless Spark Interactive is serverless, Dataproc-managed, autoscaling, private infrastructure for interactive Spark code authoring via a Jupyter notebook hosted on a Vertex AI Managed Notebook instance. The following is an overview of what to expect; further in this lab guide there are detailed instructions with a pictorial overview.

1a. Getting started - what's involved

ABOUT


1b. Creating and using a Serverless Spark Interactive session in a notebook - what's involved

ABOUT


1c. Switching notebooks and reusing the Serverless Spark Interactive session

ABOUT



2. The exercise

We will analyze Chicago Crimes in BigQuery from a Jupyter notebook on a Vertex AI Workbench Managed Notebook Instance, using Dataproc Serverless Spark interactive sessions.

EXERCISE


Goals:

  1. Understand how to create and attach a Dataproc Serverless Spark interactive session to your Jupyter notebook
  2. Learn how to switch a Dataproc Serverless Spark interactive session between Jupyter notebooks
  3. Learn to navigate the Dataproc UI for the Serverless Spark interactive session
  4. Browse the Spark UI of the persistent Spark History Server for the Serverless Spark interactive session
  5. Learn how to analyze data in BigQuery using the Spark BigQuery connector
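As a preview of goal 5, the BigQuery read can be sketched in PySpark as below. This is a minimal sketch, assuming the public `bigquery-public-data.chicago_crime.crime` table and the spark-bigquery connector bundled in the Dataproc runtime; the helper name, app name, and column choice are illustrative, not taken from the lab notebook.

```python
# Minimal sketch of goal 5: reading the public Chicago Crimes table
# through the spark-bigquery connector (bundled in Dataproc runtimes).
# The helper name and column choice are illustrative assumptions.
BQ_TABLE = "bigquery-public-data.chicago_crime.crime"

def top_crime_counts(spark, table=BQ_TABLE, n=10):
    """Return the n most frequent values of primary_type as a DataFrame."""
    crimes = spark.read.format("bigquery").option("table", table).load()
    return (crimes.groupBy("primary_type")
                  .count()
                  .orderBy("count", ascending=False)
                  .limit(n))

if __name__ == "__main__":
    # In the interactive session the kernel already provides `spark`;
    # building one explicitly only matters outside the notebook.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("chicago-crimes").getOrCreate()
    top_crime_counts(spark).show()
```

In the notebook itself you would call `top_crime_counts(spark).show()` directly, since the attached session injects the `spark` object for you.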

Pre-requisite:

  1. Ensure that any preview features are allow-listed by product engineering, ahead of time
  2. The provisioning in Module 1 needs to have completed successfully

Note:
If the notebook is not editable, make a copy and work with the copy.


3. Variables you will need for this module

Run the below in Cloud Shell, scoped to your project. The values of these variables are needed to create the interactive Spark session; you will paste them into the user interface.

```shell
PROJECT_ID=`gcloud config list --format "value(core.project)" 2>/dev/null`
PROJECT_NBR=`gcloud projects describe $PROJECT_ID | grep projectNumber | cut -d':' -f2 | tr -d "'" | xargs`
UMSA_FQN=s8s-lab-sa@$PROJECT_ID.iam.gserviceaccount.com
SPARK_CUSTOM_CONTAINER_IMAGE_URI="gcr.io/$PROJECT_ID/customer_churn_image:1.0.0"
DATAPROC_RUNTIME_VERSION="1.1"

echo "PROJECT_ID=$PROJECT_ID"
echo "PROJECT_NBR=$PROJECT_NBR"
echo "UMSA_FQN=$UMSA_FQN"
echo "SPARK_CUSTOM_CONTAINER_IMAGE_URI=$SPARK_CUSTOM_CONTAINER_IMAGE_URI"
echo "DATAPROC_RUNTIME_VERSION=$DATAPROC_RUNTIME_VERSION"
```

Author's details:

```
PROJECT_ID=gcp-scalable-ml-workshop
PROJECT_NBR=xxx
UMSA_FQN=s8s-lab-sa@gcp-scalable-ml-workshop.iam.gserviceaccount.com
SPARK_CUSTOM_CONTAINER_IMAGE_URI=gcr.io/gcp-scalable-ml-workshop/customer_churn_image:1.0.0
```
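For readers who prefer code over the JupyterLab form, the same values can in principle be supplied to the Dataproc Sessions API. The sketch below is an assumption-heavy illustration using the `google-cloud-dataproc` Python client: the class and field names follow the v1 Sessions API but should be verified against the client version you have installed. The UI route described in this module remains the supported path for the lab.

```python
# Hedged sketch: assembling a Session resource from the Cloud Shell
# variables above, via the google-cloud-dataproc client. Class and
# field names follow the v1 Sessions API but are assumptions to verify.
def session_parent(project_id, region):
    """Build the parent resource path for the Sessions API."""
    return f"projects/{project_id}/locations/{region}"

def build_session_request(project_id, region, image_uri, service_account,
                          runtime_version="1.1"):
    """Return (parent, Session) mirroring the values keyed into the UI."""
    from google.cloud import dataproc_v1
    session = dataproc_v1.Session(
        jupyter_session=dataproc_v1.JupyterConfig(),
        runtime_config=dataproc_v1.RuntimeConfig(
            version=runtime_version,
            container_image=image_uri,
        ),
        environment_config=dataproc_v1.EnvironmentConfig(
            execution_config=dataproc_v1.ExecutionConfig(
                service_account=service_account,
            ),
        ),
    )
    return session_parent(project_id, region), session
```

A `SessionControllerClient.create_session` call on the returned pair would then create the session.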

4. Navigate on the Cloud Console to the Vertex AI Workbench, Managed Notebook Instance

Open JupyterLab as shown below

UMNBS


Be sure to select the right region in the dropdown.

UMNBS



5. Open the Chicago Crimes notebook

UMNBS



6. Click on "Launcher" to create an interactive Spark session

UMNBS



7. Key in, or select from the dropdowns, the details required

Note that the variables run in Cloud Shell have all the values you need to create the session. Copy-paste where needed.

UMNBS


UMNBS


UMNBS


UMNBS


UMNBS


Click on "submit". In less than 2 minutes, you should see a session created.

UMNBS
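Besides the JupyterLab view, the new session can in principle be confirmed from code. The sketch below assumes recent `google-cloud-dataproc` releases expose a `SessionControllerClient` with a `list_sessions` method and a regional endpoint convention; treat these names as assumptions to verify against your installed client, and use the Dataproc UI (covered in the goals above) as the authoritative view.

```python
# Hedged sketch: listing interactive sessions in a region to confirm
# the one just created. Client and method names are assumptions based
# on the v1 Sessions API in recent google-cloud-dataproc releases.
def regional_endpoint(region):
    """Build the regional Dataproc API endpoint for the client."""
    return f"{region}-dataproc.googleapis.com:443"

def active_session_names(project_id, region):
    """Return the resource names of Dataproc sessions in the region."""
    from google.cloud import dataproc_v1
    client = dataproc_v1.SessionControllerClient(
        client_options={"api_endpoint": regional_endpoint(region)}
    )
    parent = f"projects/{project_id}/locations/{region}"
    return [s.name for s in client.list_sessions(parent=parent)]
```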



8. Ensure you have the session you created, selected in the kernel picker dropdown

8.1. The kernel picker - where to find it

UMNBS


UMNBS


UMNBS


UMNBS


UMNBS


8.2. Choosing the interactive Spark kernel

UMNBS



9. Place your cursor in the first cell, then following the instructions below, run all cells or run each cell sequentially

UMNBS


UMNBS



10. Close the notebook once the exercise is completed

Save or discard changes as needed. Be sure to choose "keep session" when prompted as you close the notebook.

UMNBS


UMNBS



This concludes the module. In the next module, you will run a complete model training exercise with notebooks - pre-processing, model training, hyperparameter tuning, and batch scoring.