This module covers how to use a Vertex AI Workbench Managed Notebook instance for interactive Spark code authoring with Dataproc Serverless Spark interactive sessions. Understanding how to create serverless Spark interactive sessions, and the notebook nuances involved, is crucial for the next module, where you will run actual machine learning experiments.
Dataproc Serverless Spark Interactive provides serverless, Dataproc-managed, autoscaling, private infrastructure for interactive Spark code authoring through a Jupyter notebook hosted on a Vertex AI Managed Notebook instance. The following is an overview of what to expect; detailed instructions with a pictorial overview appear further in this lab guide.
We will analyze Chicago Crimes data in BigQuery from a Jupyter notebook on a Vertex AI Workbench Managed Notebook instance, using Dataproc Serverless Spark interactive sessions.
Goals:
- Understand how to create and attach a Dataproc Serverless Spark interactive session to your Jupyter notebook
- Learn how to switch the Dataproc Serverless Spark interactive session you created between Jupyter notebooks
- Learn to navigate the Dataproc UI for the Serverless Spark interactive session
- Browse the Spark UI of the persistent Spark History Server for the Serverless Spark interactive session
- Learn how to analyze data in BigQuery using the BigQuery Spark connector (a minimal sketch follows this list)
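To preview the last goal, below is a minimal sketch of a notebook cell that reads the public Chicago Crimes table with the BigQuery Spark connector. It assumes the interactive session already provides a SparkSession (Dataproc Serverless interactive sessions expose one as spark) and that the connector is on the runtime's classpath; the table referenced is the bigquery-public-data public dataset:

from pyspark.sql import SparkSession

# In a Dataproc Serverless interactive session a SparkSession already
# exists; getOrCreate() simply reuses it (app name is illustrative)
spark = SparkSession.builder.appName("chicago-crimes-eda").getOrCreate()

# Read the public Chicago Crimes table into a Spark DataFrame via the
# BigQuery Spark connector (assumed to be on the session's classpath)
crimes_df = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.chicago_crime.crime")
    .load()
)

# Example analysis: crime counts by year
crimes_df.groupBy("year").count().orderBy("year").show()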
Prerequisites:
- Ensure that any preview features are allowlisted by product engineering ahead of time
- The provisioning from Module 1 must have completed successfully
Note:
If the notebook is not editable, make a copy and use the copy instead.
Run the following in Cloud Shell, scoped to your project. The values of these variables are needed to create the interactive Spark session; you will paste them into the user interface.
# Capture the active project ID
PROJECT_ID=$(gcloud config list --format "value(core.project)" 2>/dev/null)
# Derive the project number from the project's metadata
PROJECT_NBR=$(gcloud projects describe $PROJECT_ID | grep projectNumber | cut -d':' -f2 | tr -d "'" | xargs)
# Fully qualified name of the user-managed service account (UMSA) provisioned earlier
UMSA_FQN=s8s-lab-sa@$PROJECT_ID.iam.gserviceaccount.com
# Custom container image to use for the Spark session
SPARK_CUSTOM_CONTAINER_IMAGE_URI="gcr.io/$PROJECT_ID/customer_churn_image:1.0.0"
# Dataproc Serverless Spark runtime version
DATAPROC_RUNTIME_VERSION="1.1"
# Print the values you will paste into the session-creation UI
echo "PROJECT_ID=$PROJECT_ID"
echo "PROJECT_NBR=$PROJECT_NBR"
echo "UMSA_FQN=$UMSA_FQN"
echo "SPARK_CUSTOM_CONTAINER_IMAGE_URI=$SPARK_CUSTOM_CONTAINER_IMAGE_URI"
echo "DATAPROC_RUNTIME_VERSION=$DATAPROC_RUNTIME_VERSION"
The author's values, for reference:
PROJECT_ID=gcp-scalable-ml-workshop
PROJECT_NBR=xxx
UMSA_FQN=s8s-lab-sa@gcp-scalable-ml-workshop.iam.gserviceaccount.com
SPARK_CUSTOM_CONTAINER_IMAGE_URI=gcr.io/gcp-scalable-ml-workshop/customer_churn_image:1.0.0
Open JupyterLab as shown below.
Be sure to select the right region in the dropdown.
Note that the variables you ran in Cloud Shell have all the values you need to create the session; copy and paste them where needed.
Click on "submit". In less than 2 minutes, you should see a session created.
Place your cursor in the first cell, then, following the instructions below, run all cells or run each cell sequentially.
Save or discard changes as needed. When prompted as you close the notebook, be sure to choose "keep session".
This concludes the module. In the next module, you will run a complete model training exercise with notebooks: pre-processing, model training, hyperparameter tuning, and batch scoring.