To run the notebooks using Spark local mode on a server with one or more NVIDIA GPUs:

1. Follow the installation instructions to set up your environment.
2. Install `jupyter` into the conda environment.
   ```bash
   pip install jupyter
   ```
3. Set `SPARK_HOME`.
   ```bash
   export SPARK_HOME=$( pip show pyspark | grep Location | grep -o '/.*' )/pyspark
   ls $SPARK_HOME/bin/pyspark
   ```
4. In the notebooks directory, start PySpark in local mode with the Jupyter UI.
   ```bash
   cd spark-rapids-ml/notebooks
   PYSPARK_DRIVER_PYTHON=jupyter \
   PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=0.0.0.0' \
   CUDA_VISIBLE_DEVICES=0 \
   $SPARK_HOME/bin/pyspark --master local[12] \
   --driver-memory 128g \
   --conf spark.sql.execution.arrow.pyspark.enabled=true
   ```
5. Follow the instructions printed by the above command to browse to the Jupyter notebook server.
6. In the Jupyter file browser, open and run any of the notebooks; an illustrative sketch of the kind of cell they contain follows this list.
7. OPTIONAL: If your server is remote with no direct `http` access, but you have `ssh` access, you can connect via an `ssh` tunnel, as follows:
   ```bash
   export REMOTE_USER=<your_remote_username>
   export REMOTE_HOST=<your_remote_hostname>
   ssh -A -L 8888:127.0.0.1:8888 -L 4040:127.0.0.1:4040 ${REMOTE_USER}@${REMOTE_HOST}
   ```
   Then, browse to the `127.0.0.1` URL printed by the command in step 4. Note that a tunnel is also opened to the Spark UI server on port 4040. Once a notebook is opened, you can view the Spark UI by browsing to http://127.0.0.1:4040 in another tab or window.
8. OPTIONAL: If you have multiple GPUs in your server, replace the `CUDA_VISIBLE_DEVICES` setting in step 4 with a comma-separated list of the corresponding indices. For example, for two GPUs, use `CUDA_VISIBLE_DEVICES=0,1`.
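The notebooks drive the GPU-accelerated estimators through the familiar `pyspark.ml` fit/transform pattern. As a minimal sketch of the kind of cell you will encounter (illustrative only: the toy data and the choice of `KMeans` are this guide's assumptions, not the contents of any particular notebook):

```python
# Illustrative sketch of a notebook cell, assuming the `spark` session
# created by the launch command in step 4 is already available.
from spark_rapids_ml.clustering import KMeans

# A tiny toy dataset with an array-of-floats feature column.
df = spark.createDataFrame(
    [([0.0, 0.0],), ([1.0, 1.0],), ([9.0, 8.0],), ([8.0, 9.0],)],
    ["features"],
)

# Same estimator/model API as pyspark.ml, but fitting runs on the GPU(s)
# exposed via CUDA_VISIBLE_DEVICES.
kmeans = KMeans().setK(2).setFeaturesCol("features")
model = kmeans.fit(df)
model.transform(df).show()
```

Because `spark_rapids_ml` mirrors the `pyspark.ml` API, the notebooks read like standard Spark ML code.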
See these instructions for running the notebooks in a Databricks Spark cluster.