80 changes: 54 additions & 26 deletions labs/query-enriched-event-data-with-spark/README.md
Spark is a powerful, widely-adopted engine for data processing. It's easy to run …
In this Lab, we'll assume you're working with a standalone Spark cluster running on your computer. However, this and other Lab notebooks can be modified to work with remote Spark clusters as well.


## Running this notebook

There are several ways to run this notebook locally:
- Using the `run.sh` script
- Using [Docker](https://www.docker.com/) with the `run-docker.sh` script
- Manually, using the `conda` CLI

### Running the notebook with `run.sh`

You can use the `run.sh` script to build your environment and run this notebook with a single command.

#### Prerequisite: conda (version 4.4+)

[Anaconda]: https://www.anaconda.com/distribution/
[Miniconda]: https://docs.conda.io/en/latest/miniconda.html

You can install the `conda` CLI by installing [Anaconda] or [Miniconda].
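
You can check your installed version with:

```sh
conda --version   # should report 4.4 or later
```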

#### Running Jupyter Lab

This lab directory contains a handy script for building your conda environment and running Jupyter Lab. To run it, simply use:

```sh
bash bin/run.sh
```

That's it, you're done!
> **Contributor comment:** It took me like a second to realize that once Jupyter was running, I should focus on the query_enriched_event_data_with_spark ipynb file. I'd be curious if this should be called out specifically or if anyone who knows what Jupyter is will know where to go.

### Running this notebook with Docker

If you have [Docker](https://www.docker.com/) installed, you can run PySpark and Jupyter Lab without installing any other dependencies.

Execute `run-docker.sh` in the `./bin` directory to open Jupyter Lab in a Docker container:

```sh
bash bin/run-docker.sh
```

**Note:** Docker makes it easy to get started with PySpark, but it adds overhead and may require [additional configuration](https://docs.docker.com/config/containers/resource_constraints/) to handle large workloads.
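
If you do hit resource limits, one option is to raise the container's limits with standard `docker run` flags. A minimal sketch (the `run-docker.sh` script doesn't currently expose these flags, so you'd add them to its `docker run` invocation or run the image directly; the values here are illustrative):

```sh
docker run -it --rm \
  --memory 8g --cpus 4 \
  -p 8888:8888 \
  -v "$(pwd):/home/jovyan/lab" \
  jupyter/pyspark-notebook
```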

### Running this notebook manually

If you prefer to build and activate your conda environment manually, you can use the `conda` CLI and the environment specification files in the `./lab_env` directory to do so.

#### Prerequisite: conda (version 4.4+)

[Anaconda]: https://www.anaconda.com/distribution/
[Miniconda]: https://docs.conda.io/en/latest/miniconda.html

You can install the `conda` CLI by installing [Anaconda] or [Miniconda].

#### Building and activating your Anaconda environment

Start by building (or updating) and activating your Anaconda environment. This step will install [OpenJDK](https://openjdk.java.net/), [PySpark](https://spark.apache.org/docs/latest/api/python/pyspark.html), [Jupyter Lab](https://jupyter.org/), and other necessary dependencies.

```sh
conda env update --file lab_env/base.yml --name optimizelylabs
conda env update --file lab_env/labs.yml --name optimizelylabs
conda activate optimizelylabs
```

Next, install a Jupyter kernel for this environment:

```sh
python -m ipykernel install --user \
--name optimizelylabs \
--display-name="Python 3 (Optimizely Labs Environment)"
```
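
To verify that the kernel was registered, you can list the installed kernels:

```sh
jupyter kernelspec list   # should show an "optimizelylabs" entry
```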

Finally, start Jupyter Lab in your working directory:

```sh
jupyter lab .
```

## Specifying a custom data directory

The notebook in this lab will load Enriched Event data from `example_data/` in the lab directory. If you wish to load data from another directory, you can use the `OPTIMIZELY_DATA_DIR` environment variable. For example:

```sh
export OPTIMIZELY_DATA_DIR=~/optimizely_data
```

Once `OPTIMIZELY_DATA_DIR` has been set, launch Jupyter Lab using one of the approaches described above. The Lab notebook should load data from your custom directory.
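
For example, a complete session using `run.sh` might look like this (assuming your data lives in `~/optimizely_data`):

```sh
export OPTIMIZELY_DATA_DIR=~/optimizely_data
bash bin/run.sh
```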
11 changes: 11 additions & 0 deletions labs/query-enriched-event-data-with-spark/bin/base.sh
#!/bin/bash

# base.sh
#
# Sets a few useful global variables used by other scripts in the lab bin directory

# Absolute paths to lab directories
SCRIPT_DIR=$(dirname "$0")
export LAB_BASE_DIR=$(cd "$SCRIPT_DIR/.." || return; pwd)
export LAB_ENV_DIR="$LAB_BASE_DIR/lab_env"
export LAB_BIN_DIR="$LAB_BASE_DIR/bin"
52 changes: 52 additions & 0 deletions labs/query-enriched-event-data-with-spark/bin/build.sh
#!/bin/bash

set -e

USAGE="
build.sh your-lab-notebook.ipynb

A handy script for building the lab directory. Does the following:
1. Remove outputs and cell metadata from the passed notebook
2. Configure the default Jupyter kernel used by the passed notebook
3. Use nbconvert to build index.md from the passed notebook
"

if [[ "$#" != 1 || "$1" == "help" ]]; then
echo "$USAGE"
exit 0
fi

SCRIPT_DIR=$(dirname "$0")
. "$SCRIPT_DIR/base.sh"
LAB_BUILD_DIR=$LAB_BASE_DIR/build

# Ensure the passed notebook exists
NB="$1"
if [[ ! -f "$NB" ]]; then
echo "Error: $NB does not exist"
exit 1
fi

# Create conda build environment
export INSTALL_BUILD_DEPENDENCIES=true
. "$LAB_BIN_DIR/env.sh"

# Backup passed notebook
echo "Backing up $NB to $LAB_BUILD_DIR/backup.ipynb"
mkdir -p "$LAB_BUILD_DIR"
cp "$NB" "$LAB_BUILD_DIR/backup.ipynb"

# 1. Remove outputs and cell metadata from the passed notebook
echo "Removing outputs and cell metadata from $NB"
nbstripout "$NB"

# 2. Configure the default Jupyter kernel used by $NB
echo "Configuring the default Jupyter kernel used by $NB"
KERNELSPEC_PATH=".metadata.kernelspec"
KERNELSPEC='{"name":"optimizelylabs", "language":"python", "display_name":"Python 3 (Optimizely Labs)"}'
UPDATED_TEMP_NB="$LAB_BUILD_DIR/with_kernelspec_updated.ipynb"
jq "$KERNELSPEC_PATH = $KERNELSPEC" "$NB" > "$UPDATED_TEMP_NB"
cp "$UPDATED_TEMP_NB" "$NB"

# 3. Use nbconvert to build index.md from the passed notebook
jupyter nbconvert --execute --to markdown --output "$LAB_BASE_DIR/index.md" "$NB"
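
For reference, the usage string above implies an invocation like the following, run from the Lab directory with this Lab's notebook as the argument:

```sh
bash bin/build.sh query_enriched_event_data_with_spark.ipynb
```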
45 changes: 45 additions & 0 deletions labs/query-enriched-event-data-with-spark/bin/env.sh
#!/bin/bash

# env.sh
#
# A handy script for building and activating the conda environment required to run
# the lab notebook

SCRIPT_DIR=$(dirname "$0")
. "$SCRIPT_DIR/base.sh"

CONDA_ENV_NAME=optimizelylabs
BASE_ENV="$LAB_ENV_DIR/base.yml"
DOCKER_BASE_ENV="$LAB_ENV_DIR/docker_base.yml"
LABS_ENV="$LAB_ENV_DIR/labs.yml"
BUILD_ENV="$LAB_ENV_DIR/build.yml"

# Ensure we can use conda activate
CONDA_BASE=$(conda info --base)
source "$CONDA_BASE/etc/profile.d/conda.sh"

# Create or update the conda environment
echo "Creating conda environment $CONDA_ENV_NAME"
if [[ -n "${IN_DOCKER_CONTAINER:-}" ]]; then
echo "Running in a docker container; installing docker base dependencies"
conda env update --file "$DOCKER_BASE_ENV" --name "$CONDA_ENV_NAME"
else
echo "Not running in a docker container; installing base dependencies"
conda env update --file "$BASE_ENV" --name "$CONDA_ENV_NAME"
fi

echo "Installing Optimizely Labs dependencies"
conda env update --file "$LABS_ENV" --name "$CONDA_ENV_NAME"

if [[ -n "${INSTALL_BUILD_DEPENDENCIES:-}" ]]; then
echo "Installing build dependencies"
conda env update --file "$BUILD_ENV" --name "$CONDA_ENV_NAME"
fi

# Activate conda environment
echo "Activating conda environment $CONDA_ENV_NAME"
conda activate "$CONDA_ENV_NAME"

# Install an ipython kernel
echo "Installing ipython kernel $CONDA_ENV_NAME"
python -m ipykernel install --user --name "$CONDA_ENV_NAME" --display-name="Python 3 (Optimizely Labs Environment)"
42 changes: 42 additions & 0 deletions labs/query-enriched-event-data-with-spark/bin/run-docker.sh
#!/bin/bash

# run-docker.sh
#
# A handy script for running the lab notebook locally in a docker container

set -e

# Use the script path to build an absolute path for the Lab's base directory
SCRIPT_DIR=$(dirname "$0")
. "$SCRIPT_DIR/base.sh"

# The Lab directory should be mounted in ~/lab in the container
CONTAINER_HOME=/home/jovyan
CONTAINER_LAB_BASE_DIR="$CONTAINER_HOME/lab"
CONTAINER_LAB_BIN_DIR="$CONTAINER_LAB_BASE_DIR/bin"

# If OPTIMIZELY_DATA_DIR is defined, mount the specified data directory in
# the container and set the container OPTIMIZELY_DATA_DIR envar accordingly
echo "Starting docker container"
if [[ -n "${OPTIMIZELY_DATA_DIR:-}" ]]; then
CONTAINER_DATA_DIR="$CONTAINER_HOME/optimizely_data"
echo "OPTIMIZELY_DATA_DIR envar set. Mapping to $CONTAINER_DATA_DIR"

docker run -it --rm \
-p 8888:8888 \
-v "$LAB_BASE_DIR:$CONTAINER_LAB_BASE_DIR" \
-v "$OPTIMIZELY_DATA_DIR:$CONTAINER_DATA_DIR" \
-e "IN_DOCKER_CONTAINER=true" \
-e "OPTIMIZELY_DATA_DIR=$CONTAINER_DATA_DIR" \
jupyter/pyspark-notebook \
bash "$CONTAINER_LAB_BIN_DIR/run.sh"
else
docker run -it --rm \
-p 8888:8888 \
-v "$LAB_BASE_DIR:$CONTAINER_LAB_BASE_DIR" \
-e "IN_DOCKER_CONTAINER=true" \
jupyter/pyspark-notebook \
bash "$CONTAINER_LAB_BIN_DIR/run.sh"
fi


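As a usage sketch, setting `OPTIMIZELY_DATA_DIR` before invoking the script takes the first branch above and mounts your data directory into the container:

```sh
export OPTIMIZELY_DATA_DIR=~/optimizely_data
bash bin/run-docker.sh   # data is mounted at /home/jovyan/optimizely_data in the container
```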
21 changes: 21 additions & 0 deletions labs/query-enriched-event-data-with-spark/bin/run.sh
#!/bin/bash

# run.sh
#
# A handy script for running the Lab notebook locally using a conda environment

set -e

SCRIPT_DIR=$(dirname "$0")
. "$SCRIPT_DIR/base.sh"
. "$LAB_BIN_DIR/env.sh"

if [[ -z "${OPTIMIZELY_DATA_DIR:-}" ]]; then
echo "Note: If you'd like to run this notebook using data stored in a different directory, make sure"
echo " to set the OPTIMIZELY_DATA_DIR environment variable first. For example:"
echo " export OPTIMIZELY_DATA_DIR=~/optimizely_data"
fi

# Run Jupyter Lab
echo "Running Jupyter Lab in $LAB_BASE_DIR"
jupyter lab "$LAB_BASE_DIR"
16 changes: 0 additions & 16 deletions labs/query-enriched-event-data-with-spark/environment.yml

This file was deleted.

15 changes: 10 additions & 5 deletions labs/query-enriched-event-data-with-spark/index.md
This Lab covers some of the basics of working with event-level experiment data.

This guide borrows some initialization code from the [Spark SQL getting started guide](https://spark.apache.org/docs/latest/sql-getting-started.html).


## Creating a Spark Session


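The session-creation code is collapsed in this diff view; a minimal sketch of a typical initialization, following the Spark SQL getting started guide linked above (the app name is illustrative):

```python
from pyspark.sql import SparkSession

# Create (or reuse) a local Spark session
spark = SparkSession.builder \
    .appName("query-enriched-event-data") \
    .getOrCreate()
```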
If `OPTIMIZELY_DATA_DIR` is not set, data will be loaded from `./example_data` in your working directory.
```python
import os

base_data_dir = os.environ.get("OPTIMIZELY_DATA_DIR", "./example_data")

def read_data(path, view_name):
"""Read parquet data from the supplied path and create a corresponding temporary view with Spark."""
    # ...function body collapsed in the diff view...
```

## How to run this notebook

This notebook lives in the [Optimizely Labs](http://github.com/optimizely/labs) repository. You can download it and everything you need to run it by doing one of the following:
- Downloading a zipped copy of this Lab directory from the [Optimizely Labs page](https://www.optimizely.com/labs/computing-experiment-subjects/)
- Downloading a [zipped copy of the Optimizely Labs repository](https://github.com/optimizely/labs/archive/master.zip) from GitHub
- Cloning the [GitHub repository](http://github.com/optimizely/labs)

Once you've downloaded this Lab directory (on its own, or as part of the [Optimizely Labs](http://github.com/optimizely/labs) repository), follow the instructions in the `README.md` file for this Lab.
9 changes: 9 additions & 0 deletions labs/query-enriched-event-data-with-spark/lab_env/base.yml
dependencies:
- python=3.7.6
- openjdk=8
- ipykernel=5.3.4
- jupyter=1.0.0
- jupyterlab=2.2.2
- pyspark=3.0.0
channels:
- conda-forge
4 changes: 4 additions & 0 deletions labs/query-enriched-event-data-with-spark/lab_env/build.yml
dependencies:
- nbstripout
channels:
- conda-forge
5 changes: 5 additions & 0 deletions labs/query-enriched-event-data-with-spark/lab_env/docker_base.yml
dependencies:
- python=3.7.6
- ipykernel=5.3.4
channels:
- conda-forge
7 changes: 7 additions & 0 deletions labs/query-enriched-event-data-with-spark/lab_env/labs.yml
dependencies:
- pip=20.2.1
- plotly=4.9.0
- pip:
- ssrm-test
channels:
- conda-forge