PySpark

Prerequisites

Docker

Install Docker by following the installation guide for your platform on the Docker website.

Once Docker is installed, run the command below to verify the Docker engine is running and that you're able to pull the image needed for the tutorial.
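
docker pull apache/hive:4.0.0-beta-1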

nvm/node

The tutorial requires Node.js version 18 or above, which can be installed using nvm (Node Version Manager). To check if you already have nvm installed on your machine, run

nvm --version

If nvm is not installed, you can run the commands below, taken from this Medium article.

brew update
brew install nvm
mkdir -p ~/.nvm

echo "export NVM_DIR=~/.nvm\nsource \$(brew --prefix nvm)/nvm.sh" >> .zshrc
source ~/.zshrc
nvm --version

Once nvm is installed, you can install and use any version of Node.js 18 or above.

nvm install 18
nvm use 18
node --version

Setup

From the pyspark folder

  1. Start the Hive container, which is automatically seeded with example table schemas

    docker-compose -f ./docker/docker-compose.yml up -d 
  2. Create and activate a virtual environment, install dependencies

    The gable CLI requires the active Python environment to have the PySpark job's Python dependencies installed. For this tutorial, we're creating a new Python virtual environment, activating it, and installing the PySpark job's requirements, which are defined in the requirements.txt file.

    python3 -m venv ".venv"
    source ".venv/bin/activate"
    pip3 install --pre -r requirements.txt
  3. Set your Gable API Key

    Log into Gable and navigate to Settings -> API Keys. Copy the API endpoint & API key values, and run the following in your terminal window

    export GABLE_API_ENDPOINT=<copied_api_endpoint>
    export GABLE_API_KEY=<copied_api_key>

Register PySpark Job Output Tables

Once the setup is complete, you're ready to register the PySpark job's output tables & their schemas!

The gable CLI needs to know the schemas of any tables the PySpark job reads from in order to compute the final output schema(s). There are currently two methods for providing the input schemas: a connection to your Hive cluster, which allows the CLI to query the information schema, or a CSV file containing the relevant schemas.

Hive

gable data-asset register --source-type pyspark \
  --project-root . \
  --spark-job-entrypoint "job.py --final_output_table pnw_bookings_30_days" \
  --connection-string hive://localhost:10000

--project-root: The path to the root of the Python project containing the PySpark job to run

--spark-job-entrypoint: The name of the entrypoint script for the PySpark job, as well as any arguments needed to run the job. If your Spark job uses config values from SparkConf, you can set them using the normal Spark syntax of --conf spark.my_config_key=config_value (see the example after these option descriptions).

--connection-string: The SQLAlchemy connection string to connect to your Hive instance. Knowing the schemas of the Spark job's input tables is required to compute the job's final output schemas.
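
For example, if the job reads a config value from SparkConf, one way to pass it is to include the --conf flag in the entrypoint string alongside the job's own arguments. This is a sketch following the syntax described above; spark.my_config_key and config_value are placeholders for your job's actual config key and value.

gable data-asset register --source-type pyspark \
  --project-root . \
  --spark-job-entrypoint "job.py --final_output_table pnw_bookings_30_days --conf spark.my_config_key=config_value" \
  --connection-string hive://localhost:10000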

csv

gable data-asset register --source-type pyspark \
  --project-root . \
  --spark-job-entrypoint "job.py --final_output_table pnw_bookings_30_days" \
  --csv-schema-file schemas.csv

--project-root: The path to the root of the Python project containing the PySpark job to run

--spark-job-entrypoint: The name of the entrypoint script for the PySpark job, as well as any arguments needed to run the job. If your Spark job uses config values from SparkConf, you can set them using the normal Spark syntax of --conf spark.my_config_key=config_value.

--csv-schema-file: A CSV file containing the schemas of all tables the PySpark job reads from, with the header row

  • schema_table,col_name,col_type
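
Below is a minimal sketch of what schemas.csv might contain. The table and column names are illustrative only, and the schema_table value is assumed here to be the qualified <schema>.<table> name; replace these rows with the actual input tables your PySpark job reads.

schema_table,col_name,col_type
bookings.trips,trip_id,bigint
bookings.trips,region,string
bookings.trips,booked_at,timestamp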