# Integrating Dataproc Serverless with Airflow

## Pre-reqs

## 📁 Folder Structure

```
.
├── LICENSE
├── Makefile
├── Readme.md
├── dags
│   ├── __init__.py
│   ├── event_based_dataproc_serverless_pipeline.py
│   └── utils
│       ├── __init__.py
│       └── cleanup.py
├── dist (temp folder for packaging)
│   ├── dataproc_serverless_airflow_0.1.0.zip
│   └── main.py
├── poetry.lock
├── pyproject.toml
├── src
│   ├── __init__.py
│   ├── dataproc_serverless_airflow
│   │   ├── __init__.py
│   │   ├── read_file.py
│   │   └── save_to_bq.py
│   ├── main.py
│   └── utils
│       ├── __init__.py
│       ├── spark_setup.py
│       └── timer_utils.py
├── stocks.csv
└── stocks1.csv
```
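
For orientation, the job under `src/` follows a read-then-load pattern: `read_file.py` pulls the CSV from GCS and `save_to_bq.py` writes it to BigQuery. The sketch below is an illustration only and makes assumptions about the actual contents of those modules; the bucket names, connector options, and table name are placeholders.

```python
# A minimal sketch, assuming read_file.py loads the CSV and save_to_bq.py writes it to
# BigQuery via the spark-bigquery connector. Not the repo's actual code.
from pyspark.sql import DataFrame, SparkSession


def read_file(spark: SparkSession, gcs_path: str) -> DataFrame:
    """Read a CSV with a header row (e.g. stocks.csv) from the data bucket."""
    return spark.read.option("header", True).option("inferSchema", True).csv(gcs_path)


def save_to_bq(df: DataFrame, table: str, staging_bucket: str) -> None:
    """Write the DataFrame to BigQuery, staging intermediate files in GCS."""
    (
        df.write.format("bigquery")
        .option("table", table)                        # hypothetical table name
        .option("temporaryGcsBucket", staging_bucket)  # the *-staging-* bucket from `make setup`
        .mode("overwrite")
        .save()
    )


if __name__ == "__main__":
    spark = SparkSession.builder.appName("dataproc-serverless-airflow-demo").getOrCreate()
    df = read_file(spark, "gs://serverless-spark-airflow-data-<PROJECT_NUMBER>/stocks.csv")
    save_to_bq(df, "serverless_spark_airflow_demo.stocks",
               "serverless-spark-airflow-staging-<PROJECT_NUMBER>")
```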

## Setup the Example in GCP (Run Once Only)

Edit the `Makefile` and change the following parameters:

```makefile
PROJECT_ID ?= <CHANGEME>
REGION ?= <CHANGEME>
DAG_BUCKET ?= <CHANGEME>

# for example
PROJECT_ID ?= my-gcp-project-1234
REGION ?= europe-west2
DAG_BUCKET ?= my-bucket
```

Then run the following command from the root folder of this repo:

```bash
make setup
```

This will create the following:

- GCS bucket `serverless-spark-airflow-code-repo-<PROJECT_NUMBER>` → our PySpark code gets uploaded here
- GCS bucket `serverless-spark-airflow-staging-<PROJECT_NUMBER>` → used for BQ operations
- GCS bucket `serverless-spark-airflow-data-<PROJECT_NUMBER>` → our source CSV file lives here
- BigQuery dataset `serverless_spark_airflow_demo`
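
If you want to see what `make setup` amounts to without reading the Makefile, a rough Python equivalent using the Google Cloud client libraries is sketched below. This is not the repo's setup code (the real target may use the `gcloud`/`gsutil`/`bq` CLIs instead); project ID, project number, and region are placeholders.

```python
# Hypothetical equivalent of `make setup` (not the repo's actual code); resource names
# mirror the list above, placeholder values must be replaced with your own.
from google.cloud import bigquery, storage

PROJECT_ID = "my-gcp-project-1234"  # placeholder
PROJECT_NUMBER = "123456789012"     # placeholder
REGION = "europe-west2"             # placeholder

storage_client = storage.Client(project=PROJECT_ID)
for suffix in ("code-repo", "staging", "data"):
    storage_client.create_bucket(
        f"serverless-spark-airflow-{suffix}-{PROJECT_NUMBER}", location=REGION
    )

bq_client = bigquery.Client(project=PROJECT_ID)
dataset = bigquery.Dataset(f"{PROJECT_ID}.serverless_spark_airflow_demo")
dataset.location = REGION
bq_client.create_dataset(dataset, exists_ok=True)
```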

## Packaging your Pipeline for GCP

```bash
make build
```
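
Based on the `dist` folder shown above, the build step presumably zips the source package and stages `main.py` next to it so both can be uploaded to the code-repo bucket. A hypothetical Python rendition of that packaging step (the actual Makefile target may differ):

```python
# Hypothetical packaging step: zip the source package for python_file_uris and keep
# main.py alongside it as the batch entry point. Not the repo's actual build logic.
import shutil
from pathlib import Path

Path("dist").mkdir(exist_ok=True)
shutil.make_archive("dist/dataproc_serverless_airflow_0.1.0", "zip", root_dir="src")
shutil.copy("src/main.py", "dist/main.py")
```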

## Edit the DAG

- Edit the file `event_based_dataproc_serverless_pipeline.py` under the `dags` folder
- Set values for the following variables (lines 27 and 28), as sketched below:
  - `PROJECT_NUMBER = ''`
  - `REGION = ''`
## Deploy the DAG to Cloud Composer

```bash
make dags
```
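
Under the hood this just needs the `dags` folder to land in your Composer environment's DAG bucket (`DAG_BUCKET` from the Makefile). A hypothetical Python equivalent, in case you prefer the client library over the Makefile:

```python
# Hypothetical equivalent of `make dags`: upload the local dags/ folder to the
# Composer DAG bucket; Composer picks up new or changed DAG files automatically.
from pathlib import Path

from google.cloud import storage

DAG_BUCKET = "my-bucket"  # placeholder - your Composer environment's DAG bucket

bucket = storage.Client().bucket(DAG_BUCKET)
for path in Path("dags").rglob("*.py"):
    bucket.blob(f"dags/{path.relative_to('dags')}").upload_from_filename(str(path))
```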