- Install Python Poetry
```
.
├── LICENSE
├── Makefile
├── Readme.md
├── dags
│   ├── __init__.py
│   ├── event_based_dataproc_serverless_pipeline.py
│   └── utils
│       ├── __init__.py
│       └── cleanup.py
├── dist (temp folder for packaging)
│   ├── dataproc_serverless_airflow_0.1.0.zip
│   └── main.py
├── poetry.lock
├── pyproject.toml
├── src
│   ├── __init__.py
│   ├── dataproc_serverless_airflow
│   │   ├── __init__.py
│   │   ├── read_file.py
│   │   └── save_to_bq.py
│   ├── main.py
│   └── utils
│       ├── __init__.py
│       ├── spark_setup.py
│       └── timer_utils.py
├── stocks.csv
└── stocks1.csv
```
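For orientation, the PySpark entry point (`src/main.py`, with the work split across `read_file.py` and `save_to_bq.py` in the actual package) presumably does something along these lines. This is a hypothetical sketch with assumed bucket, file, and table names, not the repo's code:

```python
# Hypothetical sketch of the PySpark job: read the stocks CSV from GCS and write it to BigQuery.
# Bucket, file, and table names below are illustrative assumptions.
from pyspark.sql import SparkSession


def main():
    spark = SparkSession.builder.appName("dataproc-serverless-airflow").getOrCreate()

    # Read the source CSV from the data bucket created by `make setup`.
    df = (
        spark.read.option("header", True)
        .csv("gs://serverless-spark-airflow-data-<PROJECT_NUMBER>/stocks.csv")
    )

    # Write to BigQuery via the Spark BigQuery connector (available on Dataproc Serverless
    # runtimes), staging intermediate files through the staging bucket.
    (
        df.write.format("bigquery")
        .option("table", "serverless_spark_airflow_demo.stocks")
        .option("temporaryGcsBucket", "serverless-spark-airflow-staging-<PROJECT_NUMBER>")
        .mode("overwrite")
        .save()
    )


if __name__ == "__main__":
    main()
```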
Edit the `Makefile` and change the following params:

```
PROJECT_ID ?= <CHANGEME>
REGION ?= <CHANGEME>
DAG_BUCKET ?= <CHANGEME>

# for example
PROJECT_ID ?= my-gcp-project-1234
REGION ?= europe-west2
DAG_BUCKET ?= my-bucket
```
Then run the following command from the root folder of this repo:

```
make setup
```
This will create the following:

- GCS bucket `serverless-spark-airflow-code-repo-<PROJECT_NUMBER>` → our PySpark code gets uploaded here
- GCS bucket `serverless-spark-airflow-staging-<PROJECT_NUMBER>` → used for BQ operations
- GCS bucket `serverless-spark-airflow-data-<PROJECT_NUMBER>` → our source CSV file lives here
- BigQuery dataset `serverless_spark_airflow_demo`
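If you want to confirm these resources exist before moving on, a quick check from Python could look like the optional sketch below (not part of the repo; it assumes the `google-cloud-storage` and `google-cloud-bigquery` client libraries and default credentials are available):

```python
# Optional sanity check: verify the buckets and dataset created by `make setup`.
from google.cloud import bigquery, storage

PROJECT_NUMBER = "123456789012"  # replace with your numeric project number

storage_client = storage.Client()
for suffix in ("code-repo", "staging", "data"):
    name = f"serverless-spark-airflow-{suffix}-{PROJECT_NUMBER}"
    print(name, "exists:", storage_client.bucket(name).exists())

bq_client = bigquery.Client()
dataset = bq_client.get_dataset("serverless_spark_airflow_demo")  # raises NotFound if absent
print("dataset:", dataset.full_dataset_id)
```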
Next, run:

```
make build
```
- Edit the file `event_based_dataproc_serverless_pipeline.py` under the `dags` folder
- Set values for the following vars (lines 27 and 28 of that file; see the example below):
  - `PROJECT_NUMBER = ''`
  - `REGION = ''`
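Filled in, those two lines would look something like this (example values only):

```python
# dags/event_based_dataproc_serverless_pipeline.py, around lines 27-28
PROJECT_NUMBER = '123456789012'  # your numeric GCP project number
REGION = 'europe-west2'          # same region you set in the Makefile
```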
Then run:

```
make dags
```
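For context, an event-driven Dataproc Serverless DAG of this shape typically pairs a trigger on the source file with a `DataprocCreateBatchOperator` that runs the packaged PySpark code. The sketch below is illustrative only, not the repo's DAG: the operator arguments, GCS paths, and the use of a GCS sensor (rather than an external event triggering the DAG) are assumptions.

```python
# Illustrative sketch only: wait for the CSV to land in the data bucket,
# then run the packaged PySpark job as a Dataproc Serverless batch.
import pendulum
from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor

PROJECT_NUMBER = "123456789012"  # set on lines 27-28 of the real DAG
REGION = "europe-west2"

with DAG(
    dag_id="event_based_dataproc_serverless_pipeline_sketch",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule=None,
    catchup=False,
) as dag:
    wait_for_file = GCSObjectExistenceSensor(
        task_id="wait_for_stocks_csv",
        bucket=f"serverless-spark-airflow-data-{PROJECT_NUMBER}",
        object="stocks.csv",
    )

    run_batch = DataprocCreateBatchOperator(
        task_id="run_pyspark_batch",
        project_id="my-gcp-project-1234",  # PROJECT_ID from the Makefile
        region=REGION,
        batch_id="stocks-batch-{{ ds_nodash }}",
        batch={
            "pyspark_batch": {
                "main_python_file_uri": f"gs://serverless-spark-airflow-code-repo-{PROJECT_NUMBER}/main.py",
                "python_file_uris": [
                    f"gs://serverless-spark-airflow-code-repo-{PROJECT_NUMBER}/dataproc_serverless_airflow_0.1.0.zip"
                ],
            }
        },
    )

    wait_for_file >> run_batch
```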