This repo captures the steps for installing and setting up Apache Airflow and building a few simple data pipelines, based on a Udemy tutorial by Marc.
- Create a Python virtual environment to have an isolated environment for this exercise
python3 -m venv sandbox
- Activate the venv (sandbox):
source sandbox/bin/activate
- Install Apache Airflow
pip install wheel
pip3 install apache-airflow==2.0.0 --constraint <constraints-file.txt>
In this case, we install Airflow 2.0.0; the constraint file pins dependency versions so the install is reproducible (use the constraints file the Airflow project publishes for your Airflow and Python versions). Wheel is a built-package format that avoids recompiling your software on every install.
- Initialize the Apache Airflow metastore
airflow db init
By default, Airflow uses SQLite as the metastore; it supports only the SequentialExecutor and is intended for development purposes only.
- Create an admin account
airflow users create \
--username admin \
--firstname Peter \
--lastname Parker \
--role Admin \
--email spiderman@superhero.org
- Start the web server (default port 8080) and start the scheduler in a new terminal, or pass the -D option to run each of them as a daemon
airflow webserver
airflow scheduler
Go to localhost:8080 in the browser and use the admin account you just created to log in.
To execute multiple tasks in parallel on a single machine, where each task runs as a subprocess, switch the metastore to Postgres and use the LocalExecutor (a sketch of a DAG with parallel tasks follows the setup steps below).
- Install Postgres for metastore
# Refresh the system package index
sudo apt update
# To install Postgres
sudo apt install postgresql
- Connect to Postgres and set a password for the postgres user
sudo -u postgres psql
ALTER USER postgres PASSWORD 'postgres';
- Install airflow postgres package
pip install 'apache-airflow[postgres]'
- Edit the airflow.cfg file to use Postgres as the metastore and set the LocalExecutor
sql_alchemy_conn = postgresql+psycopg2://postgres:postgres@localhost/postgres
executor = LocalExecutor
# Check that Airflow can reach the Postgres database
airflow db check
- Stop the Airflow webserver and scheduler
- Initialize the Apache Airflow metastore
airflow db init
- Re-create admin account
airflow users create \
--username admin \
--firstname Peter \
--lastname Parker \
--role Admin \
--email spiderman@superhero.org
- Start the web server and scheduler
airflow webserver
airflow scheduler
Go to localhost:8080 in the browser and verify that Airflow now runs with the Postgres metastore and the LocalExecutor.
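
With the LocalExecutor in place, tasks that do not depend on each other can run at the same time, each in its own subprocess. Below is a minimal sketch of such a DAG; the dag id, task ids and commands are illustrative and not part of the tutorial repo.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Two independent tasks that the LocalExecutor can run in parallel,
# each as its own subprocess, followed by a task that waits for both.
with DAG(
    dag_id="parallel_demo",            # illustrative dag id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    extract_a = BashOperator(task_id="extract_a", bash_command="sleep 5")
    extract_b = BashOperator(task_id="extract_b", bash_command="sleep 5")
    merge = BashOperator(task_id="merge", bash_command="echo merged")

    # extract_a and extract_b have no dependency between them, so they can
    # be scheduled at the same time; merge runs after both finish.
    [extract_a, extract_b] >> merge
```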
Structure of the Airflow User Data Pipeline
As part of this example, the following are covered (illustrative sketches of each piece follow this list):
- An ElasticHook to connect to Elasticsearch and add a document to an index
- A PgToEsOperator to fetch data from Postgres and push it into Elasticsearch
- A DAG that prints Elasticsearch cluster info, then reads connections from Postgres and adds them to Elasticsearch
- add more details here
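
A minimal sketch of what the ElasticHook could look like, assuming the elasticsearch Python client is installed and an Airflow connection named elastic_default exists (the class layout and connection id are assumptions, not the exact code in this repo):

```python
from airflow.hooks.base import BaseHook
from elasticsearch import Elasticsearch


class ElasticHook(BaseHook):
    """Thin wrapper around an Elasticsearch client (illustrative sketch)."""

    def __init__(self, conn_id="elastic_default", *args, **kwargs):
        super().__init__(*args, **kwargs)
        # "elastic_default" is an assumed Airflow connection id
        conn = self.get_connection(conn_id)
        self.es = Elasticsearch([f"http://{conn.host}:{conn.port}"])

    def info(self):
        # Return cluster info (name, version, ...)
        return self.es.info()

    def add_doc(self, index, doc):
        # Index a single document into the given index
        return self.es.index(index=index, body=doc)
```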
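The PgToEsOperator could be sketched roughly as below; parameter names, connection ids and the import path of the hook are illustrative assumptions:

```python
from airflow.models import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook

# Import path is illustrative; it depends on where ElasticHook lives in this repo
from hooks.elastic_hook import ElasticHook


class PgToEsOperator(BaseOperator):
    """Copy the result of a Postgres query into an Elasticsearch index (illustrative sketch)."""

    def __init__(self, sql, index, postgres_conn_id="postgres_default",
                 elastic_conn_id="elastic_default", **kwargs):
        super().__init__(**kwargs)
        self.sql = sql
        self.index = index
        self.postgres_conn_id = postgres_conn_id
        self.elastic_conn_id = elastic_conn_id

    def execute(self, context):
        pg = PostgresHook(postgres_conn_id=self.postgres_conn_id)
        es = ElasticHook(conn_id=self.elastic_conn_id)
        # Pull the query result with column names, then index each row as a document
        df = pg.get_pandas_df(self.sql)
        for doc in df.to_dict(orient="records"):
            es.add_doc(index=self.index, doc=doc)
```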
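And a sketch of a DAG that ties the pieces together; the dag id, table and index names, and import paths are assumptions for illustration:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Import paths are illustrative; they depend on where the hook and operator live in this repo
from hooks.elastic_hook import ElasticHook
from operators.pg_to_es_operator import PgToEsOperator


def _print_es_info():
    # Print Elasticsearch cluster info using the custom hook
    print(ElasticHook().info())


with DAG(
    dag_id="elastic_dag",              # illustrative dag id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    print_es_info = PythonOperator(
        task_id="print_es_info",
        python_callable=_print_es_info,
    )

    connections_to_es = PgToEsOperator(
        task_id="connections_to_es",
        sql="SELECT * FROM connection",   # Airflow's own connections table in the Postgres metastore
        index="connections",
        postgres_conn_id="postgres_default",
        elastic_conn_id="elastic_default",
    )

    print_es_info >> connections_to_es
```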