An Apache Airflow ETL pipeline to process employee data by downloading, validating, transforming, and loading it into a PostgreSQL database.
Requirements:

- Python 3.9+
- Apache Airflow 2.8+

Setup:

- Clone the repository:

  ```bash
  git clone https://github.com/FistGang/ProcessEmployeesETL.git
  cd ProcessEmployeesETL
  ```

- Set up a virtual environment (optional):

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure Airflow:

  - Initialize the Airflow database:

    ```bash
    airflow db init
    ```

  - Create a PostgreSQL connection in Airflow with ID `pg_conn`:

    - Connection Id: `pg_conn`
    - Connection Type: `postgres`
    - Host: `postgres`
    - Schema: `airflow`
    - Login: `airflow`
    - Password: `airflow`
    - Port: `5432`
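
    If you prefer to script this step rather than click through the web UI, the same connection can also be created programmatically. A minimal sketch (the values mirror the list above; adjust them to your environment):

    ```python
    # Minimal sketch: register the pg_conn connection from Python.
    # Values mirror the connection details listed above.
    from airflow import settings
    from airflow.models import Connection

    conn = Connection(
        conn_id="pg_conn",
        conn_type="postgres",
        host="postgres",
        schema="airflow",
        login="airflow",
        password="airflow",
        port=5432,
    )

    session = settings.Session()
    # Avoid creating a duplicate if the connection already exists.
    if not session.query(Connection).filter_by(conn_id=conn.conn_id).first():
        session.add(conn)
        session.commit()
    session.close()
    ```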
- Start Airflow:

  ```bash
  airflow webserver --port 8080
  airflow scheduler
  ```

- Deploy the DAG:

  - Place `process_employees.py` in your Airflow DAGs folder.

Alternatively, you can run Airflow with Docker Compose:

```bash
# Download the docker-compose.yaml file
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml'

# Make expected directories and set an expected environment variable
mkdir -p ./dags ./logs ./plugins
echo -e "AIRFLOW_UID=$(id -u)" > .env

# Initialize the database
docker-compose up airflow-init

# Start up all services
docker-compose up
```

Then create a PostgreSQL connection in Airflow with ID `pg_conn`, using the same values as above.
- Trigger the DAG:

  - Open the Airflow web UI at `http://localhost:8080`.
  - Activate and manually trigger the `process_employees_etl_pipeline` DAG.

- Monitor:

  - Use the Airflow UI to monitor DAG execution and check logs for errors.

The DAG runs the following steps:

- Create Tables:

  - `employees`: the main table.
  - `employees_temp`: a temporary table for data processing.
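
  A sketch of how these two tasks might look (the real schema is defined in `process_employees.py`, so the column names below are illustrative assumptions):

  ```python
  # Sketch of the table-creation tasks. The actual schema lives in
  # process_employees.py; the columns here are illustrative assumptions.
  from airflow.providers.postgres.operators.postgres import PostgresOperator

  create_employees_table = PostgresOperator(
      task_id="create_employees_table",
      postgres_conn_id="pg_conn",
      sql="""
          CREATE TABLE IF NOT EXISTS employees (
              serial_number NUMERIC PRIMARY KEY,
              company_name TEXT,
              employee_name TEXT,
              description TEXT,
              leave INTEGER,
              leave_category TEXT,
              processed_at TIMESTAMPTZ
          );
      """,
  )

  create_employees_temp_table = PostgresOperator(
      task_id="create_employees_temp_table",
      postgres_conn_id="pg_conn",
      sql="""
          DROP TABLE IF EXISTS employees_temp;
          CREATE TABLE employees_temp AS TABLE employees WITH NO DATA;
      """,
  )
  ```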
- Get Data:

  - Download and validate the employee CSV file.
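
  A sketch of this task; the source URL, file path, and expected header are assumptions, not the project's actual values:

  ```python
  # Sketch of the download-and-validate task. URL, path, and header
  # are placeholders standing in for the project's real values.
  import os

  import requests
  from airflow.decorators import task

  DATA_URL = "https://example.com/employees.csv"  # hypothetical source
  CSV_PATH = "/opt/airflow/dags/files/employees.csv"
  EXPECTED_HEADER = "Serial Number,Company Name,Employee Name,Description,Leave"

  @task
  def get_data() -> str:
      response = requests.get(DATA_URL, timeout=30)
      response.raise_for_status()
      # Basic validation: reject the file if the header row is unexpected.
      header = response.text.splitlines()[0].strip()
      if header != EXPECTED_HEADER:
          raise ValueError(f"Unexpected CSV header: {header!r}")
      os.makedirs(os.path.dirname(CSV_PATH), exist_ok=True)
      with open(CSV_PATH, "w") as f:
          f.write(response.text)
      return CSV_PATH
  ```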
- Transform Data:

  - Add a processed timestamp, format columns, and categorize leave.
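
  A pandas-style sketch of this step (the leave buckets and cut points are assumptions):

  ```python
  # Sketch of the transformation task: normalize column names, bucket
  # the leave count, and stamp rows with a processing time.
  import pandas as pd
  from airflow.decorators import task

  @task
  def transform_data(csv_path: str) -> str:
      df = pd.read_csv(csv_path)
      # Format columns: "Serial Number" -> "serial_number", etc.
      df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
      # Categorize leave into coarse buckets (cut points are illustrative).
      df["leave_category"] = pd.cut(
          df["leave"],
          bins=[-1, 5, 15, float("inf")],
          labels=["low", "medium", "high"],
      )
      # Record when this row was processed.
      df["processed_at"] = pd.Timestamp.now(tz="UTC")
      out_path = csv_path.replace(".csv", "_transformed.csv")
      df.to_csv(out_path, index=False)
      return out_path
  ```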
- Load Data:

  - Load the transformed data into `employees_temp`.
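
  A sketch of the load, bulk-copying the transformed CSV over the `pg_conn` connection:

  ```python
  # Sketch of the load task: COPY the transformed CSV into employees_temp.
  from airflow.decorators import task
  from airflow.providers.postgres.hooks.postgres import PostgresHook

  @task
  def load_data(csv_path: str) -> None:
      hook = PostgresHook(postgres_conn_id="pg_conn")
      # COPY is much faster than row-by-row INSERTs for CSV files.
      hook.copy_expert("COPY employees_temp FROM STDIN WITH CSV HEADER", csv_path)
  ```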
- Merge Data:

  - Merge data from `employees_temp` into `employees`.
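
  A sketch of the merge as a PostgreSQL upsert (column names mirror the earlier sketches and are assumptions):

  ```python
  # Sketch of the merge task: upsert rows from employees_temp into
  # employees, keyed on the assumed serial_number primary key.
  from airflow.decorators import task
  from airflow.providers.postgres.hooks.postgres import PostgresHook

  @task
  def merge_data() -> None:
      sql = """
          INSERT INTO employees
          SELECT DISTINCT * FROM employees_temp
          ON CONFLICT (serial_number) DO UPDATE SET
              company_name = EXCLUDED.company_name,
              employee_name = EXCLUDED.employee_name,
              description = EXCLUDED.description,
              leave = EXCLUDED.leave,
              leave_category = EXCLUDED.leave_category,
              processed_at = EXCLUDED.processed_at;
      """
      PostgresHook(postgres_conn_id="pg_conn").run(sql)
  ```

  In the DAG file these sketches would be wired together in dependency order, e.g. `merge_data()` downstream of `load_data(transform_data(get_data()))`, after the table-creation tasks.
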
This project is licensed under the MIT License. See the LICENSE file for details.
For questions or issues, please open an issue on GitHub.