This repository contains an Apache Airflow-based ETL pipeline designed for processing Instacart sales data. The project also includes exploratory analysis on the transformed data conducted with SQL and Python.

Instacart Data Pipeline Using Apache Airflow

1. Project Overview

This project aims to create a comprehensive data processing pipeline and accompanying data analysis for the Instacart Market Basket Analysis dataset. The implementation involves

  • Setting up a PostgreSQL database
  • Creating a custom Python package to manage the ETL functions
  • Automating the workflow with a custom Docker image based on Apache Airflow
  • Performing exploratory analysis with Python and SQL

2. Project Structure

2.1. Database Setup

ERD Diagram

  • Creation: Create a PostgreSQL database named instacart_db
  • Initialization: The database is initialized to create tables, views, triggers and stored procedures. This can be done manually using the SQL scripts in the sql directory or by triggering the Database_Setup DAG
  • Views:
    • Organic vs non-organic products
    • Reorder proportion by product
    • Customers who always reorder
    • Top 10 products sold
  • Triggers:
    • Maintain an audit log whenever a new order is inserted into the fact_orders table
    • Ensure that add_to_cart_order is always a positive integer (see the sketch below)
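
As an illustration, the positive-integer rule could be enforced with a trigger like the one sketched below, applied here through psycopg2. The function, trigger, and table names are assumptions based on the ERD description; the actual definitions live in the sql directory.

# Hypothetical sketch of the add_to_cart_order trigger, applied via psycopg2.
# Function, trigger, and table names are illustrative assumptions.
import psycopg2

DDL = """
CREATE OR REPLACE FUNCTION check_add_to_cart_order()
RETURNS trigger AS $$
BEGIN
    IF NEW.add_to_cart_order IS NULL OR NEW.add_to_cart_order <= 0 THEN
        RAISE EXCEPTION 'add_to_cart_order must be a positive integer';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS trg_check_add_to_cart_order ON fact_orders;
CREATE TRIGGER trg_check_add_to_cart_order
BEFORE INSERT OR UPDATE ON fact_orders
FOR EACH ROW EXECUTE FUNCTION check_add_to_cart_order();
"""

with psycopg2.connect(host="localhost", port=5432, user="postgres",
                      password="postgres", dbname="instacart_db") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)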

2.2. Create instacart_data_pipeline Package

Set Environment Variables & Configuration

Create a .env file at the same level as setup.py

.env example

INSTACART_CONFIG_FILE_PATH=./path/to/your/config.yaml

config.yaml example

database:
  host: localhost
  port: 5432
  user: postgres
  password: postgres
  dbname: instacart_db

data_ingestion:
  source_url: https://p-def3.pcloud.com/D4ZyfN51vZ8h8Haj7ZZZddHxXkZ2ZZ8uXZkZPO7ZO8ZrHZSRZfiRp5ZLuqC8451j00IAlmoLLJNlQrWVTXV/instacart_dataset.zip
  download_dir: ./data/raw
  unzip_dir: ./data/raw
  filename: instacart_dataset.zip

data_transformation:
  source_dir: ./data/raw
  output_dir: ./data/processed
  
data_loading:
  source_dir: ./data/processed
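
The package presumably resolves this file through the INSTACART_CONFIG_FILE_PATH variable. A minimal sketch of such a loader, assuming python-dotenv and PyYAML (the load_config function is illustrative, not the package's actual API):

# Illustrative config loader: reads INSTACART_CONFIG_FILE_PATH from .env,
# then parses the YAML file it points to. Requires python-dotenv and PyYAML.
import os

import yaml
from dotenv import load_dotenv

def load_config() -> dict:
    load_dotenv()  # pull INSTACART_CONFIG_FILE_PATH from .env into the environment
    with open(os.environ["INSTACART_CONFIG_FILE_PATH"]) as f:
        return yaml.safe_load(f)

config = load_config()
print(config["database"]["dbname"])  # -> instacart_db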

Create a Source Distribution

The src folder contains all the source modules and packages. Use setup.py to create a source distribution of the instacart_data_pipeline package

python setup.py sdist
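
A minimal setup.py for this kind of src layout might look like the sketch below; the actual file in the repository may declare different metadata and dependencies.

# Illustrative setup.py for a src/ layout; metadata and dependencies are assumptions.
from setuptools import find_packages, setup

setup(
    name="instacart_data_pipeline",
    version="0.1.0",
    package_dir={"": "src"},
    packages=find_packages(where="src"),
    install_requires=["pandas", "PyYAML", "python-dotenv", "psycopg2-binary"],
)

The resulting archive lands in dist/ and can then be installed into the custom Airflow image.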

2.3. Build a Custom Airflow Image

Set Environment Variables

You can set environment variables in docker-compose.yml or create a .env file in the same directory. Provide the config file for the instacart_data_pipeline package in the airflow/config directory

.env example

AIRFLOW_UID=50000

instacart_config.yaml example

database:
  host: host.docker.internal
  port: 5432
  user: postgres
  password: postgres
  dbname: instacart_db

data_ingestion:
  source_url: https://p-def3.pcloud.com/D4ZyfN51vZ8h8Haj7ZZZddHxXkZ2ZZ8uXZkZPO7ZO8ZrHZSRZfiRp5ZLuqC8451j00IAlmoLLJNlQrWVTXV/instacart_dataset.zip
  download_dir: ./data/raw
  unzip_dir: ./data/raw
  filename: instacart_dataset.zip

data_transformation:
  source_dir: ./data/raw
  output_dir: ./data/processed
  
data_loading:
  source_dir: ./data/processed

Create the Custom Airflow Image

Navigate to the airflow directory

cd airflow

Build the image and start the Airflow containers

docker compose up --build
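
The customization typically amounts to installing the package sdist from section 2.2 on top of the official Airflow base image. A plausible Dockerfile sketch (the base image tag and sdist filename are assumptions):

# Hypothetical Dockerfile: extend the official Airflow image and install the package sdist
FROM apache/airflow:2.8.1
COPY dist/instacart_data_pipeline-0.1.0.tar.gz /tmp/
RUN pip install --no-cache-dir /tmp/instacart_data_pipeline-0.1.0.tar.gz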

2.4. Running the Container

Access the webserver

The webserver is available at http://localhost:8080. The default account has the login airflow and the password airflow.

Set Connections

Example connection

Connection Id   - postgres_default
Connection Type - postgres
Host            - host.docker.internal
Database        - instacart_db
Login           - postgres
Password        - postgres
Port            - 5432
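
Connections can also be supplied as AIRFLOW_CONN_<CONN_ID> environment variables instead of through the UI. An equivalent URI-style definition, using the password from the config above, would be:

AIRFLOW_CONN_POSTGRES_DEFAULT=postgres://postgres:postgres@host.docker.internal:5432/instacart_db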

Modify PostgreSQL Server Configuration

Ensure that the PostgreSQL server is up and running. By default, a PostgreSQL server running on the host accepts connections from localhost only, so two changes are needed: set listen_addresses in postgresql.conf so the server listens on the Docker bridge interface (for example, listen_addresses = '*'), and edit the pg_hba.conf file to include

host    all             all             172.17.0.1/32            scram-sha-256

This entry allows all users to connect to all databases from the IP address 172.17.0.1 (replace with your gateway address) using the scram-sha-256 authentication method.

To find your gateway address

# List the Docker networks (docker compose typically creates one named airflow_default)
docker network ls

# Inspect that network and look for the Gateway parameter
docker network inspect <network_name>

Restart the PostgreSQL service so the changes take effect

Stopping the Container

cd airflow
docker compose down

2.5. Exploratory Analysis Using Python & SQL

The dataset and schema are available....

Data analysis was conducted in both Python and SQL to answer the following questions (a sample query sketch follows the list):

  1. When do people order (hour of day, day of week)?
  2. How many days since the most recent order?
  3. How many prior orders are there?
  4. What is the most common basket size?
  5. What are the top 10 products sold?
  6. What is the proportion of reorders?
  7. What are the most frequently reordered products?
  8. Which are the top 10 items added to cart first?
  9. Association between days since last order and reorder trends
  10. Association between number of orders and probability of reordering
  11. Percentage of organic vs non-organic products
  12. Percentage of organic vs non-organic reorders
  13. Product portfolio
  14. Sales portfolio
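
For example, question 5 can be answered with a short pandas query over the raw CSVs. The sketch below assumes the file and column names of the public Instacart dataset rather than the notebook's exact code.

# Sketch: top 10 products sold, computed from the raw Instacart CSVs.
# Assumes order_products__prior.csv and products.csv under data/raw.
import pandas as pd

order_products = pd.read_csv("data/raw/order_products__prior.csv")
products = pd.read_csv("data/raw/products.csv")

top10 = (
    order_products.merge(products, on="product_id")
    .groupby("product_name")
    .size()
    .nlargest(10)
)
print(top10)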

3. How to Run the Pipeline

The preceding sections detail each step of the pipeline. This section provides a summary of the overall process.

  • Create the PostgreSQL database (on the host)
  • Ensure Docker Desktop is installed and running
  • Set the environment variables and config file for the Airflow container
  • Build and start the Airflow container
  • Navigate to http://localhost:8080 and log in with the default credentials (username airflow, password airflow)
  • Set the connections
  • Trigger the Database_Setup DAG (once)
  • The ETL_Pipeline DAG is scheduled to run daily (a minimal sketch is shown below)
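
For reference, a daily ETL DAG along these lines could be structured as in the sketch below. The task callables imported from instacart_data_pipeline are assumptions about the package's API; the real DAG definitions live in the airflow dags folder.

# Illustrative daily ETL DAG; the imported task callables are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from instacart_data_pipeline import ingest_data, load_data, transform_data

with DAG(
    dag_id="ETL_Pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=ingest_data)
    transform = PythonOperator(task_id="transform", python_callable=transform_data)
    load = PythonOperator(task_id="load", python_callable=load_data)

    ingest >> transform >> load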
