Build a modern data warehouse with PostgreSQL/Snowflake/DuckDB, including ETL processes, data modelling, and analytics.
Source: Data With Baraa
We use Ubuntu 20.04 LTS machines on AWS EC2 and GCP for this project.
We need a few prerequisites. To make things easier, I have scripted them; just clone my repo and run the instructions below.
sudo apt update && sudo apt install git make -y
git clone https://github.com/deepakramani/sql-data-warehouse-project.git
cd sql-data-warehouse-project
make install_docker
source ~/.bashrc
Log out and log back in to the instance. To test whether Docker is working, run
docker run --rm hello-world # should return "Hello from Docker!" without errors
Set environment variables:
export POSTGRES_USER=postgres
export POSTGRES_PASSWORD=postgres
export POSTGRES_HOST=postgres
export POSTGRES_DB=sql_dwh_db
Now we're ready to start our project.
cd ~/sql-data-warehouse-project
make up # creates the database and schemas for the medallion architecture
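Behind `make up`, an initialization script creates one schema per medallion layer inside the database. A minimal sketch of what such a script looks like (the repo's actual script may differ slightly):

```sql
-- Illustrative initialization: one schema per medallion layer in sql_dwh_db.
CREATE SCHEMA IF NOT EXISTS bronze;  -- raw data, loaded as-is from the sources
CREATE SCHEMA IF NOT EXISTS silver;  -- cleansed and transformed data
CREATE SCHEMA IF NOT EXISTS gold;    -- business-ready views (data marts)
```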
We begin the ETL process of the data pipeline with the bronze layer. This layer is usually called the staging layer, as raw data from various sources is dumped into temporary tables. We choose a full-extract load type here instead of an incremental one. The raw data comes as CSV files.
cd ~/sql-data-warehouse-project
make setup_bronze_tables
make populate_bronze_tables
Now the bronze layer tables have raw data populated.
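For reference, a bronze table and its full load boil down to a plain DDL plus a bulk `COPY`. The sketch below uses an illustrative CRM customer table and a hypothetical container path; the exact tables and paths live in the repo's bronze scripts:

```sql
-- Sketch of a bronze (staging) table: columns mirror the source CSV as-is.
CREATE TABLE IF NOT EXISTS bronze.crm_cust_info (
    cst_id             INT,
    cst_key            VARCHAR(50),
    cst_firstname      VARCHAR(50),
    cst_lastname       VARCHAR(50),
    cst_marital_status VARCHAR(50),
    cst_gndr           VARCHAR(50),
    cst_create_date    DATE
);

-- Full extract: wipe the table and reload the entire CSV on each run.
TRUNCATE TABLE bronze.crm_cust_info;
COPY bronze.crm_cust_info
FROM '/data/source_crm/cust_info.csv'   -- hypothetical path inside the container
WITH (FORMAT csv, HEADER true);
```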
(The scripts are designed to be run only once at the beginning, since the downstream gold layer tables depend on the silver tables.)
cd ~/sql-data-warehouse-project
make setup_silver_tables
In the silver layer, data cleansing and transformation are carried out on the bronze layer tables. Data cleansing includes the following (a short SQL sketch follows the list):
- Removing duplicate entries
- Data filtering
- Handling missing/invalid data
- Handling unwanted white spaces
- Data type casting
- Outlier detection
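A minimal sketch of what this cleansing can look like in SQL, reusing the illustrative customer table from the bronze sketch (the repo's silver scripts may differ):

```sql
-- Deduplicate, trim whitespace, and drop rows with a missing key while
-- moving customer data from bronze to silver (illustrative names).
SELECT
    cst_id,
    TRIM(cst_key)       AS cst_key,
    TRIM(cst_firstname) AS cst_firstname,
    TRIM(cst_lastname)  AS cst_lastname,
    cst_create_date                          -- type casts would also go here
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY cst_id
                              ORDER BY cst_create_date DESC) AS rn
    FROM bronze.crm_cust_info
    WHERE cst_id IS NOT NULL                 -- handle missing/invalid keys
) latest
WHERE rn = 1;                                -- keep only the newest row per customer
```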
Data transformation includes the following (again with a sketch after the list):
- Data enrichment
- Data integration
- Deriving new columns
- Data aggregations
- Applying business rules and logic
- Data normalisation and standardisation
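And a sketch of the transformation side, again with illustrative table and column names rather than the repo's exact ones:

```sql
-- Standardise coded values, derive new columns, and apply a simple
-- business rule on a bronze product table (illustrative names).
SELECT
    prd_id,
    prd_key,
    SUBSTRING(prd_key FROM 1 FOR 5) AS cat_id,        -- derived column
    CASE UPPER(TRIM(prd_line))
        WHEN 'M' THEN 'Mountain'
        WHEN 'R' THEN 'Road'
        ELSE 'n/a'
    END                             AS prd_line,      -- standardisation
    COALESCE(prd_cost, 0)           AS prd_cost,      -- business rule: treat missing cost as 0
    prd_start_dt,
    LEAD(prd_start_dt) OVER (PARTITION BY prd_key
                             ORDER BY prd_start_dt) - 1 AS prd_end_dt  -- derived end date
FROM bronze.crm_prd_info;
```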
cd ~/sql-data-warehouse-project
make populate_silver_tables
As the data integration image above indicates, we design the silver tables around how the source tables relate to one another.
We use the image below to see how data flows downstream from the sources to the destination views in the gold layer.
Source: Data With Baraa
Before the dimension and fact tables are created, it is important to know the relationships between the tables. The data integration image below shows how one table is related to another. This diagram helps in making joins with other tables using the relevant keys.
cd ~/sql-data-warehouse-project
make setup_gold_layer
In the image below, we can see how the gold layer dimension and fact tables/views (data marts) are created.
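To illustrate the pattern, a gold dimension view joins the related silver tables on their shared business key and adds a surrogate key. The names below are assumptions carried over from the earlier sketches, not necessarily the repo's exact definitions:

```sql
-- A gold dimension exposed as a view over integrated silver tables,
-- with a generated surrogate key (illustrative names).
CREATE OR REPLACE VIEW gold.dim_customers AS
SELECT
    ROW_NUMBER() OVER (ORDER BY ci.cst_id) AS customer_key,  -- surrogate key
    ci.cst_id                              AS customer_id,
    ci.cst_firstname                       AS first_name,
    ci.cst_lastname                        AS last_name,
    la.cntry                               AS country         -- enriched from the ERP source
FROM silver.crm_cust_info ci
LEFT JOIN silver.erp_loc_a101 la
       ON ci.cst_key = la.cid;             -- join on the shared business key
```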
Testing data quality is an integral part of the ETL process. It ensures bad data doesn't get passed on to stakeholders or business users, potentially avoiding dire consequences.
Here we check data integrity, consistency, and correctness: primary key uniqueness, standardization, referential integrity, and the logical correctness of business rules.
cd ~/sql-data-warehouse-project
make check_silver_data_quality
cd ~/sql-data-warehouse-project
make check_gold_data_quality
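The checks behind these targets are plain SQL queries that should return zero rows when the data is healthy. A few representative examples, using the illustrative names from the sketches above:

```sql
-- Primary key should be unique and not null: expect zero rows.
SELECT cst_id, COUNT(*)
FROM silver.crm_cust_info
GROUP BY cst_id
HAVING COUNT(*) > 1 OR cst_id IS NULL;

-- No unwanted leading/trailing whitespace: expect zero rows.
SELECT cst_firstname
FROM silver.crm_cust_info
WHERE cst_firstname <> TRIM(cst_firstname);

-- Referential integrity: every fact row should find its dimension row.
SELECT f.*
FROM gold.fact_sales f
LEFT JOIN gold.dim_customers c ON f.customer_key = c.customer_key
WHERE c.customer_key IS NULL;
```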
Gold layer views are usually used by business users, so helping them understand what each row in a table/view represents is important. Therefore I provide a catalog file with the metadata of the views created.
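The catalog itself is a documentation file, but as an aside, the same metadata could also be attached inside the database with SQL comments. This is an alternative sketch, not what the repo's catalog does:

```sql
-- Optional: attach descriptions directly to the gold views so that
-- BI tools and psql's \d+ can surface them.
COMMENT ON VIEW gold.dim_customers IS
    'One row per customer, integrated from CRM and ERP sources.';
COMMENT ON COLUMN gold.dim_customers.customer_key IS
    'Surrogate key generated in the gold layer; use it to join fact views.';
```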
Some improvements that can be made in the future:
- Take historical data into account
- Do incremental data loads instead of a full load each time
- Make use of an orchestration tool such as Dagster to orchestrate ETL process
- Do EDA on the gold layer and derive reports for business users
To bring down all containers and return to the original state, run the following instructions:
make down