Sprint project 01 - AnyoneAI ML Career

E-Commerce Data Pipeline (with Airflow)

This version of the project focuses entirely on automating the proposed ELT process with Airflow, which lets us schedule our data pipeline so that we always have the most recently processed data available to train our machine learning models.

[Screenshot: the ELT pipeline on the Airflow UI]

The transformed results will be saved in an SQLite database, olist_transformed.db.
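
As a rough sketch, the DAG defined in airflow/dags/etl_ml_pipeline.py could look like the following (the callables, their import paths and the schedule are assumptions for illustration, not the project's actual code):

# Hypothetical sketch of airflow/dags/etl_ml_pipeline.py.
# The extract/load/transform callables are assumed to live in the etl/
# package (extract.py, load.py, transform.py); their signatures are guesses.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from etl.extract import extract      # assumed entry point
from etl.load import load            # assumed entry point
from etl.transform import transform  # assumed entry point

with DAG(
    dag_id="etl_ml_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@monthly",  # monthly reports, per the business requirements
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # ELT order: raw data lands in olist.db first, then SQL transformations
    # produce olist_transformed.db.
    extract_task >> load_task >> transform_task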

Project Structure

ASSIGNMENT-...
│── airflow/
│   │── dags/
│   │   │── __pycache__/
│   │   │── dataset/
│   │   │── etl/
│   │   │   │── __pycache__/
│   │   │   │── __init__.py
│   │   │   │── config.py
│   │   │   │── dbconn.py
│   │   │   │── extract.py
│   │   │   │── load.py
│   │   │   │── transform.py
│   │   │── queries/
│   │   │── etl_ml_pipeline.py
│   │   │── olist_transformed.db
│   │   │── olist.db
│── logs/
│── airflow-webserver.pid
│── airflow.cfg
│── airflow.db
│── webserver_config.py
│── sp01_airflow_env/
│── sprint01_env/
│── README.md
│── requirements_elt.txt

The Business problem

You are working for one of the largest e-commerce sites in Latam, and they have asked the Data Science team to analyze company data to better understand their performance on specific metrics during the years 2016-2018.

There are two main areas they want to explore: Revenue and Delivery.

Basically, they would like to understand how much revenue they earned per year, which product categories were the most and least popular, and the revenue by state. On the other hand, it is also important to know how well the company is delivering the products sold to its users, on time and in good condition. For example, how long it takes to deliver a package depending on the month, and the difference between the estimated delivery date and the actual one.

About the data

You will consume and use data from two sources.

The first one is a Brazilian e-commerce public dataset of orders made at the Olist Store, provided as CSV files. This is real commercial data that has been anonymized. The dataset has information on 100k orders from 2016 to 2018 made at multiple marketplaces in Brazil. Its features allow viewing orders from multiple dimensions: from order status, price, payment, and freight performance to customer location, product attributes and, finally, reviews written by customers. You will find an image showing the database schema at images/data_schema.png. To get the dataset, please download it from this link, extract the dataset folder from the .zip file and place it into the root project folder. See ASSIGNMENT.md, section Project Structure, to validate you've placed the dataset as needed.

The second source is a public API: https://date.nager.at. You will use it to retrieve information about Brazil's Public Holidays and correlate that with certain metrics about the delivery of products.
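
A minimal sketch of that call, assuming the Nager.Date v3 endpoint GET /api/v3/PublicHolidays/{year}/{countryCode} and its documented response fields:

# Query Brazil's public holidays for one year from the Nager.Date API.
# Endpoint and field names (date, localName) follow the v3 API docs.
import requests

def get_brazil_holidays(year: int) -> list[dict]:
    url = f"https://date.nager.at/api/v3/PublicHolidays/{year}/BR"
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()

for holiday in get_brazil_holidays(2017):
    print(holiday["date"], holiday["localName"])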

Technical aspects

Because the team knows the data will come from different sources and formats, and these kinds of reports will probably have to be provided on a monthly or annual basis, they decided to build a data pipeline (ELT) they can execute from time to time to produce the results.

The technologies involved are:

  • Python as the main programming language
  • Pandas for consuming data from CSV files
  • Requests for querying the public holidays API
  • SQLite as a database engine
  • SQL as the main language for storing, manipulating, and retrieving data in our Data Warehouse
  • Airflow for ELT process automation

Installation

A requirements_elt.txt file is provided with all the Python libraries needed to run this project. To install the dependencies, just run:

$ pip install -r requirements_elt.txt

Note: We encourage you to install those inside a virtual environment.

Airflow version 2.10.5 is recommended, installed in its own environment.
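
For example, following Airflow's constraint-based install and reusing the sp01_airflow_env name from the project structure (the Python version in the constraints URL is an assumption; match it to your interpreter):

$ python -m venv sp01_airflow_env
$ source sp01_airflow_env/bin/activate
$ pip install "apache-airflow==2.10.5" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.10.5/constraints-3.10.txt"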


dataset

It contains all the .csv files with the information that will be used in the project (a Pandas loading sketch follows the list).

  • dataset/olist_customers_dataset.csv: csv with info regarding the location of the customers.
  • dataset/olist_order_items_dataset.csv: csv with info regarding the shipping.
  • dataset/olist_order_payments_dataset.csv: csv with info regarding the payment.
  • dataset/olist_order_reviews_dataset.csv: csv with info regarding the clients' reviews.
  • dataset/olist_orders_dataset.csv: csv with info regarding the different dates of each sale's process.
  • dataset/olist_products_dataset.csv: csv with info regarding the details of each product.
  • dataset/olist_sellers_dataset.csv: csv with info regarding the location of the sellers.
  • dataset/product_category_name_translation.csv: csv with info regarding the translation of each category from Portuguese to English.
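
As a sketch of how the Extract step might read these files with Pandas (the parsed date columns come from the public Olist schema and are an assumption about what this project needs):

# Read the Olist CSVs into DataFrames; parse the order date columns so the
# delivery-time metrics can be computed later.
import pandas as pd

DATASET_DIR = "dataset"

orders = pd.read_csv(
    f"{DATASET_DIR}/olist_orders_dataset.csv",
    parse_dates=[
        "order_purchase_timestamp",
        "order_delivered_customer_date",
        "order_estimated_delivery_date",
    ],
)
customers = pd.read_csv(f"{DATASET_DIR}/olist_customers_dataset.csv")

print(orders.shape)  # roughly 100k orders, 2016-2018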

queries

It contains all the SQL queries and the code you will need to complete, which will later be used to create tables and plots (a sketch of running one of these queries follows the list).

  • queries/delivery_date_difference.sql: This query will return a table with two columns: State and Delivery_Difference. The first will have the letters that identify the states, and the second the average difference between the estimated delivery date and the date when the items were actually delivered to the customer.
  • queries/global_ammount_order_status.sql: This query will return a table with two columns: order_status and Amount. The first will have the different order status classes and the second the total amount of each.
  • queries/real_vs_estimated_delivered_time.sql: This query will return a table with the differences between the real and estimated delivery times by month and year. It will have these columns: month_no, with the month numbers going from 01 to 12; month, with the first 3 letters of each month (e.g. Jan, Feb); Year2016_real_time, Year2017_real_time and Year2018_real_time, with the average real delivery time per month of each year (NaN if it doesn't exist); and Year2016_estimated_time, Year2017_estimated_time and Year2018_estimated_time, with the average estimated delivery time per month of each year (NaN if it doesn't exist).
  • queries/revenue_by_month_year.sql: This query will return a table with the revenue by month and year. It will have these columns: month_no, with the month numbers going from 01 to 12; month, with the first 3 letters of each month (e.g. Jan, Feb); and Year2016, Year2017 and Year2018, with the revenue per month of each year (0.00 if it doesn't exist).
  • queries/revenue_per_state.sql: This query will return a table with two columns: customer_state and Revenue. The first will have the letters that identify the top 10 states with the most revenue, and the second the total revenue of each.
  • queries/top_10_least_revenue_categories.sql: This query will return a table with the top 10 least-revenue categories in English, the number of orders, and their total revenue. The first column will be Category, containing the top 10 least-revenue categories; the second Num_order, with the total amount of orders in each category; and the last Revenue, with the total revenue of each category.
  • queries/top_10_revenue_catgories.sql: This query will return a table with the top 10 revenue categories in English, the number of orders, and their total revenue. The first column will be Category, containing the top 10 revenue categories; the second Num_order, with the total amount of orders in each category; and the last Revenue, with the total revenue of each category.
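
As a sketch of running one of these queries against the warehouse with sqlite3 and Pandas (the table name olist_orders and the database path are assumptions about how the Load step populates olist.db):

# Run the order-status aggregation against the SQLite warehouse.
import sqlite3

import pandas as pd

QUERY = """
SELECT order_status,
       COUNT(*) AS Amount
FROM olist_orders
GROUP BY order_status;
"""

with sqlite3.connect("airflow/dags/olist.db") as conn:
    df = pd.read_sql(QUERY, conn)

print(df)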
