Corinne Hume - Branches: Corinne, visuals_wk2, readme_wk2, readme_wk3, presentation_wk3, presentation_notes_Corinne, readme_wk4
Elaine Ng - Branches: Elaine, technology.md, Machine_Learning, readme_wk2, readme_wk3, .gitignore, requirements.txt, presentation_notes_Elaine, readme_wk4
Luz Maria Cabral - Branches: Luz, readme_wk2, readme_wk3, presentation_notes_Luz, readme_wk4
Manupriya Sharma - Branches: Manupriya, Database, readme_wk2, readme_wk3, presentation_notes_Manupriya, readme_wk4
Speaker notes from presentation are in the "Presentation" folder
Link to Google Slides
Selected Topic: Brazilian eCommerce Business Trends
In this project, we will analyze customer orders placed between 2016 and 2018 with sellers on the Brazilian eCommerce platform Olist Store. From this data, we will first determine the number of customers by geographic region (looking at zip code prefix, city, and state), the most popular products, the number of repeat customers and purchases, and total purchases by date. After data cleaning and initial analysis, we will use machine learning to make predictions about customer behavior, giving sellers on Olist the opportunity to increase sales and grow their customer base.
Description of our source of data:
We chose the Brazilian eCommerce dataset from Kaggle for our analysis. This dataset contains approximately 100,000 customer orders, along with corresponding files on product information and English translations of product categories originally in Portuguese. Seller names in this dataset were anonymized and replaced with Game of Thrones House names. Nine files from the original Kaggle dataset were chosen for further analysis: olist_geolocation_dataset, olist_customers_dataset, olist_sellers_dataset, olist_products_dataset, olist_order_items_dataset, olist_orders_dataset, olist_order_payments_dataset, olist_order_reviews_dataset, and product_category_name_translation.
Data Source: https://www.kaggle.com/olistbr/brazilian-ecommerce
Questions we hope to answer with our data:
With this data, we hope to answer...
- What are business trends like for Olist, in general?
- What data features impact review scores the most?
- Can we predict review scores using machine learning?
We used Postgres to create data tables in SQL. The Entity Relationship Diagram (ERD) below shows the connectivity between the nine data tables used in our analysis.
The description of these tables is as follows:
- olist_orders_dataset: This table is connected to 4 other tables. It is used to connect all the details related to an order.
- olist_order_items_dataset: It contains the details of each purchased item, such as shipping date and price.
- olist_order_reviews_dataset: It contains details of any reviews posted by a customer on a product they purchased.
- olist_products_dataset: It contains information related to a product, such as its ID, category name, and measurements.
- olist_order_payments_dataset: The information contained in this table is related to the payment details associated with a particular order.
- olist_customers_dataset: Details the customer base information of this firm.
- olist_geolocation_dataset: It contains geographical information related to both the sellers and customers.
- olist_sellers_dataset: This table contains the information related to all the sellers who have registered with this firm.
- olist_product_category_name_translation: This table is connected to the products table and provides English translations of the Portuguese product category names.
Individual datasets were cleaned using Pandas in a Python 3 Jupyter Notebook before the data tables were imported to a PostgreSQL server.
SQL inner joins were used to connect all nine data tables for our machine learning model.
Link to Data_Cleaning Jupyter Notebook
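A minimal sketch of the kind of Pandas cleaning applied to each dataset (column names follow the Kaggle files; the exact steps are in the Data_Cleaning notebook):

```python
import pandas as pd

# Read one of the raw Kaggle files
orders = pd.read_csv("olist_orders_dataset.csv")

# Drop duplicate rows and rows missing key identifiers
orders = orders.drop_duplicates().dropna(subset=["order_id", "customer_id"])

# Convert timestamp columns to datetimes so delivery times can be computed later
date_cols = ["order_purchase_timestamp", "order_delivered_customer_date",
             "order_estimated_delivery_date"]
for col in date_cols:
    orders[col] = pd.to_datetime(orders[col], errors="coerce")
```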
We used an AWS free tier template to create our RDS database on AWS. We learned that free templates are only available for Postgres 12 and higher versions.
We made our database publicly accessible ("accessible from anywhere") so it can be reached from outside the default VPC.
Once the database was created on AWS, we created a server named final_project in Postgres.
We used the schema we created before to create all the tables again and then join them using inner joins.
The final joined database was imported into a Jupyter Notebook; it has a total of 34 columns and 91,596 rows.
We created a profile of our data to understand the relationships between the various features. We realized many features in our data did not affect the review score, so we dropped those columns. For example, features like "customer_city", "customer_state", "geolocation_lat", "geolocation_lng", "customer_id", "order_purchase_timestamp", "order_approved_at", "order_delivered_carrier_date", and "payment_sequential" were dropped.
We confirmed our results by creating a correlation matrix:
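A minimal sketch of how the correlation check and column drop might look, assuming `df` is the joined DataFrame:

```python
# Correlate the numeric features against review_score
corr_matrix = df.select_dtypes("number").corr()
print(corr_matrix["review_score"].sort_values(ascending=False))

# Drop the features that showed little relationship to review_score
columns_to_drop = ["customer_city", "customer_state", "geolocation_lat",
                   "geolocation_lng", "customer_id", "order_purchase_timestamp",
                   "order_approved_at", "order_delivered_carrier_date",
                   "payment_sequential"]
df = df.drop(columns=columns_to_drop)
```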
After looking at the correlation matrix, we decided to keep the following columns in our data:
- zipcode
- order_status
- price
- freight_value
- review_score
- payment_type
- payment_value
- product_id
- product_photos_qty
- product_category
- seller_zip
- seller_state
- time_order_to_delivery
- time_estimate_to_delivery
We used the following methods to encode the categorical variables (sketched below):
- lambda function
- label encoder
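A minimal sketch of both encoding approaches; which columns received which method is an assumption here:

```python
from sklearn.preprocessing import LabelEncoder

# Lambda function: map a text column to integers directly
df["order_status"] = df["order_status"].apply(lambda s: 1 if s == "delivered" else 0)

# LabelEncoder: convert the remaining text columns to integer codes
le = LabelEncoder()
for col in ["payment_type", "product_category", "seller_state", "product_id"]:
    df[col] = le.fit_transform(df[col])
```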
Our final data for machine learning has 13 columns:
As a first step to connect with the AWS database, we exported our DataFrame to a CSV file.
Once psycopg2 was imported, we created a config.py file to store the details for accessing our database. To connect, we needed the following details:
- Endpoint
- Port
- Name
- User’s Name
- User’s Password
After importing psycopg2 and our config file, we created a function that connects to the database and sets up a cursor. This function uses our credentials from our config.py file to create the conn_string, and uses the conn_string to create the connection to our database hosted by AWS.
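A minimal sketch of that connection function, assuming the credentials live in config.py under names like ENDPOINT, PORT, DBNAME, USER, and PASSWORD:

```python
import psycopg2
import config  # config.py holds ENDPOINT, PORT, DBNAME, USER, PASSWORD (names assumed)

def connect():
    """Connect to the AWS RDS Postgres database and return the connection and a cursor."""
    conn_string = (
        f"host={config.ENDPOINT} port={config.PORT} "
        f"dbname={config.DBNAME} user={config.USER} password={config.PASSWORD}"
    )
    conn = psycopg2.connect(conn_string)
    cursor = conn.cursor()
    return conn, cursor
```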
Once we had a connection and a cursor, we wrote SQL queries in Python. PostgreSQL queries from Python using the psycopg2 library need four elements:
- Establish Connection
- Establish Cursor
- Execute Cursor
- Commit Connection
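A hedged example of those four elements put together, using the connect() helper sketched above; the table definition is illustrative, not our actual schema:

```python
conn, cursor = connect()  # establish connection and cursor

# Execute a query through the cursor, then commit it on the connection
cursor.execute("""
    CREATE TABLE IF NOT EXISTS final_customers_sql (
        order_id VARCHAR,
        review_score INT
    );
""")
conn.commit()
```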
Once we had our tables, we copied data from CSV files with psycopg2 using the copy_from() method.
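A short sketch of the copy_from() step, continuing from the connection above (the CSV name is a placeholder):

```python
# Bulk-load a cleaned CSV into its Postgres table
with open("final_customers.csv", "r") as f:
    next(f)  # skip the header row; copy_from() expects raw data rows
    cursor.copy_from(f, "final_customers_sql", sep=",")
conn.commit()
```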
Step 10: Importing data from the AWS database to a Jupyter Notebook for Machine Learning (code included in the picture above)
We used pandas to read the SQL database into the Jupyter Notebook. We used the final_customers_sql file to create our machine learning models.
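A minimal sketch of reading the joined table back into pandas over the psycopg2 connection:

```python
import pandas as pd

# Pull the joined table from AWS into a DataFrame for the machine learning work
df = pd.read_sql("SELECT * FROM final_customers_sql", conn)
print(df.shape)
```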
Link to Machine Learning jupyter notebook
Goal: to create a machine learning model that predicts review score. We converted review score to a binary variable using the following code:
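A sketch of that conversion; the exact cutoff is an assumption (here scores of 4 or 5 count as positive):

```python
# Collapse the 1-5 star review_score into a binary target:
# 1 = positive review (4-5 stars), 0 = negative/neutral (1-3 stars)
df["review_score"] = df["review_score"].apply(lambda score: 1 if score > 3 else 0)
```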
Description of how data was split into training and testing sets:
The goal of the training and testing sets is to create a machine learning model that predicts review scores. We used the review score as the target column (y); X is everything except the review score. We then split X and y into training and testing sets (X_train, X_test, y_train, and y_test) using a 75/25 split, so 25% of the data was held out for testing.
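A minimal sketch of the split described above:

```python
from sklearn.model_selection import train_test_split

# y is the target (review_score); X is every other column
y = df["review_score"]
X = df.drop(columns=["review_score"])

# 75/25 split: 25% of the rows are held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)
```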
Resampling
Our data has more positive reviews than negative reviews, as the pie chart below shows.
We used Random undersampling to resample our data using the following code:
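A sketch of the undersampling step with imbalanced-learn (the random_state is an assumption):

```python
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop rows from the majority (positive-review) class until the classes balance
rus = RandomUnderSampler(random_state=1)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
```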
Model Choice:
As our target variable is binary, we chose a classification model. We created 5 different models and compared their accuracies. We got accuracy above 70% for all the models except the Artificial Neural Network, which gave us only 50% accuracy. Random Forest performed best on our data, so we used this model for our web app.
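A minimal sketch of the best-performing model, Random Forest; the hyperparameters shown are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train on the resampled data, evaluate on the untouched test set
rf_model = RandomForestClassifier(n_estimators=100, random_state=1)
rf_model.fit(X_resampled, y_resampled)

y_pred = rf_model.predict(X_test)
print(f"Random Forest accuracy: {accuracy_score(y_test, y_pred):.2f}")
```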
Summary of Machine Learning Models Performance
- Linear Logistic Regression: We got an accuracy of 74%. Logistic regression is easy to implement and interpret, and very efficient to train, but it is difficult to capture complex relationships with it and it can overfit.
- K-NN model: We hoped for higher accuracy with K-NN, but at 67% it scored lower than logistic regression. One benefit of the KNN algorithm is that it doesn't require training before making predictions; new data can be added seamlessly without impacting the accuracy of the algorithm.
- Decision Tree: We got an accuracy of ~92%. The Decision Tree algorithm is very intuitive and easy to understand, but a small change in the data can significantly change the prediction.
- Ensemble - Random Forest: As this combines many decision trees, it gave us the best accuracy, ~98%. Random Forest is less prone to overfitting, but it is slower to run.
- Artificial Neural Network: We hoped to get the highest accuracy with this model, but it gave us only 50%. ANN can overfit the data and takes a lot of time to run.
Here is a graph that shows the comparison of all the models we tested:
We used the Random Forest feature importance technique (since Random Forest had the highest accuracy, ~98%, among all the models) to score the input features by how useful they are at predicting the target variable. As the picture below shows, the most important feature is "time_estimate_to_delivery", the difference between the estimated delivery date and the actual delivery date, and the least important feature is "order_status."
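A short sketch of how the importance scores can be pulled from the fitted model:

```python
# Rank the input features by their Random Forest importance scores
importances = sorted(zip(rf_model.feature_importances_, X.columns), reverse=True)
for score, feature in importances:
    print(f"{feature}: {score:.3f}")
```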
We used Streamlit to create an interactive web app that predicts the review score for various combinations of the features. We used the Random Forest classifier model to make the prediction.
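A hedged sketch of the Streamlit app; the file names, the widgets shown, and the idea of persisting the model and its feature columns with joblib are assumptions, not our exact implementation:

```python
import joblib
import pandas as pd
import streamlit as st

# Load the trained Random Forest and the feature columns it expects (placeholder file names)
model = joblib.load("rf_model.joblib")
feature_columns = joblib.load("feature_columns.joblib")

st.title("Olist Review Score Predictor")

# Start every feature at 0 and let the user adjust a few of them
inputs = {col: 0 for col in feature_columns}
inputs["price"] = st.number_input("Price", min_value=0.0)
inputs["freight_value"] = st.number_input("Freight value", min_value=0.0)
inputs["time_estimate_to_delivery"] = st.slider(
    "Days between estimated and actual delivery", -30, 60, 0)

if st.button("Predict review score"):
    row = pd.DataFrame([inputs], columns=feature_columns)
    st.write("Predicted review:",
             "Positive" if model.predict(row)[0] == 1 else "Negative")
```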
Link to dashboard on Google Slides
Link to video of dashboard interactivity
We used Tableau to create dashboards showing sales by product category and customer review scores by product category.
The sales dashboard shows the most popular and least popular product categories overall, by geographic region, and by date.
The review dashboard illustrates the average customer review scores by product category and state.
Interactive elements:
- Graphs that can be filtered by year
- Maps that can be filtered by customer review scores, product category, and city