Olist is a Brazilian department store platform that operates in the e-commerce segment as a Software as a Service (SaaS) provider. The service manages the sales process between shopkeepers and clients and also includes a customer satisfaction report. The advantages for shopkeepers are a better market presence and transparent reputation metrics. The data provided by Olist consists of 9 datasets with the following information:
1- Orders : contains information about the order id, status and timestamps of each step of the delivery process
2- Order items : contains order ids, SKU (Stock Keeping Unit), the seller, price and shipping expense
3- Products : contains technical information about the products (dimensions and weight)
4- Order payments : contains information about payment type, installments and purchase value
5- Order reviews : contains information such as review id and score
6- Sellers : contains information about the sellers' location: zip code, city and state
7- Customers : contains information about the customers' location: zip code, city and state
8- Geolocation : contains detailed information about the locations where the commerce occurred (both customers and sellers)
9- Product category name translation : contains the English translation of some of the product categories sold on the platform
The links between these datasets can be represented as follows:
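In code, these links amount to key-based joins between the tables. The following is a minimal sketch rather than the project's actual merging code: it uses tiny in-memory tables in place of the real CSV files, and the column names (`order_id`, `customer_id`, `product_id`, `seller_id`, `customer_state`, `product_weight_g`) are assumptions based on the public release of the dataset.

```python
import pandas as pd

# Tiny stand-in tables; the column names are assumed, not taken from this README.
orders = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "customer_id": ["c1", "c2"],
    "order_status": ["delivered", "shipped"],
})
order_items = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "product_id": ["p1", "p2", "p1"],
    "seller_id": ["s1", "s1", "s2"],
    "price": [10.0, 25.5, 10.0],
})
customers = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "customer_state": ["SP", "RJ"],
})
products = pd.DataFrame({
    "product_id": ["p1", "p2"],
    "product_weight_g": [300, 1200],
})

# Start from order items (one row per item) and attach order, customer
# and product attributes. validate= raises if a key on the right-hand
# table is unexpectedly duplicated, which would silently multiply rows.
merged = (
    order_items
    .merge(orders, on="order_id", how="left", validate="many_to_one")
    .merge(customers, on="customer_id", how="left", validate="many_to_one")
    .merge(products, on="product_id", how="left", validate="many_to_one")
)

print(merged[["order_id", "product_id", "customer_state", "price"]])
```

Left joins keep every order item even when a lookup table has no matching row, so missing links show up as NaN values instead of dropped rows.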
For business development in general, and for the supply chain specifically, understanding customer behavior and geographic conditions helps in making better decisions. By extracting commonly shared demographic and geodemographic characteristics, clusters (or segments) can be defined. This makes it possible to apply tailor-made strategies to target customers and to optimize the supply chain more effectively.
- Gather relevant information from the datasets
- Visually explore the dataset to understand more about the business and its trends
- Build clustering models to be able to best segment the customers of the company
- Part I : Preliminary processing and merging datasets
- Part II : Feature engineering and exploratory data analysis
- Part III : Customer clustering
- Power BI dashboards
- K-means (centroid-based clustering)
- DBSCAN (density-based clustering)
- Gaussian Mixture (distribution-based clustering)
- Silhouette plot and score
- Elbow plot
- Calinski-Harabasz (CH) index
- Davies-Bouldin (DB) index
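
As an illustration of how the three clustering families and the evaluation metrics listed above fit together, here is a minimal scikit-learn sketch. It is not the project's notebook code: the data are synthetic blobs standing in for the engineered customer features, and the hyperparameters (`n_clusters=3`, `eps=0.3`, `min_samples=5`) and the `evaluate` helper are illustrative assumptions.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer feature matrix (assumption).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
X = StandardScaler().fit_transform(X)

# Data for an elbow plot: K-means inertia for k = 1..6.
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(1, 7)
}

# One model per clustering family.
labels = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
    "dbscan": DBSCAN(eps=0.3, min_samples=5).fit_predict(X),
    "gmm": GaussianMixture(n_components=3, random_state=0).fit_predict(X),
}


def evaluate(X, y):
    """Hypothetical helper: score one labelling, ignoring DBSCAN noise (-1)."""
    mask = y != -1
    if len(set(y[mask])) < 2:
        return None  # the metrics need at least two clusters
    return {
        "silhouette": silhouette_score(X[mask], y[mask]),          # higher is better
        "calinski_harabasz": calinski_harabasz_score(X[mask], y[mask]),  # higher is better
        "davies_bouldin": davies_bouldin_score(X[mask], y[mask]),  # lower is better
    }


scores = {name: evaluate(X, y) for name, y in labels.items()}

for name, result in scores.items():
    print(name, result)
```

Plotting `inertias` against k gives the elbow plot. The three scores are only comparable across models fitted on the same feature matrix, and dropping DBSCAN noise points before scoring tends to flatter DBSCAN, so the share of noise points is worth reporting alongside the scores.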