Supermarket Basket Analysis with Markovchain, Apriori, XGBoost and RNN
by Max Philipp, Ceyda Ugur and Vera Weidmann
M. Sc. Business Intelligence and Process Management, BSEL Berlin, Germany
This repository covers a competition posted on Kaggle.com. The project kicked off in May 2017 and ran for three months, with a deadline at the end of July 2017. The data tables are provided by Instacart.
Instacart, a grocery ordering and delivery app, aims to make it easy for customers to fill their refrigerator and pantry with their personal favorites and staples when they need them. After customers select products through the Instacart app, personal shoppers review the order and do the in-store shopping and delivery.
The purpose of this project is to predict the users' next orders based on their orders over time.
The R code can be found in the corresponding folder of this repository. The scripts also include explanations of our approach and the commands used.
- Markovchain
- Apriori
- XGBoost
- RNN (coming soon)
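To make the Apriori entry in the list above more concrete, here is a minimal Python sketch of its core idea: count single items first, then build candidate pairs only from items that are frequent on their own. It is an illustration on hypothetical toy baskets, not the R code from this repository; the function names `pair_support` and `confidence` are made up for this example.

```python
from collections import Counter
from itertools import combinations

# Toy baskets (hypothetical data, not taken from the Instacart tables).
baskets = [
    {"banana", "milk", "bread"},
    {"banana", "milk"},
    {"banana", "avocado"},
    {"milk", "bread"},
]

def pair_support(baskets, min_support):
    """Return {pair: support} for item pairs with support >= min_support."""
    n = len(baskets)
    # Pass 1: keep only items that are frequent on their own (Apriori pruning).
    item_counts = Counter(item for b in baskets for item in b)
    frequent_items = {i for i, c in item_counts.items() if c / n >= min_support}
    # Pass 2: count candidate pairs built from frequent items only.
    pair_counts = Counter()
    for b in baskets:
        for pair in combinations(sorted(b & frequent_items), 2):
            pair_counts[pair] += 1
    return {p: c / n for p, c in pair_counts.items() if c / n >= min_support}

def confidence(baskets, antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

frequent_pairs = pair_support(baskets, min_support=0.5)
```

With these toy baskets, `avocado` is dropped in the first pass, so no pair containing it is ever counted. On the real data, the same pruning keeps the number of candidate itemsets manageable.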
A comprehensive data analysis was carried out on the Databricks Community Edition (Spark). Databricks provides a Unified Analytics Platform that accelerates innovation by unifying data science, engineering and business. It is based on Apache Spark and supports SQL data analysis as well as Python and R. The data tables are easy to access and code executes quickly. Query results can also be visualized at the click of a button, which makes them easier to understand and interpret.
Comprehensive Data Analysis and Visualizations
Selected visualization results are presented below.
The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 users. For each user, Kaggle provides between 4 and 100 of their orders, with the sequence of products purchased in each order. The day of the week and hour of day the order was placed are also provided, along with a relative measure of time between orders.
To better understand the content of the csv files provided by Kaggle and the relationships between them, we visualized a schema of the data tables and their connections via the SQL architecture. The schema also specifies all data types and shows which table each column originates from.
As the schema shows, there are 6 csv files, which together form a relational data set describing customers' orders over time. Each entity (customer, product, order, aisle, etc.) has an associated unique id. The most important tables for the project task are order_products_prior and order_products_train. Both are linked to the orders and products tables; in other words, the products and orders tables feed the order_products_prior and order_products_train tables. These two tables are so important because they contain the “reordered” column, which is the basis for predicting customers' next orders.
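As a rough illustration of how the “reordered” column can be used, the following Python sketch links order-product rows to a products lookup and computes a reorder rate per product. The rows are hypothetical, and the column layout `(order_id, product_id, reordered)` is an assumption about the tables rather than something taken from the repository's R scripts.

```python
from collections import defaultdict

# Hypothetical rows mimicking the products and order_products_prior tables.
products = {1: "Banana", 2: "Organic Milk", 3: "Bread"}  # product_id -> name
order_products = [
    # (order_id, product_id, reordered)
    (10, 1, 0), (10, 2, 0),
    (11, 1, 1), (11, 3, 0),
    (12, 1, 1), (12, 2, 1),
]

def reorder_rate_by_product(order_products, products):
    """Return {product name: share of its order lines flagged as reordered}."""
    totals = defaultdict(int)
    reorders = defaultdict(int)
    for _order_id, product_id, reordered in order_products:
        totals[product_id] += 1
        reorders[product_id] += reordered
    # Join on product_id to report readable product names.
    return {products[pid]: reorders[pid] / totals[pid] for pid in totals}

rates = reorder_rate_by_product(order_products, products)
```

A per-product reorder rate like this is one plausible input feature for a model such as XGBoost; whether the repository's scripts use exactly this feature is not stated here.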
There are 134 aisles in total. Each aisle groups products of a similar type. Our visualization, generated with Tableau, shows that the fresh fruits aisle contains a very large number of products. We therefore assume that customers have many options to choose from in this aisle, which might also increase the reorder rates.
The bar chart below shows the products that are contained in more than 64,000 baskets. Because customers order these products so often, we consider them “frequent”. The chart is dominated by fruits and vegetables, which supports our assumption that “fresh fruits” is probably the aisle customers buy from the most. This is also related to the large number of products in that aisle. Banana in particular stands out as a product in very high demand.
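The frequency filter behind such a chart can be sketched in a few lines of Python. This is a minimal example on hypothetical baskets with a toy threshold in place of 64,000; the helper name `frequent_products` is made up for illustration.

```python
from collections import Counter

def frequent_products(baskets, min_baskets):
    """Return (product, basket_count) pairs with basket_count >= min_baskets,
    sorted from most to least frequent."""
    counts = Counter()
    for basket in baskets:
        counts.update(set(basket))  # count each product once per basket
    return [(p, c) for p, c in counts.most_common() if c >= min_baskets]

# Hypothetical baskets; the real analysis ran on the full order tables.
toy_baskets = [
    ["Banana", "Strawberries"],
    ["Banana", "Banana", "Spinach"],
    ["Banana", "Strawberries"],
]
top = frequent_products(toy_baskets, min_baskets=2)
```

Converting each basket to a set means a product that appears twice in one order still counts as one basket, which matches the "contained in N baskets" reading of the chart.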
Our exploratory analysis showed that 262,464 users have reordered products whose names contain the word “Organic”, and that there are 5,035 organic products in total. This result does not surprise us, as people's interest in organic food has increased in the last few years.
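Counts like these can be obtained with a simple name filter. The sketch below assumes a product lookup and purchase rows of the form `(user_id, product_id, reordered)`; both the data and the layout are hypothetical and only illustrate the filtering logic.

```python
# Hypothetical product lookup and purchase rows.
products = {1: "Organic Banana", 2: "Whole Milk", 3: "Organic Baby Spinach"}
purchases = [
    # (user_id, product_id, reordered)
    (100, 1, 1), (100, 2, 1),
    (101, 3, 0),
    (102, 3, 1), (102, 1, 1),
]

# Products whose name contains the word "Organic".
organic_ids = {pid for pid, name in products.items() if "Organic" in name}

# Distinct users with at least one reordered organic product.
organic_reorder_users = {
    user_id
    for user_id, product_id, reordered in purchases
    if reordered == 1 and product_id in organic_ids
}
```

Using a set for the users ensures that each user is counted once, no matter how many organic products they reordered.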
There are 21 departments in total. Each department contains a different number of aisles, depending on the type of product. The “produce” department has the largest number of products. We make the same assumption as for the aisle visualization: people will likely buy more products from the produce department than from other departments, because the more products a department offers, the more likely a purchase falls into it, so the dataset will contain more observations for the produce department. From a business perspective, the department probably carries more products because these kinds of products are in high demand among customers.
Steffen Rendle, Christoph Freudenthaler, Lars Schmidt-Thieme (2010): Factorizing Personalized Markov Chains for Next-Basket Recommendation
Shengxian Wan, Yanyan Lan, Pengfei Wang, Jiafeng Guo, Jun Xu, Xueqi Cheng (2015): Next Basket Recommendation with Neural Networks
Jakob Aungiers (2016): LSTM Neural Network for Time Series Prediction. URL: http://www.jakob-aungiers.com/articles/a/LSTM-Neural-Network-for-Time-Series-Prediction