Northwind Database Hypothesis Testing

Matt Sparr

Introduction

In this project we will be working with the Northwind database, a sample database provided by Microsoft that contains sales data for a fictitious specialty foods company. We will pose several questions, developing and testing hypotheses for each in an effort to surface information important to the company.

First we will import all the necessary libraries.

import sqlite3 as sql
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.graphics.factorplots import interaction_plot
from statsmodels.stats.multicomp import MultiComparison

Next we will connect to the database and grab all the table names.

conn = sql.connect('Northwind_small.sqlite')
cur = conn.cursor()
table_names = list(cur.execute('''SELECT name FROM sqlite_master WHERE TYPE = 'table'
                               ''').fetchall())
table_names
[('Employee',),
 ('Category',),
 ('Customer',),
 ('Shipper',),
 ('Supplier',),
 ('Order',),
 ('Product',),
 ('OrderDetail',),
 ('CustomerCustomerDemo',),
 ('CustomerDemographic',),
 ('Region',),
 ('Territory',),
 ('EmployeeTerritory',)]

Now we will store each table from the database as a Pandas DataFrame so that they are easy to work with and analyze. There was a technical error reading the 'Order' table (most likely because Order is a reserved SQL keyword), so a CSV file of that table was generated using an outside program.

sql_comm = ["SELECT * FROM " + name[0] for name in table_names]

employees = pd.read_sql_query(sql_comm[0], conn)
categories = pd.read_sql_query(sql_comm[1], conn)
customers = pd.read_sql_query(sql_comm[2], conn)
shippers = pd.read_sql_query(sql_comm[3], conn)

suppliers = pd.read_sql_query(sql_comm[4], conn)
products = pd.read_sql_query(sql_comm[6], conn)
order_details = pd.read_sql_query(sql_comm[7], conn)
orders = pd.read_csv('Order.csv')

cust_cust_demos = pd.read_sql_query(sql_comm[8], conn)
cust_demos = pd.read_sql_query(sql_comm[9], conn)
regions = pd.read_sql_query(sql_comm[10], conn)
territories = pd.read_sql_query(sql_comm[11], conn)
emp_territories = pd.read_sql_query(sql_comm[12], conn)
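
As an aside, the reserved-keyword problem can usually be sidestepped by quoting the table name in the query itself, which would make the CSV workaround unnecessary. A minimal sketch, assuming the same open connection:

# Double-quoting Order makes SQLite parse it as an identifier, not a keyword
orders = pd.read_sql_query('SELECT * FROM "Order"', conn)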

And here is a schema of the database.

Q1 -- Does discount amount have a statistically significant effect on the quantity of a product in an order? If so, at what level(s) of discount?

H0: The average quantity of product ordered is the same for orders with and without a discount.

Ha: The average quantity of product ordered when a discount is given is higher or lower than for orders without a discount.

The above is our first question along with the null and alternative hypotheses. Testing them will involve a two-tailed test: if the null hypothesis is rejected and there is a relationship between discount and order quantity, the discount could turn out to either increase or decrease the quantity of product ordered.

To test our hypotheses we will be using the table 'order_details' and will be looking only at the columns 'Quantity' and 'Discount'. Below we can see that there are no null or missing values in either column. Also, 'Quantity' ranges from 1 to 130 and 'Discount' ranges from 0 to 0.25.

order_details[['Quantity','Discount']].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2155 entries, 0 to 2154
Data columns (total 2 columns):
Quantity    2155 non-null int64
Discount    2155 non-null float64
dtypes: float64(1), int64(1)
memory usage: 33.8 KB
order_details[['Quantity','Discount']].describe()
Quantity Discount
count 2155.000000 2155.000000
mean 23.812993 0.056167
std 19.022047 0.083450
min 1.000000 0.000000
25% 10.000000 0.000000
50% 20.000000 0.000000
75% 30.000000 0.100000
max 130.000000 0.250000
order_details['Discount_bin'] = order_details['Discount'].apply(lambda x: 1 if x > 0 else 0)
X = ['Discount', 'No Discount']
y = [order_details['Quantity'][order_details['Discount_bin'] == 1].mean(),
     order_details['Quantity'][order_details['Discount_bin'] == 0].mean()]
plt.bar(X,y, color=('navy','grey'))
plt.text(s=round(y[1],2), x=1, y=10, color="w", horizontalalignment='center',size=18)
plt.text(s=round(y[0],2), x=0, y=10, color="w", horizontalalignment='center',size=18)
plt.ylabel('Average Quantity')
plt.title('Discount vs. No Discount Average Quantity of Product Per Order')
Text(0.5,1,'Discount vs. No Discount Average Quantity of Product Per Order')

png

Now we will split our order details into two sets: discount and no discount. Running a t-test on the data below then gives us a small p-value of 1.14e-10 which is less than our alpha of 0.05.

no_disc = order_details[order_details['Discount'] == 0].copy()
disc = order_details[order_details['Discount'] > 0].copy()

p = stats.ttest_ind(no_disc.Quantity, disc.Quantity)[1]
p
1.1440924523215966e-10

Since the p-value for our t-test is so low, we can reject the null hypothesis in favor of the alternative hypothesis. From this result we can reasonably argue that having a discount does indeed have an impact on the quantity of product ordered.
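
As a quick robustness check (an addition, not part of the original analysis), the comparison can also be run as a Welch's t-test, which drops the equal-variance assumption that stats.ttest_ind makes by default; the conclusion should not hinge on that assumption:

# Welch's t-test: equal_var=False allows unequal group variances
stats.ttest_ind(no_disc.Quantity, disc.Quantity, equal_var=False)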

For the second part of the question we will need to look at the different discount amounts and see where the size of the discount affects the quantity ordered the most. Let's start by creating a dataframe with only orders with a 'Discount' and then viewing the value counts.

q1_data = order_details[order_details['Discount'] > 0].copy()
q1_data.Discount.value_counts()
0.05    185
0.10    173
0.20    161
0.15    157
0.25    154
0.03      3
0.02      2
0.04      1
0.06      1
0.01      1
Name: Discount, dtype: int64

Since the count of some 'Discount' values are so few compared to others, we will group those discount values together. By adjusting those values of 'Discount' (0.01, 0.02, 0.03, 0.04, and 0.06) to 0.05, we will have 5 buckets with similar value counts.

q1_data['Discount'] = q1_data['Discount'].apply(lambda x: 0.05 if (x > 0) & (x < 0.10) else x)
q1_data['Discount'].value_counts()
0.05    193
0.10    173
0.20    161
0.15    157
0.25    154
Name: Discount, dtype: int64

For this second part of the question, we will need another null and alternative hypothesis.

H0: The average quantity of product ordered is the same for each discount level.

Ha: The average quantity of product ordered differs across the discount levels.

Since we want to test for a difference of means across more than two groups, we can use a one-way analysis of variance (ANOVA) to test our hypotheses. With 5 distinct 'Discount' values, we will run the ANOVA with 'Discount' as a categorical variable.

X = []
y = []
for i in q1_data['Discount'].unique():
    X.append(i)
    y.append(q1_data['Quantity'][q1_data['Discount'] == i].mean())
ax = plt.bar(X,y,width=.03)
plt.ylim(0,32)
plt.ylabel('Average Quantity')
plt.xlabel('Discount')
plt.title('Discount Level vs. Average Quantity of Product Per Order')
rects = ax.patches
y_round = []
for i in y:
    y_round.append(round(i,2))
labels = y_round
for rect, label in zip(rects, labels):
    height = rect.get_height()
    plt.text(rect.get_x() + rect.get_width() / 2, height + 1, label,
            ha='center', va='bottom')

png

formula = 'Quantity ~ C(Discount)'
lm = ols(formula, q1_data).fit()
table = sm.stats.anova_lm(lm, typ=1)
table
df sum_sq mean_sq F PR(>F)
C(Discount) 4.0 1065.701253 266.425313 0.616376 0.650947
Residual 833.0 360060.198508 432.245136 NaN NaN

The p-value of our categorical 'Discount' variable is ~0.651 as shown in the above table. Since this value is much higher than our alpha of 0.05, we fail to reject the null hypothesis. This suggests that the average quantities of product ordered across the discount levels are not significantly different.
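
If we wanted pairwise detail anyway, a Tukey HSD across the five discount levels would provide it; a sketch only (given the ANOVA result, we would expect no pair to be flagged as significant):

mc = MultiComparison(q1_data['Quantity'], q1_data['Discount'])
print(mc.tukeyhsd())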

Our findings show that having a discount on a product will most likely increase the quantity of product ordered, but the actual amount of the discount is statistically insignificant.

Q2 -- Are there certain products that are better to discount than others in order to increase the quantity ordered?

As a follow-up to our initial question, we will investigate whether or not the category of product interacts with whether or not a discount is applied, in regards to the quantity of product ordered.

H0: There is no interaction between category of product and whether or not a discount is applied.

Ha: There is some interaction between category of product and whether or not a discount is applied.

To test our hypothesis we will need data from more than one table, so we will need to perform a couple of merges. We will first join the 'products' table to the 'order_details' table as 'prod_ord_details', and then merge that table with the 'categories' table.

prod_ord_details = pd.merge(products, 
                  order_details,
                  left_on='Id',
                  right_on='ProductId',
                  how='left')
prod_cat = pd.merge(prod_ord_details, 
                  categories,
                  left_on='CategoryId',
                  right_on='Id',
                  how='left')
prod_cat.head()
Id_x ProductName SupplierId CategoryId QuantityPerUnit UnitPrice_x UnitsInStock UnitsOnOrder ReorderLevel Discontinued Id_y OrderId ProductId UnitPrice_y Quantity Discount Discount_bin Id CategoryName Description
0 1 Chai 1 1 10 boxes x 20 bags 18.0 39 0 10 0 10285/1 10285 1 14.4 45 0.20 1 1 Beverages Soft drinks, coffees, teas, beers, and ales
1 1 Chai 1 1 10 boxes x 20 bags 18.0 39 0 10 0 10294/1 10294 1 14.4 18 0.00 0 1 Beverages Soft drinks, coffees, teas, beers, and ales
2 1 Chai 1 1 10 boxes x 20 bags 18.0 39 0 10 0 10317/1 10317 1 14.4 20 0.00 0 1 Beverages Soft drinks, coffees, teas, beers, and ales
3 1 Chai 1 1 10 boxes x 20 bags 18.0 39 0 10 0 10348/1 10348 1 14.4 15 0.15 1 1 Beverages Soft drinks, coffees, teas, beers, and ales
4 1 Chai 1 1 10 boxes x 20 bags 18.0 39 0 10 0 10354/1 10354 1 14.4 12 0.00 0 1 Beverages Soft drinks, coffees, teas, beers, and ales

Viewing the head of the merged table, we can see lots of variables we don't need. We will create a new DataFrame with only the columns we need: 'Quantity', 'Discount', and 'CategoryName'.

q2_data = prod_cat[['Quantity', 'Discount', 'CategoryName']].copy()
q2_data.head()
Quantity Discount CategoryName
0 45 0.20 Beverages
1 18 0.00 Beverages
2 20 0.00 Beverages
3 15 0.15 Beverages
4 12 0.00 Beverages

Since we are concerned with the interaction between category and whether or not a discount is applied, we will need to modify our 'Discount' column to make it binary. We will do this by making all values of discount greater than zero equal to 1 and leaving no discount instances as 0.

q2_data['Discount'] = q2_data['Discount'].apply(lambda x: 1 if x > 0 else 0)

Since we want to test the impact of a variable ('Discount') on the dependent variable ('Quantity') across multiple groups ('CategoryName'), we will use a two-way analysis of variance (ANOVA) with an interaction term to test our hypothesis.

formula = 'Quantity ~ C(CategoryName)*C(Discount)'
lm = ols(formula, q2_data).fit()
table = sm.stats.anova_lm(lm, typ=2)
table
sum_sq df F PR(>F)
C(CategoryName) 1118.470015 7.0 0.449689 8.708672e-01
C(Discount) 14729.606130 1.0 41.454988 1.484744e-10
C(CategoryName):C(Discount) 3357.463105 7.0 1.349892 2.226259e-01
Residual 760020.189876 2139.0 NaN NaN

From the above table we can see that the interaction of 'CategoryName' and 'Discount' has a p-value of ~0.223. Since this is greater than our alpha of 0.05, we fail to reject the null hypothesis. From these results, it appears there is no significant difference in quantity of product ordered across the different combinations of category and discount status.

We can also view an interaction plot of our variables to show this lack of interaction.

plt.figure(figsize=(12,10))
fig = interaction_plot(q2_data['Discount'],q2_data['CategoryName'],q2_data['Quantity'], ms=10, ax=plt.gca())
plt.xticks(np.arange(0,1.5,step=1), ('No Discount', 'Discount'))
plt.xlabel('')
Text(0.5,0,'')

png

We can see from the above graph that the lines for each of the categories are mostly parallel, which implies little to no interaction with the other independent variable, 'Discount'. The 'Confections' and 'Grains/Cereals' categories do have slightly different slopes than the others, but according to our ANOVA results this difference is not significant.

Our findings indicate that there is not sufficient evidence that certain categories benefit more than others from having a discount applied. There may be slight differences between certain categories, but overall the differences are not statistically significant.

Q3 -- Is there an optimal discount level based on the price of the product?

For this question we are wondering whether or not there is an interaction between 'Discount' and 'UnitPrice'. From this we can learn whether a certain discount level can lead to a higher 'Quantity' of product ordered depending on the product's price.

H0: There is no interaction between the discount level and unit price in regards to quantity.

Ha: There is some interaction between discount level and unit price in regards to quantity.

To answer this question we will first need to create a relevant data table. Since we are only concerned with 'UnitPrice', 'Discount' and 'Quantity', we can simply grab those from the 'order_details' table.

q3_data = order_details[['UnitPrice', 'Discount', 'Quantity']].copy()

From the first question, we know that there are a couple 'Discount' values with few instances.

q3_data['Discount'].value_counts()
0.00    1317
0.05     185
0.10     173
0.20     161
0.15     157
0.25     154
0.03       3
0.02       2
0.01       1
0.04       1
0.06       1
Name: Discount, dtype: int64

Since the values with low value counts are all close to 0.05, we will modify those few 'Discount' levels and set them equal to 0.05.

q3_data['Discount'] = q3_data['Discount'].apply(lambda x: 0.05 if (x > 0) & (x < 0.10) else x)
q3_data['Discount'].value_counts()
0.00    1317
0.05     193
0.10     173
0.20     161
0.15     157
0.25     154
Name: Discount, dtype: int64
colors = ['red','blue','green','yellow','purple','orange']
discounts = q3_data['Discount'].unique()
for i in range(0,len(discounts)):
    data = q3_data[q3_data['Discount'] == discounts[i]]
    sns.regplot(data['UnitPrice'], data['Quantity'], color=colors[i])
    plt.title(str(discounts[i]) + ' Discount')
    plt.show()

png

png

png

png

png

png

Since we are looking to analyze the interaction between a categorical variable ('Discount') and a continuous variable ('UnitPrice'), we will use analysis of covariance (ANCOVA) to test our hypotheses.

formula = 'Quantity ~ C(Discount)*UnitPrice'
lm = ols(formula, q3_data).fit()
lm.summary()
OLS Regression Results
Dep. Variable: Quantity R-squared: 0.023
Model: OLS Adj. R-squared: 0.018
Method: Least Squares F-statistic: 4.592
Date: Mon, 03 Dec 2018 Prob (F-statistic): 6.18e-07
Time: 14:58:14 Log-Likelihood: -9380.0
No. Observations: 2155 AIC: 1.878e+04
Df Residuals: 2143 BIC: 1.885e+04
Df Model: 11
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 21.6653 0.692 31.297 0.000 20.308 23.023
C(Discount)[T.0.05] 5.6568 1.839 3.076 0.002 2.051 9.263
C(Discount)[T.0.1] 4.8518 1.981 2.449 0.014 0.967 8.736
C(Discount)[T.0.15] 6.3929 2.720 2.350 0.019 1.058 11.728
C(Discount)[T.0.2] 3.2313 2.154 1.500 0.134 -0.992 7.455
C(Discount)[T.0.25] 4.5212 2.241 2.018 0.044 0.127 8.916
UnitPrice 0.0019 0.017 0.109 0.913 -0.032 0.036
C(Discount)[T.0.05]:UnitPrice -0.0149 0.038 -0.395 0.693 -0.089 0.059
C(Discount)[T.0.1]:UnitPrice -0.0528 0.050 -1.055 0.291 -0.151 0.045
C(Discount)[T.0.15]:UnitPrice 0.0122 0.096 0.128 0.898 -0.175 0.200
C(Discount)[T.0.2]:UnitPrice 0.0886 0.062 1.430 0.153 -0.033 0.210
C(Discount)[T.0.25]:UnitPrice 0.0709 0.056 1.270 0.204 -0.039 0.180
Omnibus: 787.916 Durbin-Watson: 1.636
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3092.282
Skew: 1.769 Prob(JB): 0.00
Kurtosis: 7.682 Cond. No. 281.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

From the above results we can see that the p-values for each of our interaction terms are all greater than our alpha value. We thus fail to reject our null hypothesis. This suggests that there is no significant interaction between 'Discount' level and 'UnitPrice'.

The low R-squared value of 0.023 for the model also shows what a poor fit a regression using only 'Discount' level and 'UnitPrice' is for predicting the 'Quantity' of product ordered.

Our findings suggest there is no optimal discount level given the price of a product.

Q4 -- Do some employees perform better in terms of average order price than others?

H0: The average amount spent per order is the same across all employees.

Ha: The average amount spent per order varies amongst employees.

To investigate this question we will need the variables of 'EmployeeId', 'Quantity', 'Discount', and 'UnitPrice'. To get these variables we will need to merge the 'orders' table with the 'order_details' table.

q4_data = pd.merge(orders, 
                  order_details,
                  left_on='Id',
                  right_on='OrderId',
                  how='left')
q4_data.head()
Id_x CustomerId EmployeeId OrderDate RequiredDate ShippedDate ShipVia Freight ShipName ShipAddress ... ShipRegion ShipPostalCode ShipCountry Id_y OrderId ProductId UnitPrice Quantity Discount Discount_bin
0 10248 VINET 5 2012-07-04 2012-08-01 2012-07-16 3 32.38 Vins et alcools Chevalier 59 rue de l'Abbaye ... Western Europe 51100 France 10248/11 10248 11 14.0 12 0.0 0
1 10248 VINET 5 2012-07-04 2012-08-01 2012-07-16 3 32.38 Vins et alcools Chevalier 59 rue de l'Abbaye ... Western Europe 51100 France 10248/42 10248 42 9.8 10 0.0 0
2 10248 VINET 5 2012-07-04 2012-08-01 2012-07-16 3 32.38 Vins et alcools Chevalier 59 rue de l'Abbaye ... Western Europe 51100 France 10248/72 10248 72 34.8 5 0.0 0
3 10249 TOMSP 6 2012-07-05 2012-08-16 2012-07-10 1 11.61 Toms Spezialitäten Luisenstr. 48 ... Western Europe 44087 Germany 10249/14 10249 14 18.6 9 0.0 0
4 10249 TOMSP 6 2012-07-05 2012-08-16 2012-07-10 1 11.61 Toms Spezialitäten Luisenstr. 48 ... Western Europe 44087 Germany 10249/51 10249 51 42.4 40 0.0 0

5 rows × 21 columns

Now we will need to create a new column named 'Sale', which will be the total dollar amount spent on each line of an order. We will calculate it from 'UnitPrice', 'Quantity', and 'Discount' using the formula below.

q4_data['Sale'] = q4_data['Quantity']*(q4_data['UnitPrice']*(1 - q4_data['Discount']))

We can now initially view the average sale amount for each employee.

# Group by employee so each bar is drawn once, rather than once per order row
means = q4_data.groupby('EmployeeId')['Sale'].mean()
X = means.index
y = list(means.values)
ax = plt.bar(X, y)
plt.xlabel('Employee Id')
plt.ylabel('Sale Amount')
plt.title('Average Sale Amount Across Employees')
plt.xticks(np.arange(0, 10, step=1))

rects = ax.patches
y_round = []
for i in y:
    y_round.append(round(i,2))
labels = y_round
for rect, label in zip(rects, labels):
    height = rect.get_height()
    plt.text(rect.get_x() + rect.get_width() / 2, height + 1, label,
            ha='center', va='bottom', size=15)
plt.show()

png

From the above bar graph, we can see that the average sale amount does vary across the employees. To get a better idea of whether this variance is significant or not we can use an ANOVA.

formula = 'Sale ~ C(EmployeeId)'
lm = ols(formula, q4_data).fit()
table = sm.stats.anova_lm(lm, typ=1)
table
df sum_sq mean_sq F PR(>F)
C(EmployeeId) 8.0 1.472611e+07 1.840763e+06 1.966553 0.046929
Residual 2146.0 2.008732e+09 9.360354e+05 NaN NaN

From the above table we can see that we got a p-value of ~0.0469, which is below our alpha of 0.05. This suggests that there is a statistically significant difference in sale amount across the employees. To investigate where this difference exists, we can run a Tukey HSD to determine whether any one employee performs better than another.

mc = MultiComparison(q4_data['Sale'], q4_data['EmployeeId'])
result = mc.tukeyhsd()
print(result)
Multiple Comparison of Means - Tukey HSD,FWER=0.05
=================================================
group1 group2  meandiff   lower    upper   reject
-------------------------------------------------
  1      2     134.1944 -117.9953 386.3841 False 
  1      3     74.9821  -157.9727 307.9368 False 
  1      4     -2.3316  -220.6011 215.9379 False 
  1      5     31.1346  -290.2426 352.5118 False 
  1      6    -116.8745 -399.4868 165.7378 False 
  1      7     150.9404 -127.3188 429.1996 False 
  1      8     -68.9018 -315.6068 177.8032 False 
  1      9     165.6717 -166.7309 498.0742 False 
  2      3     -59.2123 -315.2499 196.8252 False 
  2      4     -136.526 -379.2787 106.2267 False 
  2      5    -103.0598 -441.5426 235.423  False 
  2      6    -251.0689 -552.9911 50.8533  False 
  2      7      16.746  -281.1054 314.5975 False 
  2      8    -203.0962  -471.705 65.5126  False 
  2      9     31.4773  -317.4909 380.4454 False 
  3      4     -77.3137 -300.0179 145.3906 False 
  3      5     -43.8475  -368.253 280.558  False 
  3      6    -191.8566 -477.9078 94.1946  False 
  3      7     75.9584  -205.7929 357.7096 False 
  3      8    -143.8839  -394.521 106.7533 False 
  3      9     90.6896  -244.6417 426.0208 False 
  4      5     33.4662  -280.5602 347.4926 False 
  4      6    -114.5429 -388.7672 159.6814 False 
  4      7     153.272  -116.4638 423.0079 False 
  4      8     -66.5702  -303.62  170.4797 False 
  4      9     168.0033 -157.2977 493.3042 False 
  5      6    -148.0091 -509.7282  213.71  False 
  5      7     119.8058 -238.5225 478.1341 False 
  5      8    -100.0364 -434.4528 234.3801 False 
  5      9     134.5371 -267.2868 536.361  False 
  6      7     267.815   -56.1998 591.8297 False 
  6      8     47.9727  -249.3836 345.329  False 
  6      9     282.5462  -89.0031 654.0954 False 
  7      8    -219.8422 -513.0644 73.3799  False 
  7      9     14.7312  -353.5178 382.9802 False 
  8      9     234.5735 -110.4519 579.5988 False 
-------------------------------------------------

Keeping our family-wise error rate at 0.05, we can see that out of all possible comparisons of employees, none of them are significant enough to allow us to reject the null hypothesis and say that one employee performs better than another.
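
To put this borderline result in perspective, we can compute eta-squared (the share of variance in 'Sale' explained by 'EmployeeId') directly from the sums of squares in the ANOVA table above; this is an added check, not part of the original analysis:

# Eta-squared = SS_effect / SS_total, using the ANOVA table values
ss_effect = 1.472611e+07
ss_resid = 2.008732e+09
eta_sq = ss_effect / (ss_effect + ss_resid)
eta_sq  # ~0.007, i.e. under 1% of the variance in Sale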

Our findings suggest there is some significant variance in average sale amount across employees but we cannot definitively say any one employee performs better than another in a statistically significant way.

Q5 -- Are some employees better than others at selling a certain category of product?

As a follow-up to Q4, we can investigate whether or not some employees have significantly better sale amounts for a particular category of product when compared to other employees. To answer this question we will use the metric of sale price and test whether there is an interaction between the 'EmployeeId' and 'CategoryName' of each order.

H0: There is no interaction between employee and category of product in regards to sale price of orders.

Ha: There is some interaction between employee and category of product in regards to sale price of orders.

For this question we will need the variables 'EmployeeId', 'CategoryName', 'Quantity', 'UnitPrice', and 'Discount'. To get these variables into one table we will merge the tables 'orders', 'order_details', 'products', and 'categories'. We will then need to create a column 'Sale' as in the previous question.

q5_data = pd.merge(orders, 
                  order_details,
                  left_on='Id',
                  right_on='OrderId',
                  how='left')
q5_data = pd.merge(q5_data, 
                  products,
                  left_on='ProductId',
                  right_on='Id',
                  how='left')
q5_data = pd.merge(q5_data, 
                  categories,
                  left_on='CategoryId',
                  right_on='Id',
                  how='left')
q5_data = q5_data.rename(index=str, columns={"UnitPrice_x": "UnitPrice"})
q5_data['Sale'] = q5_data['Quantity']*(q5_data['UnitPrice']*(1 - q5_data['Discount']))
q5_data = q5_data[['EmployeeId', 'CategoryName', 'Sale']]
q5_data.head()
EmployeeId CategoryName Sale
0 5 Dairy Products 168.0
1 5 Grains/Cereals 98.0
2 5 Dairy Products 174.0
3 6 Produce 167.4
4 6 Produce 1696.0

Since we are concerned with the interaction of two categorical variables, 'EmployeeId' and 'CategoryName', we can use a two-way ANOVA to see if that interaction is significant.

formula = 'Sale ~ C(EmployeeId)*C(CategoryName)'
lm = ols(formula, q5_data).fit()
table = sm.stats.anova_lm(lm, typ=2)
table
sum_sq df F PR(>F)
C(EmployeeId) 1.554216e+07 8.0 2.089979 3.365327e-02
C(CategoryName) 4.721500e+07 7.0 7.256091 1.293990e-08
C(EmployeeId):C(CategoryName) 2.523537e+07 56.0 0.484777 9.995606e-01
Residual 1.936282e+09 2083.0 NaN NaN
q5_data.pivot_table(values='Sale', index='CategoryName', columns='EmployeeId').plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x2cf936dfda0>

png

From the above table we can see that the interaction of 'EmployeeId' and 'CategoryName' has a p-value of ~0.9996. This is much higher than our alpha of 0.05, so we fail to reject the null hypothesis. This result suggests that there is no significant interaction between 'EmployeeId' and 'CategoryName' in regards to sale amount.

Although our ANOVA gave us insignificant results, we can still do further testing just to be sure there is no difference amongst employees' sale performance for each category. To do this we can run a Tukey HSD for each category, comparing all the employees against each other.

results = []
for i in q5_data['CategoryName'].unique():
    mc = MultiComparison(q5_data['Sale'][q5_data['CategoryName'] == i], q5_data['EmployeeId'][q5_data['CategoryName'] == i])
    result = mc.tukeyhsd()
    results.append(result)
    print (i)
    print(result)
Dairy Products
Multiple Comparison of Means - Tukey HSD,FWER=0.05
=================================================
group1 group2  meandiff   lower    upper   reject
-------------------------------------------------
  1      2      107.26  -339.3278 553.8478 False 
  1      3     79.5422  -322.5645 481.6489 False 
  1      4     67.0929  -328.7055 462.8913 False 
  1      5     153.4656 -315.7281 622.6593 False 
  1      6    -117.3012 -552.6946 318.0921 False 
  1      7     336.828  -132.3657 806.0218 False 
  1      8     66.4318  -388.5307 521.3942 False 
  1      9     289.8453  -216.036 795.7266 False 
  2      3     -27.7179 -495.6441 440.2083 False 
  2      4     -40.1671 -502.6836 422.3493 False 
  2      5     46.2056  -480.4896 572.9008 False 
  2      6    -224.5613 -721.3837 272.2611 False 
  2      7     229.568  -297.1272 756.2633 False 
  2      8     -40.8283 -554.8866 473.2301 False 
  2      9     182.5853 -377.0405 742.2111 False 
  3      4     -12.4493 -432.1761 407.2776 False 
  3      5     73.9235  -415.6243 563.4712 False 
  3      6    -196.8434 -654.0979 260.4111 False 
  3      7     257.2859 -232.2619 746.8336 False 
  3      8     -13.1104 -489.0359 462.8151 False 
  3      9     210.3031 -314.5113 735.1176 False 
  4      5     86.3727  -398.0068 570.7523 False 
  4      6    -184.3941 -636.1111 267.3228 False 
  4      7     269.7351 -214.6444 754.1147 False 
  4      8     -0.6611  -471.2689 469.9466 False 
  4      9     222.7524 -297.2445 742.7493 False 
  5      6    -270.7669 -788.0043 246.4706 False 
  5      7     183.3624  -362.632 729.3569 False 
  5      8     -87.0339 -620.8485 446.7808 False 
  5      9     136.3797 -441.4465 714.2059 False 
  6      7     454.1293  -63.1082 971.3667 False 
  6      8     183.733  -320.6306 688.0966 False 
  6      9     407.1465 -143.5873 957.8804 False 
  7      8    -270.3963 -804.2109 263.4183 False 
  7      9     -46.9827  -624.809 530.8435 False 
  8      9     223.4135 -342.9178 789.7449 False 
-------------------------------------------------
Grains/Cereals
 Multiple Comparison of Means - Tukey HSD,FWER=0.05
===================================================
group1 group2  meandiff   lower      upper   reject
---------------------------------------------------
  1      2     10.6746  -477.0916   498.4409 False 
  1      3     170.7809 -287.2409   628.8027 False 
  1      4     87.7228  -353.7781   529.2236 False 
  1      5     44.3686  -623.5328   712.2701 False 
  1      6     381.0867  -225.565   987.7383 False 
  1      7     32.5617  -534.1715   599.2948 False 
  1      8     22.7097  -469.1432   514.5627 False 
  1      9     11.9617  -1022.7468 1046.6701 False 
  2      3     160.1062 -264.2103   584.4227 False 
  2      4     77.0481  -329.3796   483.4758 False 
  2      5      33.694  -611.5601   678.9481 False 
  2      6     370.412  -211.2122   952.0362 False 
  2      7      21.887  -517.9713   561.7454 False 
  2      8     12.0351  -448.5947   472.6649 False 
  2      9      1.287   -1018.9493 1021.5234 False 
  3      4     -83.0581 -453.2628   287.1466 False 
  3      5    -126.4122  -749.486   496.6615 False 
  3      6     210.3058 -346.6096   767.2212 False 
  3      7    -138.2192 -651.3615   374.9231 False 
  3      8    -148.0711 -577.0791   280.9369 False 
  3      9    -158.8192 -1165.1741  847.5357 False 
  4      5     -43.3541  -654.386   567.6777 False 
  4      6     293.3639 -250.0455   836.7733 False 
  4      7     -55.1611 -553.6127   443.2906 False 
  4      8     -65.013  -476.3363   346.3103 False 
  4      9     -75.7611 -1074.7051  923.1829 False 
  5      6     336.7181 -402.5134  1075.9495 False 
  5      7     -11.8069 -718.6474   695.0335 False 
  5      8     -21.6589 -670.0077    626.69  False 
  5      9     -32.4069 -1150.0198 1085.2059 False 
  6      7     -348.525 -997.7994   300.7494 False 
  6      8    -358.3769 -943.4326   226.6787 False 
  6      9     -369.125 -1451.249   712.999  False 
  7      8     -9.8519  -553.4054   533.7016 False 
  7      9      -20.6   -1080.8607 1039.6607 False 
  8      9     -10.7481 -1032.9445 1011.4483 False 
---------------------------------------------------
Produce
 Multiple Comparison of Means - Tukey HSD,FWER=0.05 
====================================================
group1 group2  meandiff    lower      upper   reject
----------------------------------------------------
  1      2     77.5786   -768.6926   923.8498 False 
  1      3     93.5953   -691.2044   878.3949 False 
  1      4     -16.3322  -691.2302   658.5658 False 
  1      5      7.1073   -896.4631   910.6777 False 
  1      6     -95.3366  -824.3614   633.6883 False 
  1      7     491.0253  -448.8058  1430.8564 False 
  1      8     47.2376   -721.4264   815.9016 False 
  1      9    -546.3885  -2341.6008 1248.8239 False 
  2      3     16.0167   -933.9204   965.9538 False 
  2      4     -93.9108  -955.2803   767.4586 False 
  2      5     -70.4713  -1120.6663  979.7236 False 
  2      6    -172.9152  -1077.3195  731.4892 False 
  2      7     413.4467  -668.1043  1494.9976 False 
  2      8     -30.341   -966.9916   906.3096 False 
  2      9    -623.9671  -2497.2683 1249.3341 False 
  3      4    -109.9275  -910.9848   691.1298 False 
  3      5     -86.488   -1087.8096  914.8336 False 
  3      6    -188.9318  -1036.0934  658.2297 False 
  3      7      397.43   -636.7305  1431.5905 False 
  3      8     -46.3577   -927.862   835.1467 False 
  3      9    -639.9837  -2486.3296 1206.3621 False 
  4      5     23.4395   -894.2869   941.1659 False 
  4      6     -79.0043  -825.5025   667.4938 False 
  4      7     507.3575  -446.0914  1460.8064 False 
  4      8     63.5698   -721.6859   848.8256 False 
  4      9    -530.0562  -2332.4352 1272.3227 False 
  5      6    -102.4438  -1060.6775  855.7898 False 
  5      7     483.918   -643.0323  1610.8683 False 
  5      8     40.1303   -948.5955  1028.8562 False 
  5      9    -553.4957  -2453.3699 1346.3784 False 
  6      7     586.3618  -406.1374  1578.8611 False 
  6      8     142.5742  -689.6617   974.8101 False 
  6      9    -451.0519  -2274.3894 1372.2856 False 
  7      8    -443.7877  -1465.7572  578.1819 False 
  7      9    -1037.4138 -2954.7987  879.9712 False 
  8      9    -593.6261  -2433.1714 1245.9192 False 
----------------------------------------------------
Seafood
Multiple Comparison of Means - Tukey HSD,FWER=0.05
=================================================
group1 group2  meandiff   lower    upper   reject
-------------------------------------------------
  1      2     28.2095  -254.2287 310.6478 False 
  1      3     54.3343  -197.1757 305.8444 False 
  1      4     -13.3399 -250.1463 223.4665 False 
  1      5     69.4645   -353.449 492.378  False 
  1      6    -126.3323 -465.6826 213.0181 False 
  1      7     -98.5019 -426.7906 229.7869 False 
  1      8     -83.7759 -363.8337 196.282  False 
  1      9     19.6668  -332.6096 371.9431 False 
  2      3     26.1248  -261.2258 313.4754 False 
  2      4     -41.5495 -316.1225 233.0236 False 
  2      5      41.255  -403.9066 486.4166 False 
  2      6    -154.5418 -521.2457 212.1621 False 
  2      7    -126.7114 -483.2035 229.7806 False 
  2      8    -111.9854 -424.6281 200.6573 False 
  2      9     -8.5428  -387.2401 370.1546 False 
  3      4     -67.6743 -310.3186  174.97  False 
  3      5     15.1302  -411.0796  441.34  False 
  3      6    -180.6666 -524.1163 162.7831 False 
  3      7    -152.8362 -485.3607 179.6882 False 
  3      8    -138.1102 -423.1214 146.901  False 
  3      9     -34.6676 -390.8945 321.5594 False 
  4      5     82.8045  -334.8973 500.5062 False 
  4      6    -112.9923 -445.8251 219.8404 False 
  4      7     -85.162  -406.7089 236.385  False 
  4      8     -70.4359 -342.5597 201.6878 False 
  4      9     33.0067  -312.9956 379.009  False 
  5      6    -195.7968 -679.0733 287.4797 False 
  5      7    -167.9664 -643.5408 307.608  False 
  5      8    -153.2404 -596.8956 290.4148 False 
  5      9     -49.7978 -542.2367 442.6412 False 
  6      7     27.8304  -375.2501 430.9109 False 
  6      8     42.5564  -322.3173  407.43  False 
  6      9     145.999  -276.8473 568.8453 False 
  7      8      14.726  -339.8831 369.3351 False 
  7      9     118.1686 -295.8529 532.1902 False 
  8      9     103.4426 -273.4827 480.368  False 
-------------------------------------------------
Condiments
Multiple Comparison of Means - Tukey HSD,FWER=0.05
=================================================
group1 group2  meandiff   lower    upper   reject
-------------------------------------------------
  1      2     181.3082 -212.4825 575.0989 False 
  1      3     40.4459  -337.1582  418.05  False 
  1      4     80.6385  -257.4187 418.6958 False 
  1      5    -103.0576 -677.4081 471.293  False 
  1      6    -127.5721 -583.1084 327.9642 False 
  1      7      5.099   -410.2946 420.4926 False 
  1      8     50.4522  -320.4795 421.384  False 
  1      9     237.5666 -217.9698 693.1029 False 
  2      3    -140.8623 -543.7533 262.0287 False 
  2      4    -100.6696 -466.7557 265.4165 False 
  2      5    -284.3657 -875.6481 306.9167 False 
  2      6    -308.8802 -785.5874 167.8269 False 
  2      7    -176.2092 -614.7159 262.2976 False 
  2      8    -130.8559 -527.5002 265.7884 False 
  2      9     56.2584  -420.4487 532.9655 False 
  3      4     40.1927  -308.4229 388.8082 False 
  3      5    -143.5034 -724.1312 437.1243 False 
  3      6     -168.018 -631.4437 295.4078 False 
  3      7     -35.3469 -459.3775 388.6837 False 
  3      8     10.0064  -370.5727 390.5855 False 
  3      9     197.1207  -266.305 660.5465 False 
  4      5    -183.6961 -739.4177 372.0255 False 
  4      6    -208.2106 -640.0222 223.601  False 
  4      7     -75.5395 -464.7693 313.6902 False 
  4      8     -30.1863 -371.5633 311.1908 False 
  4      9     156.9281 -274.8836 588.7397 False 
  5      6     -24.5145 -658.5943 609.5652 False 
  5      7     108.1566 -497.7276 714.0407 False 
  5      8     153.5098  -422.801 729.8206 False 
  5      9     340.6241 -293.4556 974.7039 False 
  6      7     132.6711 -362.0313 627.3734 False 
  6      8     178.0243  -279.981 636.0297 False 
  6      9     365.1387 -163.7204 893.9977 False 
  7      8     45.3532  -372.7465 463.453  False 
  7      9     232.4676 -262.2348 727.1699 False 
  8      9     187.1143  -270.891 645.1197 False 
-------------------------------------------------
Confections
Multiple Comparison of Means - Tukey HSD,FWER=0.05
==================================================
group1 group2  meandiff   lower     upper   reject
--------------------------------------------------
  1      2     102.8602 -311.4633  517.1837 False 
  1      3     136.4252 -233.8828  506.7332 False 
  1      4    -119.0505 -462.7078  224.6068 False 
  1      5    -189.5058 -748.5375  369.5258 False 
  1      6    -211.9147  -688.129  264.2997 False 
  1      7      27.581  -422.9348  478.0968 False 
  1      8      6.5052  -385.9962  399.0067 False 
  1      9     109.3148 -482.6634  701.2931 False 
  2      3      33.565  -386.8429  453.9728 False 
  2      4    -221.9108 -619.0444  175.2228 False 
  2      5    -292.3661 -885.7725  301.0403 False 
  2      6    -314.7749 -830.9095  201.3597 False 
  2      7     -75.2792 -567.8026  417.2442 False 
  2      8     -96.355   -536.437  343.727  False 
  2      9      6.4546  -618.0876  630.9968 False 
  3      4    -255.4757 -606.4446  95.4931  False 
  3      5     -325.931 -889.4869  237.6249 False 
  3      6    -348.3399 -829.8572  133.1775 False 
  3      7    -108.8442 -564.9618  347.2735 False 
  3      8    -129.9199 -528.8387  268.9988 False 
  3      9     -27.1104 -623.3629  569.1421 False 
  4      5     -70.4553 -616.8687  475.9581 False 
  4      6     -92.8641 -554.2005  368.4723 False 
  4      7     146.6316 -288.1277  581.3909 False 
  4      8     125.5558  -248.755  499.8665 False 
  4      9     228.3654 -351.7117  808.4424 False 
  5      6     -22.4088 -660.5682  615.7505 False 
  5      7     217.0869 -402.1318  836.3055 False 
  5      8     196.0111 -382.3699  774.3921 False 
  5      9     298.8207 -429.8119 1027.4533 False 
  6      7     239.4957  -306.119  785.1104 False 
  6      8     218.4199  -280.367  717.2068 False 
  6      9     321.2295 -345.9805  988.4395 False 
  7      8     -21.0758 -495.3887  453.2371 False 
  7      9     81.7338  -567.3837  730.8513 False 
  8      9     102.8096 -507.4743  713.0935 False 
--------------------------------------------------
Beverages
 Multiple Comparison of Means - Tukey HSD,FWER=0.05
===================================================
group1 group2  meandiff   lower      upper   reject
---------------------------------------------------
  1      2     117.7496 -923.6376  1159.1369 False 
  1      3     -66.2209 -1018.8112  886.3694 False 
  1      4    -105.8798 -1020.5407  808.7811 False 
  1      5    -252.8214 -1591.7479 1086.1051 False 
  1      6     -450.787 -1645.1047  743.5307 False 
  1      7      0.1171  -1113.1727 1113.4069 False 
  1      8    -425.7177 -1431.4914  580.056  False 
  1      9     41.7839  -1233.6498 1317.2176 False 
  2      3    -183.9706 -1214.6766  846.7355 False 
  2      4    -223.6294 -1219.3861  772.1272 False 
  2      5     -370.571 -1766.1537 1025.0116 False 
  2      6    -568.5366 -1826.0427  688.9695 False 
  2      7    -117.6325 -1298.4547 1063.1897 False 
  2      8    -543.4673 -1623.5171  536.5824 False 
  2      9     -75.9658 -1410.7533 1258.8218 False 
  3      4     -39.6589  -942.14    862.8223 False 
  3      5    -186.6005 -1517.2363 1144.0353 False 
  3      6     -384.566 -1569.5818  800.4497 False 
  3      7     66.3381  -1036.9669  1169.643 False 
  3      8    -359.4968 -1354.2069  635.2134 False 
  3      9     108.0048 -1158.7228 1374.7324 False 
  4      5    -146.9416 -1450.6931 1156.8099 False 
  4      6    -344.9072 -1499.6533  809.839  False 
  4      7     105.9969 -964.7309  1176.7248 False 
  4      8    -319.8379 -1278.287   638.6112 False 
  4      9     147.6637 -1090.7931 1386.1204 False 
  5      6    -197.9656 -1711.0942 1315.1631 False 
  5      7     252.9385 -1197.0882 1702.9653 False 
  5      8    -172.8963 -1542.1093 1196.3167 False 
  5      9     294.6053 -1283.3347 1872.5452 False 
  6      7     450.9041 -866.7637  1768.5718 False 
  6      8     25.0693  -1203.1061 1253.2447 False 
  6      9     492.5708 -964.6749  1949.8166 False 
  7      8    -425.8348 -1575.3714  723.7017 False 
  7      9     41.6667  -1349.9455  1433.279 False 
  8      9     467.5016 -839.6905  1774.6937 False 
---------------------------------------------------
Meat/Poultry
 Multiple Comparison of Means - Tukey HSD,FWER=0.05
===================================================
group1 group2  meandiff   lower      upper   reject
---------------------------------------------------
  1      2     547.4458 -600.0197  1694.9112 False 
  1      3     374.7761 -837.8041  1587.3564 False 
  1      4     280.3793 -792.3319  1353.0905 False 
  1      5     442.843  -1039.3206 1925.0065 False 
  1      6     298.8298 -1233.934  1831.5936 False 
  1      7     810.2421 -527.6638  2148.1481 False 
  1      8     81.5976  -1089.0701 1252.2653 False 
  1      9     844.5711 -1017.7151 2706.8574 False 
  2      3    -172.6696 -1374.5571 1029.2179 False 
  2      4    -267.0664 -1327.6758  793.5429 False 
  2      5    -104.6028 -1578.0313 1368.8257 False 
  2      6    -248.6159 -1772.9347 1275.7028 False 
  2      7     262.7964 -1065.4261 1591.0189 False 
  2      8    -465.8482 -1625.4367  693.7404 False 
  2      9     297.1254 -1558.2163 2152.4671 False 
  3      4     -94.3968 -1225.1337 1036.3401 False 
  3      5     68.0668  -1456.6185 1592.7521 False 
  3      6     -75.9463 -1649.8654 1497.9728 False 
  3      7     435.466  -949.3982  1820.3302 False 
  3      8    -293.1785 -1517.2371   930.88  False 
  3      9     469.795  -1426.5084 2366.0984 False 
  4      5     162.4636 -1253.525  1578.4523 False 
  4      6     18.4505  -1450.4198 1487.3208 False 
  4      7     529.8628 -734.3394  1794.0651 False 
  4      8    -198.7817 -1284.4511  886.8876 False 
  4      9     564.1918 -1245.8704  2374.254 False 
  5      6    -144.0131 -1933.8956 1645.8694 False 
  5      7     367.3992 -1258.7333 1993.5316 False 
  5      8    -361.2454 -1852.8141 1130.3234 False 
  5      9     401.7282 -1677.3159 2480.7723 False 
  6      7     511.4123 -1160.9701 2183.7947 False 
  6      8    -217.2322 -1759.0926 1324.6282 False 
  6      9     545.7413 -1569.6737 2661.1563 False 
  7      8    -728.6445 -2076.9623  619.6733 False 
  7      9      34.329  -1944.4605 2013.1185 False 
  8      9     762.9735 -1106.8068 2632.7539 False 
---------------------------------------------------

As expected, none of our comparisons produced results that allow us to reject the null hypothesis.

Our findings suggest that there are no cases where an employee performs significantly better than other employees for a particular category of product.

Q6 -- For which categories of product do customers save the most money?

For this question we are wondering whether or not certain categories of product save the customer significantly more money than other categories.

H0: The average amount saved on an order is the same for all categories.

Ha: The average amount saved on an order varies across categories.

To investigate these hypotheses we will need the variables 'CategoryName', 'Discount', 'Quantity', and 'UnitPrice'. We will get these variables into one table by merging the tables 'order_details', 'products', and 'categories'.

q6_data = pd.merge(order_details, 
                  products,
                  left_on='ProductId',
                  right_on='Id',
                  how='left')
q6_data = pd.merge(q6_data, 
                  categories,
                  left_on='CategoryId',
                  right_on='Id',
                  how='left')
q6_data = q6_data.rename(index=str, columns={"UnitPrice_x": "UnitPrice"})
q6_data = q6_data[['UnitPrice', 'Discount', 'Quantity', 'CategoryName']]
q6_data.head()
UnitPrice Discount Quantity CategoryName
0 14.0 0.0 12 Dairy Products
1 9.8 0.0 10 Grains/Cereals
2 34.8 0.0 5 Dairy Products
3 18.6 0.0 9 Produce
4 42.4 0.0 40 Produce

Since we are concerned with the total dollar amount saved per order, we will need to calculate it using the formula below and save those values to a new column 'Saved'.

q6_data['Saved'] = q6_data['UnitPrice']*(q6_data['Discount'])*q6_data['Quantity']
q6_data.head()
UnitPrice Discount Quantity CategoryName Saved
0 14.0 0.0 12 Dairy Products 0.0
1 9.8 0.0 10 Grains/Cereals 0.0
2 34.8 0.0 5 Dairy Products 0.0
3 18.6 0.0 9 Produce 0.0
4 42.4 0.0 40 Produce 0.0

Now we can take an initial look at the average amount saved for each category.

cats = q6_data['CategoryName'].unique()
means = []
for i in cats:
    mean = q6_data['Saved'][q6_data['CategoryName'] == i].mean()
    means.append(mean)
    print(i + " - " + str(round(mean, 2)))
Dairy Products - 45.97
Grains/Cereals - 25.42
Produce - 38.85
Seafood - 31.4
Condiments - 35.41
Confections - 29.17
Beverages - 46.19
Meat/Poultry - 87.67
X = cats
y = []
for i in cats:
    y.append(q6_data['Saved'][q6_data['CategoryName'] == i].mean())
ax = plt.bar(X,y)
plt.xlabel('Category')
plt.ylabel('Amount Discounted')
plt.title('Average Discounted Amount Per Category')
rects = ax.patches
y_round = []
for i in y:
    y_round.append(round(i,2))
labels = y_round
for rect, label in zip(rects, labels):
    height = rect.get_height()
    plt.text(rect.get_x() + rect.get_width() / 2, height + 1, label,
            ha='center', va='bottom')
plt.rcParams["figure.figsize"] = (15,15)
plt.show()

png

From the above it would seem that 'Meat/Poultry' products have a considerably higher average savings amount than the other categories. To see if this difference is significant, we can run a one-way ANOVA using 'Saved' as our dependent variable and 'CategoryName' as a categorical independent variable.

formula = 'Saved ~ C(CategoryName)'
lm = ols(formula, q6_data).fit()
table = sm.stats.anova_lm(lm, typ=1)
table
df sum_sq mean_sq F PR(>F)
C(CategoryName) 7.0 5.287607e+05 75537.237064 4.251642 0.000113
Residual 2147.0 3.814489e+07 17766.601726 NaN NaN

The p-value of our ANOVA is 0.000113 which is much less than our alpha of 0.05. This allows us to reject the null hypothesis and state that there is a statistically significant difference in amount saved across the different categories of products. To view more specifically where that difference lies we can utilize a Tukey HSD to compare all categories of products.

mc = MultiComparison(q6_data['Saved'], q6_data['CategoryName'])
result = mc.tukeyhsd()
print(result)
       Multiple Comparison of Means - Tukey HSD,FWER=0.05      
===============================================================
    group1         group2     meandiff  lower    upper   reject
---------------------------------------------------------------
  Beverages      Condiments   -10.7792 -44.8653 23.3068  False 
  Beverages     Confections   -17.0178 -46.9241 12.8885  False 
  Beverages    Dairy Products  -0.22   -29.4018 28.9618  False 
  Beverages    Grains/Cereals -20.7656 -55.9666 14.4354  False 
  Beverages     Meat/Poultry  41.4822   4.7394   78.225   True 
  Beverages       Produce      -7.332  -47.4219 32.7579  False 
  Beverages       Seafood     -14.787  -44.7924 15.2183  False 
  Condiments    Confections   -6.2386  -41.5471 29.0699  False 
  Condiments   Dairy Products 10.5592  -24.1378 45.2562  False 
  Condiments   Grains/Cereals -9.9864  -49.879  29.9062  False 
  Condiments    Meat/Poultry  52.2614   11.002  93.5208   True 
  Condiments      Produce      3.4472  -40.8191 47.7135  False 
  Condiments      Seafood     -4.0078  -39.4003 31.3846  False 
 Confections   Dairy Products 16.7978  -13.8031 47.3986  False 
 Confections   Grains/Cereals -3.7478  -40.1339 32.6382  False 
 Confections    Meat/Poultry    58.5   20.6203  96.3796   True 
 Confections      Produce      9.6858  -31.4485 50.8201  False 
 Confections      Seafood      2.2307  -29.1564 33.6179  False 
Dairy Products Grains/Cereals -20.5456 -56.3386 15.2474  False 
Dairy Products  Meat/Poultry  41.7022   4.3919  79.0125   True 
Dairy Products    Produce      -7.112  -47.7226 33.4987  False 
Dairy Products    Seafood     -14.567  -45.2647 16.1306  False 
Grains/Cereals  Meat/Poultry  62.2478  20.0626  104.4331  True 
Grains/Cereals    Produce     13.4336  -31.6968 58.5641  False 
Grains/Cereals    Seafood      5.9786  -30.4889 42.4461  False 
 Meat/Poultry     Produce     -48.8142 -95.1573 -2.4711   True 
 Meat/Poultry     Seafood     -56.2692 -94.2271 -18.3113  True 
   Produce        Seafood     -7.4551  -48.6615 33.7514  False 
---------------------------------------------------------------

The results above show that 'Meat/Poultry' products have a significantly higher amount of savings than every other category. This supports our initial observation of 'Meat/Poultry' products having the highest amount of savings on average.

To quantify the difference between 'Meat/Poultry' products and all other categories we can calculate Cohen's d for effect size.

g1 = q6_data['Saved'][q6_data['CategoryName'] == 'Meat/Poultry']
print("Cohen's d - Comparison with Meat/Poultry for Amount Saved")
for i in q6_data['CategoryName'].unique():
    if i == 'Meat/Poultry':
        continue
    g2 = q6_data['Saved'][q6_data['CategoryName'] == i]
    diff = g1.mean() - g2.mean()
    n1, n2 = len(g1), len(g2)
    var1 = g1.var()
    var2 = g2.var()
    # n-weighted pooled variance; note the textbook formula weights by
    # (n - 1) and divides by n1 + n2 - 2
    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
    d = diff / np.sqrt(pooled_var)
    print(i + " - " + str(d))
Cohen's d - Comparison with Meat/Poultry for Amount Saved
Dairy Products - 0.2378370352889182
Grains/Cereals - 0.340686816138561
Produce - 0.23932959811169524
Seafood - 0.3448611182986366
Condiments - 0.28101210499961193
Confections - 0.3629350323586184
Beverages - 0.1931511976251025

The values of d range from ~0.19 to ~0.36, which by Cohen's conventions (roughly 0.2 for a small effect, 0.5 medium, 0.8 large) indicates at least a small effect size for each comparison. We can also calculate Cohen's d comparing 'Meat/Poultry' products against all other products together.

g2 = q6_data['Saved'][q6_data['CategoryName'] != 'Meat/Poultry']
diff = g1.mean() - g2.mean()
n1, n2 = len(g1), len(g2)
var1 = g1.var()
var2 = g2.var()
pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
d = diff / np.sqrt(pooled_var)
print("Cohen's d: Meat/Poultry vs. all - " + str(d))
Cohen's d: Meat/Poultry vs. all - 0.3791998107081624

From this we got a value for d of ~0.38, which suggests a small-to-medium effect.
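
For reference, the textbook definition of Cohen's d pools the variances with (n - 1) weights rather than n. A minimal sketch of that variant, reusing diff, n1, n2, var1, and var2 from the cell above (the resulting d would differ slightly from the simplified version used here):

# Classical Cohen's d: unbiased pooled variance with (n - 1) weights
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
d = diff / np.sqrt(pooled_var)
print("Cohen's d (classical pooling): " + str(d))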

Our findings show that Meat/Poultry products have significantly higher savings from discounts than any other category.

Conclusion

In this project we investigated several questions revolving around product discounts, the different categories of products, and the employees who handled each order. Through our investigations we learned the following about the company's sales:

1) Providing a discount for an order increases the quantity of product ordered.
2) The level of discount does not affect the quantity of product ordered.
3) The use of a discount has the same effectiveness across all categories of product.
4) We cannot recommend an optimal discount level given the unit price of a product.
5) There is some variance in the performance of the employees but we cannot say that any one employee performs better on average than another.
6) No employees excel at selling a particular category of product compared to the other employees.
7) Meat and poultry products by far provide the highest amount in savings for customers.

Using these findings, the company can increase their revenue by providing a discount on more orders without worrying about the amount of discount to offer, the category of product, or the product's unit price. They can also see that their employees all perform at relatively the same level, and that none of them specialize in selling a particular type of product. In addition, their meat and poultry products offer customers the most savings, a fact they could advertise to boost sales and brand awareness.
