Skip to content

Latest commit

 

History

History
598 lines (468 loc) · 22 KB

problem_set_3.md

File metadata and controls

598 lines (468 loc) · 22 KB

Problem Set 3 - Targeting and Scoring

Nicolas
April 21, 2016

To validate Module 3 and correctly answer the questions will require that you perform the following exercise first: take the code in the file named module3.R, and modify the probability model such as the predictors taken into account exclusively include (1) recency, (2) the log of recency, (3) frequency, (4) the log of frequency. Also, list the detailed predictions made for the first 10 customers.

Computing predictors and target variables

Loading the dataset

data <- read.delim(file = 'purchases.txt', header = FALSE, sep = '\t', dec = '.')

Adding headers and interpret the last column as a date, extract year of purchase

colnames(data) <- c('customer_id', 'purchase_amount', 'date_of_purchase')
data$date_of_purchase <- as.Date(data$date_of_purchase, "%Y-%m-%d")
data$year_of_purchase <- as.numeric(format(data$date_of_purchase, "%Y"))
data$days_since       <- as.numeric(difftime(time1 = "2016-01-01",
                                            time2 = data$date_of_purchase,
                                            units = "days"))

Computing key marketing indicators

Let's compute key marketing indicators: RFM variables as of a year ago, in 2014

library(dplyr)

customers_2014 <- data %>% filter(days_since > 365) %>% 
                  group_by(customer_id) %>% 
                  summarize(   # creates new variables:
                    recency = min(days_since) - 365,
                    first_purchase = max(days_since) - 365,
                    frequency = n(),
                    avg_amount = mean(purchase_amount),
                    max_amount = max(purchase_amount)    ) 

Here we are extracting all the predictors that we will use in the predictive model. Predictors are variables computed a year ago.

Computing revenues generated by customers in 2015

revenue_2015 <- data %>% filter(year_of_purchase == 2015) %>% 
                  group_by(customer_id) %>% 
                  summarize(   # creates a new variable:
                    revenue_2015 = sum(purchase_amount)  ) 

Here we are extracting the target variable. The next step is to merge both predictors and target variable to create the calibration dataset (ie the training dataset).

Merging 2014 customers and 2015 revenue datasets

# in_sample should be named training_df
# out_sample should be named test_df
in_sample <- merge(customers_2014, revenue_2015, all.x = TRUE)
in_sample$revenue_2015[is.na(in_sample$revenue_2015)] <- 0

# we add a new variable 'active_2015': 1 if there was a revenue in 2015, 0 otherwise
in_sample$active_2015 <- as.numeric(in_sample$revenue_2015 > 0)

in_sample dataset includes now the following data:

  • from customer_2014:
    • first_purchase
    • frequency
    • avg_amount
    • max_amount
  • from revenue_2015:
    • revenue_2015
  • active_2015 (1 or 0)

Displaying calibration (in-sample) data (i.e. the training dataset)

head(in_sample)
##   customer_id   recency first_purchase frequency avg_amount max_amount
## 1          10 3463.9583       3463.958         1       30.0         30
## 2          80  301.9583       3385.958         6       70.0         80
## 3          90  392.9583       3417.958        10      115.8        153
## 4         120 1035.9583       1035.958         1       20.0         20
## 5         130 2604.9583       3344.958         2       50.0         60
## 6         160 2597.9583       3211.958         2       30.0         30
##   revenue_2015 active_2015
## 1            0           0
## 2           80           1
## 3            0           0
## 4            0           0
## 5            0           0
## 6            0           0
summary(in_sample)
##   customer_id        recency         first_purchase       frequency     
##  Min.   :    10   Min.   :   0.958   Min.   :   0.958   Min.   : 1.000  
##  1st Qu.: 77710   1st Qu.: 257.958   1st Qu.: 795.958   1st Qu.: 1.000  
##  Median :127140   Median : 889.958   Median :1890.958   Median : 2.000  
##  Mean   :127315   Mean   :1122.545   Mean   :1788.369   Mean   : 2.665  
##  3rd Qu.:181800   3rd Qu.:1867.958   3rd Qu.:2694.958   3rd Qu.: 3.000  
##  Max.   :245840   Max.   :3648.958   Max.   :3650.958   Max.   :40.000  
##    avg_amount        max_amount       revenue_2015      active_2015    
##  Min.   :   5.00   Min.   :   5.00   Min.   :   0.00   Min.   :0.0000  
##  1st Qu.:  21.25   1st Qu.:  25.00   1st Qu.:   0.00   1st Qu.:0.0000  
##  Median :  30.00   Median :  30.00   Median :   0.00   Median :0.0000  
##  Mean   :  55.53   Mean   :  65.72   Mean   :  21.22   Mean   :0.2299  
##  3rd Qu.:  50.00   3rd Qu.:  60.00   3rd Qu.:   0.00   3rd Qu.:0.0000  
##  Max.   :4500.00   Max.   :4500.00   Max.   :4500.00   Max.   :1.0000

Fitting the two models

1. Fitting the first model: the probability model

Here, we create a model to predict whether a customer is active in 2015 or not, based on the customer data of 2014.

The outcome of the prediction is either 1 or 0 (ie. two classes). This is a classification problem that can be solved using a logistic regression. The following code fits the above training set to a logistic regression model.

As required in this problem set, the predictors are now: (1) recency, (2) the log of recency, (3) frequency, (4) the log of frequency.

prob.model <- glm(active_2015 ~ recency + log(recency) + frequency + log(frequency),
    family=binomial(link='logit'),
    data=in_sample)

summary(prob.model)
## 
## Call:
## glm(formula = active_2015 ~ recency + log(recency) + frequency + 
##     log(frequency), family = binomial(link = "logit"), data = in_sample)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1792  -0.5792  -0.2533  -0.0920   3.5471  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     3.147e-01  8.437e-02    3.73 0.000191 ***
## recency        -1.347e-03  6.378e-05  -21.12  < 2e-16 ***
## log(recency)   -2.223e-01  1.810e-02  -12.29  < 2e-16 ***
## frequency      -1.394e-02  1.885e-02   -0.74 0.459381    
## log(frequency)  8.788e-01  7.205e-02   12.20  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 18228  on 16904  degrees of freedom
## Residual deviance: 12051  on 16900  degrees of freedom
## AIC: 12061
## 
## Number of Fisher Scoring iterations: 6

Then shows the coefficients of the model

coef <- summary(prob.model)$coefficients[,1]  
# see to find out how to access this field: 
# attributes( summary( prob.model )$coefficients )
coef
##    (Intercept)        recency   log(recency)      frequency log(frequency) 
##    0.314704392   -0.001346672   -0.222337680   -0.013943181    0.878833064

The coefficient (weight) of the recency is negative, which makes sense: the larger the recency, the smaller the probability the customer will make another purchase in the future.

Then shows its standard deviation:

#std  <- summary(prob.model)$standard.errors
std <- summary(prob.model)$coefficients[,2]
std
##    (Intercept)        recency   log(recency)      frequency log(frequency) 
##   8.436725e-02   6.377517e-05   1.809762e-02   1.884553e-02   7.204729e-02

and the ratio of the coefficient divided by its standard deviation:

coef / std
##    (Intercept)        recency   log(recency)      frequency log(frequency) 
##      3.7301725    -21.1159242    -12.2854634     -0.7398668     12.1980031

As a rule, when coef/std is above 2 or below -2, the parameter value is significant (within a 95% confidence interval, ie Z = +/- 1.96). Here, the impact of frequency on the predictions is limited.

2. Fitting the second model: the monetary model

Here, we create a model to predict the revenue generated by a customer in 2015, based on the average amount and maximum amount he spent in 2014.

First, let's select from the training set only customers who made a purchase in 2015:

z <- which(in_sample$active_2015 == 1)    # store the index of active customer in 2015
head(in_sample[z, ])
##    customer_id    recency first_purchase frequency avg_amount max_amount
## 2           80  301.95833       3385.958         6   70.00000         80
## 18         480   15.95833       3312.958        11   62.27273        235
## 30         830  266.95833       3373.958         6   48.33333         60
## 31         850   61.95833       3050.958         8   28.12500         30
## 32         860  266.95833       3642.958         9   53.33333         60
## 39        1020 1461.95833       1826.958         3   73.33333        100
##    revenue_2015 active_2015
## 2            80           1
## 18           45           1
## 30           50           1
## 31           60           1
## 32           60           1
## 39          120           1
summary(in_sample[z, ])
##   customer_id        recency         first_purchase       frequency     
##  Min.   :    80   Min.   :   0.958   Min.   :   0.958   Min.   : 1.000  
##  1st Qu.: 78590   1st Qu.:  22.958   1st Qu.: 649.708   1st Qu.: 2.000  
##  Median :143550   Median :  96.958   Median :1603.958   Median : 4.000  
##  Mean   :134907   Mean   : 306.305   Mean   :1636.122   Mean   : 4.741  
##  3rd Qu.:194363   3rd Qu.: 327.958   3rd Qu.:2665.958   3rd Qu.: 7.000  
##  Max.   :236660   Max.   :3543.958   Max.   :3646.958   Max.   :40.000  
##    avg_amount        max_amount       revenue_2015     active_2015
##  Min.   :   5.00   Min.   :   5.00   Min.   :   5.0   Min.   :1   
##  1st Qu.:  30.00   1st Qu.:  30.00   1st Qu.:  30.0   1st Qu.:1   
##  Median :  40.00   Median :  50.00   Median :  50.0   Median :1   
##  Mean   :  67.78   Mean   :  88.33   Mean   :  92.3   Mean   :1   
##  3rd Qu.:  60.00   3rd Qu.:  80.00   3rd Qu.: 100.0   3rd Qu.:1   
##  Max.   :4500.00   Max.   :4500.00   Max.   :4500.0   Max.   :1

Then let's fit the monetary model with the training set: revenue generated by a customer in 2015, as a function of its average amount and its maximum amount. Since the revenue can take a range of value (not only either 1 or 0), this is a regression problem that can be solved here using a linear model (lm):

amount.model <- lm(
  formula = revenue_2015 ~ avg_amount + max_amount, 
  data = in_sample[z, ])     # fits the model only to active customers in 2015 in the training set

summary(amount.model)
## 
## Call:
## lm(formula = revenue_2015 ~ avg_amount + max_amount, data = in_sample[z, 
##     ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2138.5   -20.1   -16.6     0.1  3361.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20.74709    2.38112   8.713   <2e-16 ***
## avg_amount   0.67486    0.03280  20.575   <2e-16 ***
## max_amount   0.29225    0.02363  12.367   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 136.6 on 3883 degrees of freedom
## Multiple R-squared:  0.6054,	Adjusted R-squared:  0.6052 
## F-statistic:  2979 on 2 and 3883 DF,  p-value: < 2.2e-16

All statistics (standard error, t value) are highly significant. The coefficient of determination R^2 value is 0.60.

Plotting the results of the monetary model

In order to validate our our model (on the training set), let's plot the prediction of revenue generated by 2015 customers, computed with the linear model, against the revenue generated by 2015 customers (the ground truth):

plot(x = in_sample[z, ]$revenue_2015,
     y = amount.model$fitted.values)

# Note: ggplot requires data from the same dataframe. Simpler to use plot here.

If our model is accurate, we should see linear relation between the prediction and the ground truth, which is not the case here: most customers have spent small amounts,few outliers have spent huge amounts.

Re-fitting the monetary model, using now a log-transform

Let's try again with a log-transform:

amount.model <- lm(
  formula = log(revenue_2015) ~ log(avg_amount) + log(max_amount), 
  data = in_sample[z, ])

summary(amount.model)
## 
## Call:
## lm(formula = log(revenue_2015) ~ log(avg_amount) + log(max_amount), 
##     data = in_sample[z, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7866 -0.1811 -0.0770  0.1852  3.5656 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.37000    0.04003   9.242   <2e-16 ***
## log(avg_amount)  0.54881    0.04167  13.171   <2e-16 ***
## log(max_amount)  0.38813    0.03796  10.224   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4781 on 3883 degrees of freedom
## Multiple R-squared:  0.6927,	Adjusted R-squared:  0.6926 
## F-statistic:  4377 on 2 and 3883 DF,  p-value: < 2.2e-16

We can see that the R^2 has improved to 0.69 meaning that we fit the data much better.

Plotting the results of this new monetary model

plot(x = log(in_sample[z, ]$revenue_2015), 
     y = amount.model$fitted.values)      

This is much better. We put less weight to the very large values and more weight to the smaller values.


Applying the models on today's data

Now that we fitted our two models (prob.model and amount.model) to 2015's data (as a function of 2014's data), we want to apply the two models to 2015's data in order to predict future behavior of customers in 2016.

Computing RFM variables as of today, in 2015

customers_2015 <- data %>% group_by(customer_id) %>% 
                  summarize(   # creates new variables:
                    recency = min(days_since),
                    first_purchase = max(days_since),
                    frequency = n(),
                    avg_amount = mean(purchase_amount),
                    max_amount = max(purchase_amount)    ) 

Predicting the target variables based on today's data

Now we are predicting the probability that a customer will be active in 2016 (ie the probability that the outcome is 1):

customers_2015$prob_predicted <- predict(object = prob.model, 
                                         newdata = customers_2015, 
                                         type = "response")  # argument to get the probability that the outcome is 1, using the glm function

and predicting the revenue generated by a customer in 2016 (taking the exp of the prediction, since we fitted the model using a log transform):

customers_2015$revenue_predicted <- exp(predict(object = amount.model, 
                                                newdata = customers_2015))

Then we compute the score of customers, defined as the probability that a customer will be active in 2016 times the predicted revenue generated in 2016:

customers_2015$score_predicted <- customers_2015$prob_predicted * customers_2015$revenue_predicted

If there is a 10% chance that a customer generates $100, the score will be $10.

Then displays results:

summary(customers_2015$prob_predicted)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0009581 0.0194200 0.0963900 0.2252000 0.3638000 0.9532000

Customers has a 22.5% chance of a customer being active in 2016, on average.

summary(customers_2015$revenue_predicted)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.54   29.00   35.05   65.63   57.30 3833.00

Customers are predicted to generate a revenue of $65 in 2016, on average. The minimum revenue is $6.5 (and not $0 because it assumes the customer will spend something).

summary(customers_2015$score_predicted)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.0113    0.6806    4.0730   18.2500   17.6000 2728.0000

The score has a mean of 18.2. This value is extremely important from a managerial standpoint: on average, every customer will spend $18.2 next year, in 2016. This is the expected revenue in 2016.

library(ggplot2)

ggplot(data=customers_2015, aes(score_predicted)) + 
  geom_histogram(binwidth = 5, color = I('black'), fill= I('orange')) +
  scale_x_continuous(limits = c(0, 200),
                     breaks = seq(0, 200, 20))
## Warning: Removed 188 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

Questions

1/ In the modified model, which of the following predictors does not seem to bear a significant influence on probability predictions?

summary(prob.model)
## 
## Call:
## glm(formula = active_2015 ~ recency + log(recency) + frequency + 
##     log(frequency), family = binomial(link = "logit"), data = in_sample)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1792  -0.5792  -0.2533  -0.0920   3.5471  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     3.147e-01  8.437e-02    3.73 0.000191 ***
## recency        -1.347e-03  6.378e-05  -21.12  < 2e-16 ***
## log(recency)   -2.223e-01  1.810e-02  -12.29  < 2e-16 ***
## frequency      -1.394e-02  1.885e-02   -0.74 0.459381    
## log(frequency)  8.788e-01  7.205e-02   12.20  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 18228  on 16904  degrees of freedom
## Residual deviance: 12051  on 16900  degrees of freedom
## AIC: 12061
## 
## Number of Fisher Scoring iterations: 6

Frequency has a small Z_value (between -2 and 2) and hence doesn't seem to bear a significant influence on probability predictions. Recency seems to have the most impact on the predictions, with a Z_value = -21.

2/ Suppose a customer has a 20% probability of being active, and a predicted revenue of $200 if he is. Compute the score of this costumer.

Score = probability x predicted revenue = 20% x $200 = $40.
The score of that customer is $40 (ie on average, this customer is expected to generate $40 in revenue next year). This customer has a 80% chance of not buying anything next year.

3/ Looking at the predictions made for the first 10 customers in the database, which of the following statements is correct?

head(customers_2015, 10)
## Source: local data frame [10 x 9]
## 
##    customer_id   recency first_purchase frequency avg_amount max_amount
##          (int)     (dbl)          (dbl)     (int)      (dbl)      (dbl)
## 1           10 3828.9583       3828.958         1   30.00000         30
## 2           80  342.9583       3750.958         7   71.42857         80
## 3           90  757.9583       3782.958        10  115.80000        153
## 4          120 1400.9583       1400.958         1   20.00000         20
## 5          130 2969.9583       3709.958         2   50.00000         60
## 6          160 2962.9583       3576.958         2   30.00000         30
## 7          190 2210.9583       3689.958         5   68.00000        100
## 8          220 2057.9583       3663.958         2   25.00000         30
## 9          230 3984.9583       3984.958         1   50.00000         50
## 10         240  462.9583       3814.958         4   16.25000         20
## Variables not shown: prob_predicted (dbl), revenue_predicted (dbl),
##   score_predicted (dbl)

Customer #80 is more likely than not to make a purchase next year. Customer #190 has made 5 purchases so far. Customer #230 has a score of 0.0564. If active, customer #130 is expected to spend $60.7 on average.

4/ Among the following customers, which one is expected to be the most profitable for the firm?

The one who has the highest score is the most profitable for the firm. For the first 10 customers in the database, it is customer #90. See table above.

5/ In the probability model, why is the sign of the recency parameter expected to be negative?

summary(prob.model)
## 
## Call:
## glm(formula = active_2015 ~ recency + log(recency) + frequency + 
##     log(frequency), family = binomial(link = "logit"), data = in_sample)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1792  -0.5792  -0.2533  -0.0920   3.5471  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     3.147e-01  8.437e-02    3.73 0.000191 ***
## recency        -1.347e-03  6.378e-05  -21.12  < 2e-16 ***
## log(recency)   -2.223e-01  1.810e-02  -12.29  < 2e-16 ***
## frequency      -1.394e-02  1.885e-02   -0.74 0.459381    
## log(frequency)  8.788e-01  7.205e-02   12.20  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 18228  on 16904  degrees of freedom
## Residual deviance: 12051  on 16900  degrees of freedom
## AIC: 12061
## 
## Number of Fisher Scoring iterations: 6

The negative sign of the recency parameter ('estimate') means that the larger the recency, the smaller the probability that a customer will be active the following year.

6/ How many customers have an expected revenue of more than $50

z = which(customers_2015$score_predicted > 50)
length(z)
## [1] 1394

1394 customers have an expected revenue of more than $50 in 2016. We have the list of customers on which we should spend the most marketing dollar (ie the customers with the highest score).