bayesian_project.Rmd

---
title: "Bayesian modeling and prediction for movies"
output: 
  html_document: 
    fig_height: 4
    toc: true
    highlight: pygments
    theme: spacelab
---


### Submission by Connor Lenio. Email: cojamalo@gmail.com
Completion Date: May 23, 2017


## Setup
```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,fig.align='center', message = FALSE, warning=FALSE, fig.width=9, fig.height = 5)
```

### Load packages

```{r load-packages, message = FALSE}
library(ggplot2)
library(statsr)
library(BAS)
library(pander)
library(gridExtra)
library(GGally)
library(dplyr)
```

<br>

### Load data

```{r load-data}
load("movies.Rdata")
```


* * *

## Part 1: Data

The `movies` dataset supplied for this project includes 651 randomly sampled movies released in the United States from 1970-2016. No other information is given as to how the random sampling was conducted. The data likely represents a simple random sample (SRS) of movies that does not use any stratification or blocking strategies. For instance, there are varying amounts of movies for each theater release year as well as different proportions of movies for each genre than is likely present in the population of all movies released from 1970-2016. Thus, there are no controls in place for the SRS to closely resemble actual distributions of movies from 1970-2016. This means the data is highly susceptible to sampling variation when considering its representatives of all movies from this time.  

Moreover, a lack of strategic sampling design means any sources of dependency between individual movies such as the fact they are part of the same series, such as is the case for sequels, is not controlled for and considered in this analysis. In addition, the measures of movie popularity such as the Rotten Tomatoes scores and the IMDB rating are themselves samples of opinions. These websites are reliant on voluntary participation by users and, therefore, these popularity scores are also possibly biased, and such rates of bias could be different for each individual movie. Therefore, any conclusions drawn from the data are generalizable to US movies since a simple random sample was used. However, there are many sources of lingering uncertainty in the external validity of the data considering the potential sources of bias a lack of a strategic sampling design introduces.

The simple random sample does not involve random assignment of movies to the factors under consideration. In addition, the data do not represent an experiment or observational study, and, thus, there are limitations to what sort of conclusions can be drawn from the data. Any identified associations will be complicated by lurking variables and bias. It is impossible to rule out other confounding factors that may affect any discovered associations between the factors in the analysis. Therefore, no causality can be determined from inference on this data, only evidence for associations. Such relationships are useful for hypothesis formation for the design of future studies that may establish causality. Even without causality, the relationships present in this data may be informative about features of modern US movies and their popularity.

* * *

## Part 2: Data Manipulation

Before the new variables are added, the `genre` variable is updated to be more informative. In viewing the `genre` variable, the factor level "Other" is uninformative, and, thus, any drama movies were converted from the "Other" genre to "Drama" using information from Rotten Tomatoes. Moreover, runtime information is also updated and the entire dataset is copied before adding the new variables.
```{r fix-genre}
# Movies with genre "Other" that are actually dramas
movies$genre[which(movies$title == "Django Unchained")] <- "Drama"
movies$genre[which(movies$title == "The Color of Money")] <- "Drama"
movies$genre[which(movies$title == "Down in the Valley")] <- "Drama"
movies$genre[which(movies$title == "The Fighter")] <- "Drama"
movies$genre[which(movies$title == "Urban Cowboy")] <- "Drama"
# Movie where runtime is NA, but actual runtime is known
movies$runtime[which(movies$title == "The End of America")] <- 71
# Make a copy of data before adding new variables
movies_old <- movies
```

<br>

The following code adds the new, required variables using the `mutate` function and if-else statements. Only the variables requested by the assignment are kept after this step.
```{r data-clean-1}
# New categorical variables added
movies <- movies %>% mutate(feature_film = factor(ifelse(title_type == "Feature Film", "yes", "no")), 
                            drama = factor(ifelse(genre == "Drama", "yes", "no")), 
                            mpaa_rating_R = factor(ifelse(mpaa_rating == "R", "yes", "no"))) 
# New time-based variables added
movies <- movies %>% mutate(oscar_season = factor(ifelse(thtr_rel_month > 9,"yes", "no")),
                            summer_season = factor(ifelse(thtr_rel_month > 4 & thtr_rel_month < 9,"yes", "no")))
# Only the required variables are selected for
movies <- movies %>% select(audience_score, feature_film, drama, runtime, mpaa_rating_R, thtr_rel_year, oscar_season, summer_season,imdb_rating, imdb_num_votes, critics_score, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box)
```
The data is now ready for analysis.

<br>


* * *

## Part 3: Exploratory data analysis

The response variable for this project is the quantitative variable, `audience_score`, the Rotten Tomatoes movie score by its users. Before a Bayesian model is fit, the relationships between `audience_score` and the newly added variables from Part 2 are explored.

### New Factor Variable: `feature_film`
The following code constructs both a summary statistic table and a side-by-side boxplot and density plot to explore the relationship of a categorical explanatory variable with a quantitative response variable. The same code is used for the other new categorical variables, but only the first instance is shown for the sake of brevity.
```{r feature-film, results="asis"}
# Summary table of n, mean, and variance by factor group
sum_table <- movies %>% 
            tbl_df %>%
            group_by(feature_film) %>%  
            summarize(n = n(), Mean = round(mean(audience_score),1), 
                      Variance = round(var(audience_score), 1))
pandoc.table(sum_table)
```

<br>

```{r feature-film-2}
boxplot <- ggplot(data = movies, aes(x = feature_film, y = audience_score, fill = factor(feature_film), color = factor(feature_film))) +
  geom_boxplot(alpha =0.6) + 
    theme(legend.position=c(0.2, 0.4), plot.title = element_text(hjust = 0.5)) +
    labs(title = "Boxplot of `audience_score` by `feature_film`", y = "`audience_score`", x = "`feature_film`", fill = "`feature_film`",color = "`feature_film`")

density <- ggplot(movies, aes(audience_score)) + geom_density(aes(fill = factor(feature_film),color = factor(feature_film)), 
                                                      alpha = 0.6) +
        geom_vline(xintercept = sum_table[[1,3]], color = "#F8766D") +
    geom_vline(xintercept = sum_table[[1,3]] + sqrt(sum_table[[1,4]]), color = "#F8766D", lty = 2, alpha =0.6) +
    geom_vline(xintercept = sum_table[[1,3]] - sqrt(sum_table[[1,4]]), color = "#F8766D", lty = 2, alpha =0.6) +
    geom_vline(xintercept = sum_table[[2,3]] , color = "#00BFC4") +
    geom_vline(xintercept = sum_table[[2,3]]  + sqrt(sum_table[[2,4]]), color = "#00BFC4", lty = 2, alpha =0.6) +
    geom_vline(xintercept = sum_table[[2,3]]  - sqrt(sum_table[[2,4]]), color = "#00BFC4", lty = 2, alpha =0.6) +
    theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
    labs(title = "Density of `audience_score` by `feature_film`", y = "P(`audience_score`)", x = "`audience_score`", fill = "`feature_film`",color = "`feature_film`")

grid.arrange(boxplot, density,ncol = 2)
```
While there are relatively few non-feature films in the data (60/651), there does appear to be a relationship between `feature_film` and `audience_score`. Movies that are not feature films have an average of 20.5 greater audience score than those movies that are feature films. In addition, accounting for the variance and IQR of the data, the scores of the two groups appear distinct but inferential techniques are necessary to establish a significant difference in score between the two groups.

<br>

### New Factor Variable: `drama`

```{r drama, echo=FALSE, results="asis"}
sum_table <- movies %>% 
            tbl_df %>%
            group_by(drama) %>%  
            summarize(n = n(), Mean = round(mean(audience_score),1), 
                      Variance = round(var(audience_score), 1))
pandoc.table(sum_table)
```

<br>

```{r drama-2, echo=FALSE}
boxplot <- ggplot(data = movies, aes(x = drama, y = audience_score, fill = factor(drama), color = factor(drama))) +
  geom_boxplot(alpha =0.6) + 
    theme(legend.position=c(0.5, 0.2), plot.title = element_text(hjust = 0.5)) +
    labs(title = "Boxplot of `audience_score` by `drama`", y = "`audience_score`", x = "`drama`", fill = "`drama`",color = "`drama`")

density <- ggplot(movies, aes(audience_score)) + geom_density(aes(fill = factor(drama),color = factor(drama)), 
                                                      alpha = 0.6) +
        geom_vline(xintercept = sum_table[[1,3]], color = "#F8766D") +
    geom_vline(xintercept = sum_table[[1,3]] + sqrt(sum_table[[1,4]]), color = "#F8766D", lty = 2, alpha =0.6) +
    geom_vline(xintercept = sum_table[[1,3]] - sqrt(sum_table[[1,4]]), color = "#F8766D", lty = 2, alpha =0.6) +
    geom_vline(xintercept = sum_table[[2,3]] , color = "#00BFC4") +
    geom_vline(xintercept = sum_table[[2,3]]  + sqrt(sum_table[[2,4]]), color = "#00BFC4", lty = 2, alpha =0.6) +
    geom_vline(xintercept = sum_table[[2,3]]  - sqrt(sum_table[[2,4]]), color = "#00BFC4", lty = 2, alpha =0.6) +
    theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
    labs(title = "Density of `audience_score` by `drama`", y = "P(`audience_score`)", x = "`audience_score`", fill = "`drama`",color = "`drama`")

grid.arrange(boxplot, density,ncol = 2)
```
About half of the movies in the dataset are dramas. There does not appear to be a relationship between `drama` and `audience_score`. As the plots indicate, there is great overlap in both the IQR and probability densities of the two groups. Inferential techniques are necessary to establish a significant difference in score between the two groups.


<br>

### New Factor Variable: `mpaa_rating_R`
```{r mpaa-rating, echo=FALSE, results="asis"}
sum_table <- movies %>% 
            tbl_df %>%
            group_by(mpaa_rating_R) %>%  
            summarize(n = n(), Mean = round(mean(audience_score),1), 
                      Variance = round(var(audience_score), 1))
pandoc.table(sum_table)
```

<br>

```{r mpaa-rating-2, echo=FALSE}
boxplot <- ggplot(data = movies, aes(x = mpaa_rating_R, y = audience_score, fill = factor(mpaa_rating_R), color = factor(mpaa_rating_R))) +
  geom_boxplot(alpha =0.6) + 
    theme(legend.position=c(0.5, 0.2), plot.title = element_text(hjust = 0.5)) +
    labs(title = "Boxplot of `audience_score` by `mpaa_rating_R`", y = "`audience_score`", x = "`mpaa_rating_R`", fill = "`mpaa_rating_R`",color = "`mpaa_rating_R`")

density <- ggplot(movies, aes(audience_score)) + geom_density(aes(fill = factor(mpaa_rating_R),color = factor(mpaa_rating_R)), 
                                                      alpha = 0.6) +
        geom_vline(xintercept = sum_table[[1,3]], color = "#F8766D") +
    geom_vline(xintercept = sum_table[[1,3]] + sqrt(sum_table[[1,4]]), color = "#F8766D", lty = 2, alpha =0.6) +
    geom_vline(xintercept = sum_table[[1,3]] - sqrt(sum_table[[1,4]]), color = "#F8766D", lty = 2, alpha =0.6) +
    geom_vline(xintercept = sum_table[[2,3]] , color = "#00BFC4") +
    geom_vline(xintercept = sum_table[[2,3]]  + sqrt(sum_table[[2,4]]), color = "#00BFC4", lty = 2, alpha =0.6) +
    geom_vline(xintercept = sum_table[[2,3]]  - sqrt(sum_table[[2,4]]), color = "#00BFC4", lty = 2, alpha =0.6) +
    theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
    labs(title = "Density of `audience_score` by `mpaa_rating_R`", y = "P(`audience_score`)", x = "`audience_score`", fill = "`mpaa_rating_R`",color = "`mpaa_rating_R`")

grid.arrange(boxplot, density,ncol = 2)
```
About half of the movies in the dataset are rated "R" by the MPAA. There does not appear to be a relationship between `mpaa_rating_R` and `audience_score`. As the plots indicate, there is great overlap in both the IQR and probability densities of the two groups. Inferential techniques are necessary to establish a significant difference in score between the two groups.

<br>

### New Factor Variable: `oscar_season`
```{r oscar-season, echo=FALSE, results="asis"}
sum_table <- movies %>% 
            tbl_df %>%
            group_by(oscar_season) %>%  
            summarize(n = n(), Mean = round(mean(audience_score),1), 
                      Variance = round(var(audience_score), 1))
pandoc.table(sum_table)
```
<br>

```{r oscar-season-2, echo=FALSE}
boxplot <- ggplot(data = movies, aes(x = oscar_season, y = audience_score, fill = factor(oscar_season), color = factor(oscar_season))) +
  geom_boxplot(alpha =0.6) + 
    theme(legend.position=c(0.5, 0.2), plot.title = element_text(hjust = 0.5)) +
    labs(title = "Boxplot of `audience_score` by `oscar_season`", y = "`audience_score`", x = "`oscar_season`", fill = "`oscar_season`",color = "`oscar_season`")

density <- ggplot(movies, aes(audience_score)) + geom_density(aes(fill = factor(oscar_season),color = factor(oscar_season)), 
                                                      alpha = 0.6) +
        geom_vline(xintercept = sum_table[[1,3]], color = "#F8766D") +
    geom_vline(xintercept = sum_table[[1,3]] + sqrt(sum_table[[1,4]]), color = "#F8766D", lty = 2, alpha =0.6) +
    geom_vline(xintercept = sum_table[[1,3]] - sqrt(sum_table[[1,4]]), color = "#F8766D", lty = 2, alpha =0.6) +
    geom_vline(xintercept = sum_table[[2,3]] , color = "#00BFC4") +
    geom_vline(xintercept = sum_table[[2,3]]  + sqrt(sum_table[[2,4]]), color = "#00BFC4", lty = 2, alpha =0.6) +
    geom_vline(xintercept = sum_table[[2,3]]  - sqrt(sum_table[[2,4]]), color = "#00BFC4", lty = 2, alpha =0.6) +
    theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
    labs(title = "Density of `audience_score` by `oscar_season`", y = "P(`audience_score`)", x = "`audience_score`", fill = "`oscar_season`",color = "`oscar_season`")

grid.arrange(boxplot, density,ncol = 2)
```
There are fewer movies released outside of the Oscar season than within it. There does not appear to be a relationship between `oscar_season` and `audience_score`. As the plots indicate, there is great overlap in both the IQR and probability densities of the two groups. Inferential techniques are necessary to establish a significant difference in score between the two groups.

<br>

### New Factor Variable: `summer_season`
```{r summer-season, echo=FALSE, results="asis"}
sum_table <- movies %>% 
            tbl_df %>%
            group_by(summer_season) %>%  
            summarize(n = n(), Mean = round(mean(audience_score),1), 
                      Variance = round(var(audience_score), 1))
pandoc.table(sum_table)
```
<br>

```{r summer-season-2, echo=FALSE}
boxplot <- ggplot(data = movies, aes(x = summer_season, y = audience_score, fill = factor(summer_season), color = factor(summer_season))) +
  geom_boxplot(alpha =0.6) + 
    theme(legend.position=c(0.5, 0.2), plot.title = element_text(hjust = 0.5)) +
    labs(title = "Boxplot of `audience_score` by `summer_season`", y = "`audience_score`", x = "`summer_season`", fill = "`summer_season`",color = "`summer_season`")

density <- ggplot(movies, aes(audience_score)) + geom_density(aes(fill = factor(summer_season),color = factor(summer_season)), 
                                                      alpha = 0.6) +
        geom_vline(xintercept = sum_table[[1,3]], color = "#F8766D") +
    geom_vline(xintercept = sum_table[[1,3]] + sqrt(sum_table[[1,4]]), color = "#F8766D", lty = 2, alpha =0.6) +
    geom_vline(xintercept = sum_table[[1,3]] - sqrt(sum_table[[1,4]]), color = "#F8766D", lty = 2, alpha =0.6) +
    geom_vline(xintercept = sum_table[[2,3]] , color = "#00BFC4") +
    geom_vline(xintercept = sum_table[[2,3]]  + sqrt(sum_table[[2,4]]), color = "#00BFC4", lty = 2, alpha =0.6) +
    geom_vline(xintercept = sum_table[[2,3]]  - sqrt(sum_table[[2,4]]), color = "#00BFC4", lty = 2, alpha =0.6) +
    theme(legend.position=c(0.2, 0.8), plot.title = element_text(hjust = 0.5)) +
    labs(title = "Density of `audience_score` by `summer_season`", y = "P(`audience_score`)", x = "`audience_score`", fill = "`summer_season`",color = "`summer_season`")

grid.arrange(boxplot, density,ncol = 2)
```
There are fewer movies released outside of the summer months than during them. There does not appear to be a relationship between `summer_season` and `audience_score`. As the plots indicate, there is great overlap in both the IQR and probability densities of the two groups. Inferential techniques are necessary to establish a significant difference in score between the two groups.

<br>

While Bayesian modeling will establish if there are any strong relationships between the new variables and `audience_score`, from the exploratory analysis, `feature_film` appears to show the most informative relationship with `audience_score`, with the other new variables appearing to have weak or no relationships with `audience_score`.

<br>

### Linear Fit Checks

Finally, all quantitative variables are plotted against `audience_score` to determine if any linear transformations should be included before modeling.
```{r lin-fit-check}
# Custom plot function for ggpairs scatterplot
my_fn <- function(data, mapping, ...){
    p <- ggplot(data = data, mapping = mapping) + 
        geom_point() + 
        geom_smooth(method=loess, fill="red", color="red", ...) +
        geom_smooth(method=lm, fill="blue", color="blue", ...)
    p
}

movies %>% select(runtime, thtr_rel_year, imdb_rating, critics_score, audience_score) %>% ggpairs(lower = list(continuous = my_fn))
```
Looking at the bottom row of this plot and comparing loess curves (red) to the linear regression lines (blue), the relationships between these quantitative variables and `audience_score` are mostly linear, so no linear transformations will be conducted prior to modeling. However, there are large departures from linearity for the `runtime` value at extremely high and low values of `runtime`. In addition, low values of `imdb_rating` also cause a deviation from normality for its relationship with `audience_score`. These deviations may cause issues in any models that rely heavily on these variables to make predictions. Nevertheless, these departures from normality occur at more extreme values of the variables, so it is worth considering how a potential model performs for the majority of predictions while taking note of these potential sources of prediction bias and variance.

<br>

* * *

## Part 4: Modeling

### Fit the BAS Model

Using the `BAS` package, a Bayesian regression is conducted for the dataset. A BIC prior is used in addition to the uniform model prior:
```{r fit-model-1}
movies.bas =  bas.lm(audience_score ~ ., 
                   data=movies,
                  initprobs = "eplogp",
                   prior="BIC",
                   modelprior=uniform()) 
```

<br>

### Check the Coefficients

To begin assessing this model, the coefficients for each explanatory variable, or the beta values, are reviewed to determine what explanatory variables the Bayesian models favor the most. One way to visualize this information is by plotting the marginal posterior inclusion probabilities, also known as the posterior probabilities that the coefficients' values are not zero. The marginal inclusion probabilities signify a significant chance a relationship exists between the explanatory variable and the response variable, `audience_score`. For this plot, the coefficients that have an inclusion probability greater than 0.5 (chance) are highlighted in red. These explanatory variables have the most statistically significant relationships with `audience_score`. The BMA, or Bayesian Model Average, is used as the reference model for this analysis as it accounts for the greatest uncertainty in model construction by including all the probabilities for every possible model. Thus, every possible explanatory variable will have some coefficient under the BMA and inclusion probability. 
```{r coef-check}
coefs <- coef(movies.bas, estimator = "BMA")
# find posterior probabilities 
coefs_bas <- data.frame(parameter = coefs$namesx, post_mean = coefs$postmean, post_SD = coefs$postsd, post_pne0 = coefs$probne0) %>% arrange(post_pne0) %>% filter(parameter != "Intercept")
coefs_bas$parameter <- factor(coefs_bas$parameter, levels = coefs_bas$parameter[order(coefs_bas$post_pne0, decreasing = TRUE)])
high_pne0 <- data.frame(parameter = coefs_bas$parameter, post_pne0 = coefs_bas$post_pne0) %>% filter(post_pne0 > 0.5)
# Plot the Data
ggplot(coefs_bas, aes(x = parameter, y = post_pne0)) + 
    geom_pointrange(aes(ymax = post_pne0), ymin = 0) +
    geom_pointrange(data=high_pne0, aes(x = parameter, y = post_pne0, ymax = post_pne0), ymin = 0, color = "red") +
    geom_hline(yintercept = 0.5, color = "red") +
    labs(title = "Posterior Marginal Inclusion Probabilities of Explanatory Variables",x="Explanatory Variable",y = "Marginal Inclusion Probability") +
    theme(axis.text.x = element_text(angle = 60, hjust = 1), plot.title = element_text(hjust = 0.5))
```

The two explanatory variables with a posterior probability greater than 0.5 are `imdb_rating` and `critics_score`. These two variables have a high probability of having a relationship with `audience_score`. 

<br>

After identifying the variables with high marginal inclusion probabilities for a model predicting `audience_score`, the coefficients are plotted by their value under the BMA. The variables with marginal inclusion probabilities greater than 0.5 are highlighted in red. The credible intervals for each beta value is also included using a point range plot.
```{r beta-plot}
# find credible intervals for betas
coefs_beta <- data.frame(confint(coefs, c(2:17))[,])
coefs_beta$parameter <- rownames(coefs_beta)
rownames(coefs_beta) <- NULL
coefs_bas <- coefs_bas %>% left_join(coefs_beta)
coefs_bas$parameter <- factor(coefs_bas$parameter, levels = coefs_bas$parameter[order(coefs_bas$beta)])
high_pne0 <- high_pne0 %>% left_join(coefs_bas)
# Plot the data
ggplot(coefs_bas, aes(y = beta, x = parameter)) + 
    geom_pointrange(aes(ymax = X97.5..., ymin = X2.5...)) +
    geom_point(data=high_pne0, aes(y = beta, x = parameter), color = "red", size = 2.5) +
    geom_hline(yintercept = 0, color = "red") +
    labs(title = "Beta Values with 95% Credible Intervals",x="Explanatory Variable",y = "Beta Value") +
    theme(plot.title = element_text(hjust = 0.5)) +
    coord_flip() 
```

This plot shows that not only does `imdb_rating` have a high marginal inclusion probability, it has a large beta as well. For every one point increase in the `imdb_rating`, an increase of about 15 points in `audience_score` is expected. Thus, any changes in `imdb_rating` will have a large effect on the model's predictions.

<br>

### Model Diagnostics

Next, the residuals are checked to evaluate the bias and variance in the model's predictions:
```{r resid-1}
# Residuals vs fitted plot
plot(movies.bas, ask=F, which=1)
```
For most of the fitted values, the residuals are centered around zero, and the model appears to show a decrease in residual variance at is higher predicted values. However, for fitted values less than 40, the model tends to underestimate the audience scores of the movies with high variance and this bias gets progressively worse the lower the predicted value. The plot labels three possible outliers in this region as well.

<br>

From the exploratory linear fit ggpairs plot in the "Linear fit checks" section, a concern was raised as to the departure from linearity for some of the variables at extreme values. One of those variables was `imdb_rating`. Since `imdb_rating` is such a highly weighted variable in the model with a large beta, it is worth checking if issues with this variable are contributing to the bias and variance issues in the model's residual plot. A residuals versus explanatory variable plot is used to evaluate this concern:
```{r resid-imdb}
fit <- data.frame(resid = movies$audience_score - predict(movies.bas,estimator = "BMA")$fit) 
ggplot(fit, aes(x=movies$imdb_rating, y = resid))+ 
    geom_point() + 
    geom_smooth(color="red", lwd = 0.5, se = FALSE, span = 0.5)+
    geom_vline(xintercept = 4.6, color = "blue")+
    theme(plot.title = element_text(hjust = 0.5)) +
    labs(title = "Residuals vs IMDB Rating Plot",x="IMDB Rating",y = "Residuals") +
    theme_bw()
```
This plot shows a similar residual distribution as the model's residual versus fit plot with the underestimation of `audience_score` problem for movies with a `imdb_rating` of less than about 4.6 points (marked on the plot with a blue vertical line). It is possible to write this problem off as most movies do not have this low of a `imdb_rating`. However, since the model's possible outliers lie in this region and one may want a more robust model with predictive capability for movies with these low scores, this analysis will attempt to correct for this issue. 

<br>


### Investigate a Model Adjustment

The ideal solution to this problem is to process more data. A larger sample size would contain more instances of movies at these values and improve the model's predictions for this range of values. Nonetheless, since the data was supplied in an already collected from for this project, the assumption is that new data collection is not possible as this would require a completely new sample to be created. That said, there are other options to address this issue given the data at hand. 

It is possible to address the underestimation problem by giving the model building process access to another variable that can boost the predicted scores for movies with an `imdb_rating` of less than 4.6. The bias in the prediction gets progressively worse the lower the `imdb_rating` and this progressive deterioration of the model's prediction bias appears to occur in a linear fashion. Thus, a new variable will increase linearly as the `imdb_rating` shrinks below 4.6. The new variable will be called `imdb_rating_lo` and is derived from `imdb_rating` in the following way: 

```{r adj-model}
movies <- mutate(movies, imdb_rating_lo = ifelse(imdb_rating < 4.6, 4.6 - imdb_rating, 0))
```
While this process does introduce the assumption into the model that movies with low `imdb_ratings` are somehow different than other movies, this new variable is based entirely on an existing explanatory variable alone and is just a more complex version than variables such as `oscar_season` where the BAS process is able to group the movies by subset values of existing explanatory variables.

<br>

### Fit the Adjusted BAS Model

Using the `BAS` package, a Bayesian regression is fit to the data with the new variable in the same fashion as the first BAS fit.
```{r refit-model}
movies_adj.bas =  bas.lm(audience_score ~ ., 
                   data=movies,
                  initprobs = "eplogp",
                   prior="BIC",
                   modelprior=uniform()) 
```

<br>

### Check the Adjusted Model's Coefficients

Again, the coefficients for each explanatory variable, or the beta values, are reviewed to find out what explanatory variables the Bayesian models favor the most. 
```{r coef-check-2}
coefs <- coef(movies_adj.bas, estimator = "BMA")
# find posterior probabilities 
coefs_bas <- data.frame(parameter = coefs$namesx, post_mean = coefs$postmean, post_SD = coefs$postsd, post_pne0 = coefs$probne0) %>% arrange(post_pne0) %>% filter(parameter != "Intercept")
coefs_bas$parameter <- factor(coefs_bas$parameter, levels = coefs_bas$parameter[order(coefs_bas$post_pne0, decreasing = TRUE)])
high_pne0 <- data.frame(parameter = coefs_bas$parameter, post_pne0 = coefs_bas$post_pne0) %>% filter(post_pne0 > 0.5)
# Plot the data
ggplot(coefs_bas, aes(x = parameter, y = post_pne0)) + 
    geom_pointrange(aes(ymax = post_pne0), ymin = 0) +
    geom_pointrange(data=high_pne0, aes(x = parameter, y = post_pne0, ymax = post_pne0), ymin = 0, color = "red") +
    geom_hline(yintercept = 0.5, color = "red") +
    labs(title = "Posterior Marginal Inclusion Probabilities of Explanatory Variables",x="Explanatory Variable",y = "Marginal Inclusion Probability") +
    theme(axis.text.x = element_text(angle = 60, hjust = 1), plot.title = element_text(hjust = 0.5))
```
With the derived variable `imdb_rating_lo` included, in addition to `imdb_rating`, both `runtime` and `imdb_rating_lo` now have a high inclusion probability in the BMA and `critics_score` does not.

<br>

The coefficients are plotted by their value under the BMA. The variables with marginal inclusion probabilities greater than 0.5 are highlighted in red. The credible intervals for each beta value is also included using a point range plot.
```{r beta-2}
# find credible intervals for betas
coefs_beta <- data.frame(confint(coefs, c(2:18))[,])
coefs_beta$parameter <- rownames(coefs_beta)
rownames(coefs_beta) <- NULL
coefs_bas <- coefs_bas %>% left_join(coefs_beta)
coefs_bas$parameter <- factor(coefs_bas$parameter, levels = coefs_bas$parameter[order(coefs_bas$beta)])
high_pne0 <- high_pne0 %>% left_join(coefs_bas)
# Plot the Data
ggplot(coefs_bas, aes(y = beta, x = parameter)) + 
    geom_pointrange(aes(ymax = X97.5..., ymin = X2.5...)) +
    geom_point(data=high_pne0, aes(y = beta, x = parameter), color = "red", size = 2.5) +
    geom_hline(yintercept = 0, color = "red") +
    labs(title = "Beta Values with 95% Credible Intervals",x="Explanatory Variable",y = "Beta Value") +
    coord_flip() 
```
This plot shows that both `imdb_rating` and `imdb_rating_lo` have large betas and high marginal inclusion probabilities. Moreover, the betas for these two variables are larger than they were in the unadjusted model. 


### Adjusted Model Diagnostics

Now, for the moment of truth. The residuals are checked to determine any bias and variance issues with the model:
```{r resid-fit-2}
# Residuals vs fitted plot
plot(movies_adj.bas, ask=F, which=1)
```
Comparing this plot to the unadjusted model's residual plot, the increasing underestimation problem for low `imdb_rating` values has been eliminated. The average of the residuals for all predicted values is now zero signifying that there is no longer any major prediction bias in the model for this dataset. However, the complexity of the model has increased with the addition of a new variable and it is not clear that the variance in the model is unaffected by this adjustment. One should never adjust a model to decrease bias at the significant expense of variance when constructing models even though one's intuition may be to minimize bias alone. Thus, the error of the models will be assessed by comparing the predictive accuracy of `movies.bas` versus `movies_adj.bas` before conducting the final model diagnostics to ensure the adjusted model is an appropriate improvement over the original model.

<br>

### Comparing the Original Model to the Unadjusted Model

A cross-validation and resampling method will be used to compare the two model's accuracies. This process will facilitate the selection between the two models, `movies.bas` and `movies_adj.bas`. The cross-validation will follow a similar method used in this course's Bayesian Linear Regression supplemental Rstudio notebook. The data is split into training (90%) and test sets (10%) and the root mean-squared-errors (RMSEs) are determined from sixty different 90-10 splits of the data for the BMA predictions of each model. The RMSE estimate for the model will approximate the prediction error of the model for different datasets. The code will only be shown for the cross-validation of `movies.bas` as the same code with different randomization seed will be used to run the cross-validation for both models with the new variable, `imdb_rating_lo`, withheld or not depending on which model is being tested.

```{r unadj-boot, message=FALSE, warning=FALSE, cache=TRUE}
set.seed(123)

n = nrow(movies)
n_cv = 60
ape = matrix(NA, ncol=2, nrow=n_cv) # intialize output matrix
colnames(ape) = c("BMA", "AVG_RMSE")
# pb <- txtProgressBar(min = 0, max = n_cv, style = 3) # intialize progress bar
for (i in 1:n_cv) {
  # setTxtProgressBar(pb, i) # Update progress bar
  train = sample(1:n, size=round(.90*n), replace=FALSE) # Determine data split
  movies_train = movies[train,-18]
  movies_test = movies[-train,-18]

  bma_train_movies = bas.lm(audience_score ~ ., data=movies_train, 
                          prior="BIC", modelprior=uniform(), initprobs="eplogp") # Fit model to just training data
  yhat_bma = predict(bma_train_movies, movies_test, estimator="BMA")$fit # predict test data from train model
  ape[i, "BMA"] = cv.summary.bas(yhat_bma, movies_test$audience_score)
  ape[i, "AVG_RMSE"] = NA
  if (i > 1) {
      ape[i, "AVG_RMSE"] = mean(ape[, "BMA"], na.rm = TRUE)
    }
  
}

ape_nadj <- ape
```

```{r adj-boot, message=FALSE, warning=FALSE, include=FALSE, cache=TRUE, echo=FALSE}
set.seed(321)

n = nrow(movies)
n_cv = 60
ape = matrix(NA, ncol=2, nrow=n_cv)
colnames(ape) = c("BMA", "AVG_RMSE")
pb <- txtProgressBar(min = 0, max = n_cv, style = 3)
for (i in 1:n_cv) {
  setTxtProgressBar(pb, i)
  train = sample(1:n, size=round(.90*n), replace=FALSE)
  movies_train = movies[train,]
  movies_test = movies[-train,]

  bma_train_movies = bas.lm(audience_score ~ ., data=movies_train, 
                          prior="BIC", modelprior=uniform(), initprobs="eplogp")
  yhat_bma = predict(bma_train_movies, movies_test, estimator="BMA")$fit
  ape[i, "BMA"] = cv.summary.bas(yhat_bma, movies_test$audience_score)
  ape[i, "AVG_RMSE"] = NA
  if (i > 1) {
      ape[i, "AVG_RMSE"] = mean(ape[, "BMA"], na.rm = TRUE)
    }
  
}
```

First, the data from the cross-validation tests is combined and formatted so that the data is readily visualized with `ggplot2`. then, summary statistics are calculated for the test data. 
```{r run-plot}
ape_dat <- rbind(data.frame(run = 1:60, RMSE = ape_nadj[,1], running_RMSE = ape_nadj[,2], source = "Not Adj."), data.frame(run = 1:60, RMSE = ape[,1], running_RMSE = ape[,2], source = "Adj."))
ape_dat$source <- factor(ape_dat$source)
ape_dat$run <- factor(ape_dat$run)
# Find the n, mean, variance, and median for the RMSE for both models
sum_table <- ape_dat %>% 
            tbl_df %>%
            group_by(source) %>%  
            summarize(n = n(), Mean = mean(RMSE, na.rm = TRUE), 
                      Variance = var(RMSE, na.rm = TRUE), Median = median(running_RMSE, na.rm = TRUE))
# plot th data
ggplot(ape_dat, aes(x=run, y = running_RMSE, color = source)) +
    geom_point() +
    geom_hline(yintercept = sum_table[[1,5]], color = "#F8766D", lty = 2) +
    geom_text(aes(x = nrow(ape_dat)/2-4, y = sum_table[[1,5]]+0.05*sqrt(sum_table[[1,3]])), label = signif(sum_table[[1,5]], digits = 4), color = "#F8766D") +
    geom_hline(yintercept = sum_table[[2,5]], color = "#00BFC4", lty = 2) +
    geom_text(aes(x = nrow(ape_dat)/2-4, y = sum_table[[2,5]]+0.05*sqrt(sum_table[[2,3]])), label = signif(sum_table[[2,5]], digits = 4), color = "#00BFC4") +
    theme(plot.title = element_text(hjust = 0.5)) +
    labs(title = "Running Cross Validation RMSE Estimate by Iteration Number",x="Iteration Number",y = "Running Mean RMSE Estimate", color = "Model") +
    theme_bw()

```
The first plot of the data compares the mean estimates for the RMSE for each model as the number of iterations of the cross-validation increases to 60. The median is plotted as a dotted line for each running RMSE estimate for each model. This line is meant to represent a number near to the likely convergence value of the mean RMSE estimate for each model, or its central limit (the median is used since these central limits often begin as a skewed distribution of RMSE estimates). This plot is useful as a check on the mean estimation process as more iterations should cause convergence at a single value for the RMSE. The plots confirm that this did indeed happen after 60 iterations for both sets of data. 

<br>

Next, the distributions of each RMSE estimate is plotted for each model to compare the likely RMSE values.
```{r final-cv-report, fig.width = 8}
boxplot <- ggplot(data = ape_dat, aes(x = source, y = RMSE, fill = source, color = source)) +
  geom_boxplot(alpha =0.6) + 
    theme(legend.position=c(0.8, 0.8), plot.title = element_text(hjust = 0.5)) +
    labs(title = "Boxplot of RMSE by model", y = "RMSE", x = "model", fill = "model",color = "model")

density <- ggplot(ape_dat, aes(RMSE)) + geom_density(aes(fill = source,color = source), 
                                                      alpha = 0.6) +
        geom_vline(xintercept = sum_table[[1,3]], color = "#F8766D") +
    geom_vline(xintercept = sum_table[[1,3]] + sqrt(sum_table[[1,4]]), color = "#F8766D", lty = 2, alpha =0.6) +
    geom_vline(xintercept = sum_table[[1,3]] - sqrt(sum_table[[1,4]]), color = "#F8766D", lty = 2, alpha =0.6) +
    geom_vline(xintercept = sum_table[[2,3]] , color = "#00BFC4") +
    geom_vline(xintercept = sum_table[[2,3]]  + sqrt(sum_table[[2,4]]), color = "#00BFC4", lty = 2, alpha =0.6) +
    geom_vline(xintercept = sum_table[[2,3]]  - sqrt(sum_table[[2,4]]), color = "#00BFC4", lty = 2, alpha =0.6) +
    theme(legend.position=c(0.8, 0.8), plot.title = element_text(hjust = 0.5)) +
    labs(title = "Density of RMSE by model", y = "P(RMSE)", x = "RMSE", fill = "model",color = "model") 

grid.arrange(boxplot, density, ncol=2)
```
As these plots show, it appears that the adjusted model has a lower RMSE value.
<br>

A quick Bayesian inference test will determine if this difference is significant given the distribution of RMSE estimates from the cross validation:

```{r bayes-inf, message=FALSE, warning=FALSE}
bayes_inference(y = RMSE, x = source, data = ape_dat, statistic = "mean", type = "ht", null = 0, alternative = "twosided",show_plot=FALSE)
bayes_inference(y = RMSE, x = source, data = ape_dat, statistic = "mean", type = "ci", null = 0, alternative = "twosided", show_summ = FALSE)
```
The posterior probability that the two mean RMSE estimates are not equal is 1, thus the adjusted model has an RMSE that is -1.20 points lower than the RMSE for the original model with a 95% credible interval of -0.80 to -1.60 points for this estimated reduction in error. Since the adjusted model offers a significant improvement in predicted performance, it is selected over the original model.

<br>

### Final Adjusted Model Diagnostics

Continuing with the diagnostics for `movies_adj.bas`, the residuals are plotted to check for a normal distribution:
```{r qqplot}
# Quantile-Quantile Plot of Residuals
test <- predict(movies_adj.bas,estimator = "BMA")
resid <- movies$audience_score - test$fit
mu_resid <- mean(resid)
sd_resid <- sd(resid)
std_resid <- (resid-mu_resid)/sd_resid
par(mfrow=c(1,2))
qqnorm(std_resid, lty = 2)
qqline(std_resid)
plot(density(std_resid), main="Probability Density of Std. Residuals", 
  	xlab="Std. Residuals", ylab="P(Std. Residuals)")

```
The quantile-qunatile and density plots of the standardized residuals confirms the fitted vs residuals plot. For the majority of fitted values, the standardized residuals are normally distributed around zero. In assessing the tendency of the model's residuals, there is a slight right skew to the residuals which suggests the model tends to underestimate the `audience_score` rather than overestimate it. However, this is a relatively small departure from normality compared to the overall normal distribution of the residuals and does not invalidate the model.

<br>

Finally, the residuals are checked for constant variance across fitted values using a scale-location plot: 
```{r scale-loc}
sqrt_std_resid <- sqrt(abs(std_resid))
plot_dat <- data.frame(fitted = test$fit, resid = resid, imdb_rating = movies$critics_score,sqrt_std_resid = sqrt_std_resid)
ggplot(plot_dat, aes(x = fitted, y = sqrt_std_resid)) + geom_point() + 
    geom_smooth(color= "red", se = FALSE, lwd = 0.5) + 
    labs(title = "Scale-Location Plot for Adj. Model", y = "Sqrt(Std. Residuals)", x = "Fitted values") +
    theme(plot.title = element_text(hjust = 0.5)) +
    theme_bw() + 
    xlim(15, 101)
```
While there is not a completely straight line across all the fitted residuals in the scale-location plot, overall this plot supports homoskedasticity in the model. There is slightly less variance for larger predicted values, but this could be due to a lack of movies in the data that have a very high `audience_score` rather than a serious heteroskedasticity in the model.

<br>

### Outlier Check

Finally, to fully diagnose the model, the model is checked for outliers. The outlier check uses the process outlined in this course's Bayesian Linear Regression supplemental Rstudio notebook. An outlier model is fitted to determine which specific rows in the data are likely to constitute outliers by being assigned a coefficient just for that individual row. Rows with a marginal inclusion probability greater than 0.5 will be listed in this code's results.
```{r outlier-fit, message=FALSE, warning=FALSE, results="asis", cache = TRUE}
set.seed(40)
n = nrow(movies)
movies_outliers = cbind(movies, diag(1, nrow=n))
outliers_movies = bas.lm(audience_score ~ ., data=movies_outliers, 
                        prior="BIC", a=n,
                        modelprior=tr.beta.binomial(a=1, b=1, trunc=n/2),
                        method="MCMC",
                        initprobs="marg-eplogp",
                        MCMC.iterations=2*10^6
                        )

pandoc.table(outliers_movies$namesx[outliers_movies$probne0 > .5])
```
From the outlier check, it appears rows 126, 216, and 251 are the significant outliers for the adjusted model.
<br>

The data for these rows is checked for any significant errors or obvious patterns that may suggest why there are such large residuals for these rows.

```{r outliers}
movies[c(126,216,251),]
```
No obvious errors in the data are present, suggesting that these movies should remain in the data and are valid movies. The likely reason why the model fails to predict the `audience_score` for these rows is the large differences between the movies' `imdb_rating` (when multiplied by ten to reach the same scale) and their `audience_score`. For some reason, users of Rotten Tomatoes give much different scores for these movies than users of the IMDB website. If it were possible to acquire more data, problems like these would be worth investigating, to see how common such a disagreement is between the two ratings. At just 3 out of 651 movies in this dataset, the probability of coming across such a movie is extremely small. Thus, these outliers do not threaten the validity of this model unless new data can prove such a problem is more common than 3 out of 651 movies.

<br>


### Final Model Summary

Since the model is considered valid and useful for predicting `audience_score`, the model is summarized to get a sense of what variables comprise the BMA.
```{r summary}
movies_adj.bas
summary(movies_adj.bas)
```
The coefficients have already been summarized in the "Check the Adjusted Model's Coefficients" section. The model posterior probabilities are interesting to note and a visual representation of the models and their included variables helps one to get a sense of the BMAs composition.
<br>

```{r image-model}
image(movies_adj.bas, rotate = FALSE)
```

The BMA is dominated by the first model with a posterior probability over double that of the second highest probability model. Thus, the performance of the Bayesian model averaging model (BMA), the highest probability model (HPM), the median probability model (HPM), and the best predictive model (BPM) are all expected to be similar. The BMA is the final model choice in this analysis since the BMA accounts for the greatest uncertainty in prediction and model fitting and, thus, is a good conservative choice if there are no performance or other reasons to prefer the models involving selection of specific variables and, thus, accounting for less uncertainty, the BPM, HPM, and MPM.


* * *

## Part 5: Prediction

For this part, the `audience_score` of two movies from 2016 are predicted using the adjusted BMA model.

The first movie to predict is "Arrival". The actual audience score for "Arrival" is an 82 on Rotten Tomatoes. The data for the `new_movie` data frame with "Arrival's" information was sourced from:
http://www.imdb.com/title/tt2543164/?ref_=nv_sr_1 and https://www.rottentomatoes.com/m/arrival_2016 on May, 22, 2016.

```{r predict,results="asis", cache=TRUE}
# Data for arrival
new_movie = data.frame(audience_score = 82, feature_film = "yes", drama = "yes", runtime=116, mpaa_rating_R = "no",thtr_rel_year = 2016, oscar_season = "yes", summer_season = "no", imdb_rating = 8.0, imdb_num_votes = 337591,critics_score=94, best_pic_nom = "yes", best_pic_win = "no", best_actor_win = "no", best_actress_win = "no", best_dir_win = "no", top200_box = "no")
new_movie <- mutate(new_movie, imdb_rating_lo = ifelse(imdb_rating < 4.6, 4.6-imdb_rating, 0))

prediction <- predict(movies_adj.bas, new_movie, estimator = "BMA", se.fit=TRUE)
ci_audience <- confint(prediction, parm="pred")
RMSE <- sum_table[[2,3]]
row1<- data.frame(type="95% cred.", lwr = ci_audience[1], pred = ci_audience[3], upr = ci_audience[2]) 
row2 <- data.frame(type ="RMSE", lwr = ci_audience[3] - RMSE, pred = ci_audience[3], upr = ci_audience[3] + RMSE)
pandoc.table(rbind(row1,row2))
```
The predicted `audience_score` for "Arrival" was 89.53 and both the 95% credible interval and predicted RMSE (from the cross-validation estimate) contain the actual `audience_score` value, 82. Thus, the model's prediction was not unusual and within the expectations for its accuracy. 

<br>

The first movie is "Suicide Squad". The actual audience score for "Suicide Squad" is an 62/100 on Rotten Tomatoes. The data for the `new_movie` data frame with "Suicide Squad's" information was sourced from:
http://www.imdb.com/title/tt1386697/?ref_=nv_sr_1 and https://www.rottentomatoes.com/m/suicide_squad_2016 on May, 22, 2016.
```{r predict-2, message=FALSE, warning=FALSE, results="asis", cache=TRUE}
new_movie = data.frame(audience_score = 62, feature_film = "yes", drama = "no", runtime=123, mpaa_rating_R = "no",thtr_rel_year = 2016, oscar_season = "no", summer_season = "no", imdb_rating = 6.2, imdb_num_votes = 391684,critics_score=25, best_pic_nom = "no", best_pic_win = "no", best_actor_win = "no", best_actress_win = "no", best_dir_win = "no", top200_box = "no", imdb_rating_lo = "no")
new_movie <- mutate(new_movie, imdb_rating_lo = ifelse(imdb_rating < 4.6, 4.6-imdb_rating, 0))

prediction <- predict(movies.bas, new_movie, estimator = "BMA", se.fit=TRUE)
ci_audience <- confint(prediction, parm="pred")
row1<- data.frame(type="95% cred.", lwr = ci_audience[1], pred = ci_audience[3], upr = ci_audience[2]) 
row2 <- data.frame(type ="RMSE", lwr = ci_audience[3] - RMSE, pred = ci_audience[3], upr = ci_audience[3] + RMSE)
pandoc.table(rbind(row1,row2))
```
The predicted `audience_score` for "Suicide Squad" was 55.69 and both the 95% credible interval and predicted RMSE (from the cross-validation estimate) contain the actual `audience_score` value, 62. Thus, the model's prediction was not unusual and within the expectations for its accuracy.


* * *

## Part 6: Conclusion

Using Bayesian regression, it is possible to predict the Rotten Tomatoes Audience Score from the variables provided in the dataset and the new, derived variables given by the project requirements. While the BMA takes into account all variables, `imdb_rating` has a particularly strong relationship with `audience_score` and the BMA is weighted heavily by this variable. However, the first BMA fit to the data involved bias at low values of `imdb_rating` and underestimated the `audience_score` for most values of `imdb_rating` under 4.6. Thus, a new, derived variable was added to adjust the model for this issue and to increase the predictions for movies with low `imdb_rating` values. This adjustment did remove the low-end bias in the model and was compared to the original BMA by using a cross-validation procedure to estimate the RMSE of both models. The adjusted model did decrease bias and did not cause an increase in prediction error, so it was accepted as the final model. In diagnosing the outliers in the model's predictions, it appears the model fails to predict the `audience_score` when the `imdb_rating` times ten is not close in value to the `audience_score` signifying a disagreement in ratings between users of IMDB and Rotten Tomatoes. However, there are only a few movies with this issue in the data. Otherwise, the model makes predictions with an expected average error of about ±9 points with large 95% credible intervals.  

From this analysis, it is clear more information is needed to improve the accuracy of predicting a movie's popularity. Larger sample sizes may help, but even better changes would include advanced sampling designs that control for sources of confounding in the data and make the sample more representative of the typical movies being released. In addition, more specific and distinguishing features of movies should be added to the data. More factors that detail the general outline of what happens in the movie's story, or how much was spent on advertising for the film are possible features that affect popularity. Other questions such as if there are popularity effects for movie sequels are important to consider as well. Finally, greater representations of movies with extreme ratings by website users and where IMDB and Rotten Tomato users disagreed by a significant amount should be targeted in any future studies to address to edge cases where this model fails to predict the `audience_score`.