Commit

update readme
aedobbyn committed Jun 21, 2017
1 parent 7f9ab13 commit 83b1c83
Showing 15 changed files with 8,318 additions and 36 deletions.
20 changes: 20 additions & 0 deletions README.md
@@ -0,0 +1,20 @@
#### Overview

This is a preliminary, strictly for-fun foray into beer data science. Pairs well with most session IPAs.

All beer data was grabbed from the [BreweryDB API](http://www.brewerydb.com/developers) and dumped into a MySQL database. You can find the **main report in `compile.md`**.

The main question I went into the analysis with was: how well do beer styles actually describe the characteristics of beers within each style?

![](./brews.jpg)


#### Reproduce it

To grab the data yourself, you can create an API key on BreweryDB and run the `run_it.R` script inside the `run_it` folder. For a quicker but less up-to-date solution (the BreweryDB database is updated pretty frequently), feel free to download `beer_necessities.csv`.

This analysis deals mainly with beer and its constituent components like ingredients (hops, malts) and other characteristics like bitterness and alcohol content. However, you can easily construct your own function for grabbing other things like breweries, glassware, locations, etc. by running the function generator in `construct_funcs.R`.


Any and all feedback is more than welcome. Cheers!

113 changes: 78 additions & 35 deletions compile.Rmd
@@ -81,6 +81,7 @@ source("./most_popular_styles.R")


**Getting Beer**

~ The age-old dilemma ~

* The BreweryDB API returns a certain number of results per page; if we want
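A paging loop along these lines could collect every page (a sketch only: the `p` query parameter and `numberOfPages` response field follow the BreweryDB v2 docs, but double-check the specifics against the real API):

```{r, eval=FALSE, echo=TRUE}
library(httr)
library(jsonlite)
library(dplyr)

# Sketch of paging through the BreweryDB /beers endpoint.
# The `p` query parameter and `numberOfPages` field match the v2 API docs,
# but treat the specifics here as illustrative.
get_all_beers <- function(api_key) {
  base_url <- "http://api.brewerydb.com/v2/beers"
  fetch_page <- function(p) {
    resp <- GET(base_url, query = list(key = api_key, p = p))
    fromJSON(content(resp, as = "text"), flatten = TRUE)
  }
  first_page <- fetch_page(1)
  # Use the reported page count to request the remaining pages
  all_pages <- lapply(seq_len(first_page$numberOfPages),
                      function(p) fetch_page(p)$data)
  bind_rows(all_pages)
}
```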
@@ -347,13 +348,15 @@ cluster_on <- c("abv", "ibu", "srm")
to_scale <- c("abv", "ibu", "srm")
response_vars <- c("name", "style", "styleId", "style_collapsed")
```


```{r, echo=TRUE}
clustered_beer <- cluster_it(df = popular_beer_dat,
preds = cluster_on,
to_scale = to_scale,
resp = response_vars,
n_centers = 10)
```


@@ -446,28 +449,32 @@ abv_ibu_clusters_vs_style_centers

### Ingredients

To get more granular with ingredients, we can split out each individual ingredient into its own column. If a beer or style contains that ingredient, its row gets a 1 in that ingredient column and a 0 otherwise.

* Join the clustered beer on our main `beer_necessities` dataframe

```{r, eval=TRUE, echo=TRUE}
clustered_beer_necessities <- clustered_beer %>%
inner_join(beer_necessities)
```


* This function takes a dataframe and two other parameters set at the outset:
* `ingredient_want`: this can be `hops`, `malt`, or other ingredients like `yeast` if we pull that in
* `grouper`: can be a vector of one or more things to group by, like beer `name` or `style`

* We've already split the numbered ingredient names out from the concatenated string into columns like `malt_name_1`, `malt_name_2`, etc.

* Now we need to find the range of these columns; there will be a different number of malt columns than hops columns, for instance
* The first one will be `<ingredient>_name_1`
* From this we can find the index of this column in our dataframe
* We get the name of the last one with the `get_last_ing_name_col()` function
* Then we save a vector of all the ingredient column names in `ingredient_colnames`
* We make this variable global to the function because it will stay constant even if the indices change when we select out certain columns

* `to_keep_col_names` is a vector of all non-ingredient column names that we'll want to keep


* Inside `gather_ingredients()` we:
* Take out superfluous column names that are not in `to_keep_col_names` or one of the ingredient columns
* Find what the new ingredient column indices are, since they'll have changed after we pared down
* Actually do the gathering: lump all of the ingredient columns (e.g., `hops_name_1`) into one long column, `ing_keys`, and all the actual ingredient names (e.g., Cascade) into `ing_names`
Expand All @@ -477,18 +484,19 @@ clustered_beer_necessities <- clustered_beer %>%
* We'll use this vector of ingredient levels in `select_spread_cols()` below

* Then we spread the ingredient names
* We take what was previously the `value` in our gathered dataframe, the actual ingredient names (Cascade, Centennial), and make that our `key`; it'll form the new column names
* The new `value` is `count`; it'll populate the row cells
* If a given row has a certain ingredient, it gets a 1 in the corresponding cell, an NA otherwise
* We add a unique identifier for each row with `row`, which we'll drop later (see [Hadley's SO comment](https://stackoverflow.com/questions/25960394/unexpected-behavior-with-tidyr))

* Then we do the final step and group by the groupers
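On a toy dataframe (beer and hop names made up), the gather-then-spread move above looks like this:

```{r, eval=TRUE, echo=TRUE}
library(tibble)
library(tidyr)
library(dplyr)

# Toy version of the gather/spread steps (beer and hop names made up)
toy <- tibble(name        = c("Beer A", "Beer B"),
              hops_name_1 = c("Cascade", "Simcoe"),
              hops_name_2 = c("Centennial", NA))

toy_gathered <- toy %>%
  gather(key = ing_keys, value = ing_names, hops_name_1:hops_name_2) %>%
  filter(!is.na(ing_names))

toy_spread <- toy_gathered %>%
  mutate(count = 1,
         row = row_number()) %>%   # unique row id sidesteps duplicate-key errors
  spread(key = ing_names, value = count) %>%
  select(-row, -ing_keys)
```

A final group-and-sum on `name` would then collapse Beer A's Cascade and Centennial rows into a single row with a 1 in each of those ingredient columns, which is what the grouping step in the real function takes care of.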


```{r, eval=TRUE, echo=TRUE}
pick_ingredient_get_beer <- function (ingredient_want, df, grouper) {
# ingredient_want <- ingredient_want
# ----------------------- Setup --------------------------- #
# First ingredient
first_ingredient_name <- paste(ingredient_want, "_name_1", sep="")
@@ -513,9 +521,10 @@ pick_ingredient_get_beer <- function (ingredient_want, df, grouper) {
# Non-ingredient column names we want to keep
to_keep_col_names <- c("cluster_assignment", "name", "abv", "ibu", "srm", "style", "style_collapsed")
# -------------------------------------------------------------------------------#
# ----------------------------- Gather columns --------------------------------- #
gather_ingredients <- function(df, cols_to_gather) {
to_keep_indices <- which(colnames(df) %in% to_keep_col_names)
@@ -535,6 +544,7 @@ pick_ingredient_get_beer <- function (ingredient_want, df, grouper) {
df_gathered
}
beer_gathered <- gather_ingredients(clustered_beer_necessities, ingredient_colnames) # ingredient colnames defined above function
# ------------------------------------------------------------------------------- #
# Get a vector of all ingredient levels
beer_gathered$ing_names <- factor(beer_gathered$ing_names)
@@ -546,7 +556,7 @@ pick_ingredient_get_beer <- function (ingredient_want, df, grouper) {
beer_gathered$ing_names <- as.character(beer_gathered$ing_names)
# ------------------------------- Spread columns -------------------------------- #
spread_ingredients <- function(df) {
df_spread <- df %>%
mutate(
@@ -559,8 +569,10 @@ pick_ingredient_get_beer <- function (ingredient_want, df, grouper) {
return(df_spread)
}
beer_spread <- spread_ingredients(beer_gathered)
# ------------------------------------------------------------------------------- #
# ------------------------- Select only certain columns ------------------------- #
select_spread_cols <- function(df) {
to_keep_col_indices <- which(colnames(df) %in% to_keep_col_names)
to_keep_ingredient_indices <- which(colnames(df) %in% ingredient_levels)
@@ -574,15 +586,15 @@ pick_ingredient_get_beer <- function (ingredient_want, df, grouper) {
return(new_df)
}
beer_spread_selected <- select_spread_cols(beer_spread)
# ------------------------------------------------------------------------------- #
# Take out all rows that have no ingredients specified at all
inds_to_remove <- apply(beer_spread_selected[, first_ingredient_index:last_ingredient_index],
1, function(x) all(is.na(x)))
beer_spread_no_na <- beer_spread_selected[ !inds_to_remove, ]
# ----------------- Group ingredients by the grouper specified ------------------- #
get_ingredients_per_grouper <- function(df, grouper = grouper) {
df_grouped <- df %>%
ungroup() %>%
@@ -613,6 +625,7 @@ pick_ingredient_get_beer <- function (ingredient_want, df, grouper) {
return(per_grouper)
}
# ------------------------------------------------------------------------------- #
ingredients_per_grouper <- get_ingredients_per_grouper(beer_spread_selected, grouper)
return(ingredients_per_grouper)
@@ -662,16 +675,12 @@ beer_ingredients_join <- beer_ingredients_join %>%




Now we're left with something of a sparse matrix of all the ingredients compared to all the beers:
```{r}
kable(beer_ingredients_join[1:20, ])
```

<!-- Per `style_collapsed` -->
<!-- ```{r} -->
<!-- kable(ingredients_per_style_collapsed[1:20, ]) -->
<!-- ``` -->



### Back to clustering: cluster on only 5 styles
@@ -693,11 +702,11 @@ certain_styles_clustered <- cluster_it(df = bn_certain_styles,
to_scale = to_scale,
resp = response_vars,
n_centers = 5)
```


```{r}
kable(table(style = certain_styles_clustered$style_collapsed, cluster = certain_styles_clustered$cluster_assignment))
ggplot() +
geom_point(data = certain_styles_clustered,
@@ -784,17 +793,23 @@ kable(pop_hops_beer_stats)
beer_necessities_w_popular_hops <- beer_necessities_w_hops %>%
filter(hop_name %in% pop_hops_beer_stats$hop_name) %>%
droplevels()
```

Are there certain hops that are used more often in very high IBU or ABV beers?
It's hard to detect a pattern.
```{r, echo = TRUE}
ggplot(data = beer_necessities_w_popular_hops) +
geom_point(aes(abv, ibu, colour = hop_name)) +
ggtitle("Beers Containing most Popular Hops") +
labs(x = "ABV", y = "IBU", colour = "Hop Name") +
theme_minimal()
```

```{r, echo=TRUE}
ggplot(data = pop_hops_beer_stats) +
geom_point(aes(mean_abv, mean_ibu, colour = hop_name, size = n)) +
ggtitle("Most Popular Hops' Effect on Alcohol and Bitterness") +
labs(x = "Mean ABV per Hop Type", y = "Mean IBU per Hop Type", colour = "Hop Name",
size = "Number of Beers") +
theme_minimal()
```
@@ -804,7 +819,7 @@ ggplot(data = pop_hops_beer_stats) +

* Can ABV, IBU, and SRM be used in a neural net to predict `style` or `style_collapsed`?
* In the function, specify the dataframe and the outcome, either `style` or `style_collapsed`; the one not specified as `outcome` will be dropped
* The predictor columns will be everything not specified in the vector `predictor_vars`

```{r, warning=FALSE, echo=TRUE, eval=TRUE, message=FALSE}
@@ -879,7 +894,8 @@ nn_collapsed_out$most_important_vars
```


* What if we predict `style` instead of `style_collapsed`?

```{r, echo=TRUE}
@@ -890,13 +906,33 @@ nn_notcollapsed_out$nn_accuracy
nn_notcollapsed_out$most_important_vars
```


And now if we drop `glass`?
```{r, echo=TRUE}
p_vars_no_glass <- c("total_hops", "total_malt", "abv", "ibu", "srm")
nn_collapsed_out_no_glass <- run_neural_net(df = beer_ingredients_join, outcome = "style_collapsed",
predictor_vars = p_vars_no_glass)
nn_collapsed_out_no_glass$nn_accuracy
nn_collapsed_out_no_glass$most_important_vars
```




### Random forest with all ingredients

* We can use a random forest to get even more granular with ingredients
* The sparse ingredient dataframe was too complex for the multinomial neural net; however, a random forest can handle this many predictors

* Here we don't include `glass` as a predictor

```{r, echo=TRUE}
library(ranger)
@@ -928,16 +964,23 @@ bi_train <- bi_train %>%
bi_rf <- ranger(style_collapsed ~ ., data = bi_train, importance = "impurity")
```

OOB (out of bag) prediction error is around 58%
* This is calculated from tree samples constructed but not used in the training set; these trees effectively become part of the test set
```{r}
bi_rf
```
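The OOB rate printed above is also stored on the fitted object, so it can be pulled out programmatically (assuming the `bi_rf` object from the previous chunk):

```{r, echo=TRUE}
# ranger stores the OOB misclassification rate in `prediction.error`
bi_rf$prediction.error
```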


* Interestingly, ABV, IBU, and SRM are all much more important in the random forest than `total_hops` and `total_malt`
```{r, echo=TRUE}
importance(bi_rf)[1:10]
```


How does a CSRF (case-specific random forest) fare?
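`ranger` ships a `csrf()` helper for this; a call might look like the sketch below (`bi_test` is a hypothetical held-out split, and the parameter lists are illustrative rather than tuned):

```{r, eval=FALSE, echo=TRUE}
# Sketch only: csrf() grows small case-specific forests per test observation.
# `bi_test` is a hypothetical held-out split; parameter values are illustrative.
csrf_preds <- csrf(style_collapsed ~ .,
                   training_data = bi_train,
                   test_data     = bi_test,
                   params1 = list(num.trees = 50, mtry = 4),
                   params2 = list(num.trees = 5))
```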

```{r, echo=TRUE}