Commit

update readme
aedobbyn committed Jun 21, 2017
1 parent 7f9ab13 commit 83b1c83
Showing 15 changed files with 8,318 additions and 36 deletions.
20 changes: 20 additions & 0 deletions README.md
@@ -0,0 +1,20 @@
#### Overview

This is a preliminary, strictly for-fun foray into beer data science. Pairs well with most session IPAs.

All beer data was grabbed from the [BreweryDB API](http://www.brewerydb.com/developers) and dumped into a MySQL database. You can find the **main report in `compile.md`**.

The main question I went into the analysis with was: how well do beer styles actually describe the characteristics of beers within each style?

![](./brews.jpg)


#### Reproduce it

To grab the data yourself, you can create an API key on BreweryDB and run the `run_it.R` script inside the `run_it` folder. For a quicker but less up-to-date solution (the BreweryDB database is updated pretty frequently), feel free to download `beer_necessities.csv`.

This analysis deals mainly with beer and its constituent components like ingredients (hops, malts) and other characteristics like bitterness and alcohol content. However, you can easily construct your own function for grabbing other things like breweries, glassware, locations, etc. by running the function generator in `construct_funcs.R`.


Any and all feedback is more than welcome. Cheers!

113 changes: 78 additions & 35 deletions compile.Rmd
@@ -81,6 +81,7 @@ source("./most_popular_styles.R")


**Getting Beer**

~ The age-old dilemma ~

* The BreweryDB API returns a certain number of results per page; if we want
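A paging loop along these lines could collect every page (a sketch only: the `p` query parameter and `numberOfPages` response field follow the BreweryDB v2 docs, but double-check the specifics against the real API):

```{r, eval=FALSE, echo=TRUE}
library(httr)
library(jsonlite)
library(dplyr)

# Sketch of paging through the BreweryDB /beers endpoint.
# The `p` query parameter and `numberOfPages` field match the v2 API docs,
# but treat the specifics here as illustrative.
get_all_beers <- function(api_key) {
  base_url <- "http://api.brewerydb.com/v2/beers"
  fetch_page <- function(p) {
    resp <- GET(base_url, query = list(key = api_key, p = p))
    fromJSON(content(resp, as = "text"), flatten = TRUE)
  }
  first_page <- fetch_page(1)
  # Use the reported page count to request the remaining pages
  all_pages <- lapply(seq_len(first_page$numberOfPages),
                      function(p) fetch_page(p)$data)
  bind_rows(all_pages)
}
```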
@@ -347,13 +348,15 @@ cluster_on <- c("abv", "ibu", "srm")
to_scale <- c("abv", "ibu", "srm")
response_vars <- c("name", "style", "styleId", "style_collapsed")
```


```{r, echo=TRUE}
clustered_beer <- cluster_it(df = popular_beer_dat,
preds = cluster_on,
to_scale = to_scale,
resp = response_vars,
n_centers = 10)
```


@@ -446,28 +449,32 @@ abv_ibu_clusters_vs_style_centers

### Ingredients

To get more granular with ingredients, we can split out each individual ingredient into its own column. If a beer or style contains that ingredient, its row gets a 1 in that ingredient column and a 0 otherwise.

* Join the clustered beer on our main `beer_necessities` dataframe

```{r, eval=TRUE, echo=TRUE}
clustered_beer_necessities <- clustered_beer %>%
inner_join(beer_necessities)
```


* This function takes a dataframe and two other parameters set at the outset:
* `ingredient_want`: this can be `hops`, `malt`, or other ingredients like `yeast` if we pull that in
* `grouper`: can be a vector of one or more things to group by, like beer `name` or `style`

* We've already split the numbered ingredient names out from the concatenated string into columns like `malt_name_1`, `malt_name_2`, etc.

* Now we need to find the range of these columns; there will be a different number of malt columns than hops columns, for instance
* The first one will be `<ingredient>_name_1`
* From this we can find the index of this column in our dataframe
* We get the name of the last one with the `get_last_ing_name_col()` function
* Then we save a vector of all the ingredient column names in `ingredient_colnames`
* We make this variable global to the function because it will stay constant even if the indices change when we select out certain columns

* `to_keep_col_names` is a vector of all non-ingredient column names that we'll want to keep


* Inside `gather_ingredients()` we:
* Take out superfluous column names that are not in `to_keep_col_names` or one of the ingredient columns
* Find what the new ingredient column indices are, since they'll have changed after we pared down
* Actually do the gathering: lump all of the ingredient columns (e.g., `hops_name_1`) into one long column, `ing_keys`, and all the actual ingredient names (e.g., Cascade) into `ing_names`
Expand All @@ -477,18 +484,19 @@ clustered_beer_necessities <- clustered_beer %>%
* We'll use this vector of ingredient levels in `select_spread_cols()` below

* Then we spread the ingredient names
* We take what was previously the `value` in our gathered dataframe, the actual ingredient names (Cascade, Centennial), and make that our `key`; it'll form the new column names
* The new `value` is `count`; it'll populate the row cells
* If a given row has a certain ingredient, it gets a 1 in the corresponding cell, an NA otherwise
* We add a unique identifier for each row with `row`, which we'll drop later (see [Hadley's SO comment](https://stackoverflow.com/questions/25960394/unexpected-behavior-with-tidyr))

* Then we do the final step and group by the groupers
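On a toy dataframe (beer and hop names made up), the gather-then-spread move above looks like this:

```{r, eval=TRUE, echo=TRUE}
library(tibble)
library(tidyr)
library(dplyr)

# Toy version of the gather/spread steps (beer and hop names made up)
toy <- tibble(name        = c("Beer A", "Beer B"),
              hops_name_1 = c("Cascade", "Simcoe"),
              hops_name_2 = c("Centennial", NA))

toy_gathered <- toy %>%
  gather(key = ing_keys, value = ing_names, hops_name_1:hops_name_2) %>%
  filter(!is.na(ing_names))

toy_spread <- toy_gathered %>%
  mutate(count = 1,
         row = row_number()) %>%   # unique row id sidesteps duplicate-key errors
  spread(key = ing_names, value = count) %>%
  select(-row, -ing_keys)
```

A final group-and-sum on `name` would then collapse Beer A's Cascade and Centennial rows into a single row with a 1 in each of those ingredient columns, which is what the grouping step in the real function takes care of.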


```{r, eval=TRUE, echo=TRUE}
pick_ingredient_get_beer <- function (ingredient_want, df, grouper) {
# ingredient_want <- ingredient_want
# ----------------------- Setup --------------------------- #
# First ingredient
first_ingredient_name <- paste(ingredient_want, "_name_1", sep="")
@@ -513,9 +521,10 @@ pick_ingredient_get_beer <- function (ingredient_want, df, grouper) {
# Non-ingredient column names we want to keep
to_keep_col_names <- c("cluster_assignment", "name", "abv", "ibu", "srm", "style", "style_collapsed")
# -------------------------------------------------------------------------------#
# ----------------------------- Gather columns --------------------------------- #
gather_ingredients <- function(df, cols_to_gather) {
to_keep_indices <- which(colnames(df) %in% to_keep_col_names)
@@ -535,6 +544,7 @@ pick_ingredient_get_beer <- function (ingredient_want, df, grouper) {
df_gathered
}
beer_gathered <- gather_ingredients(clustered_beer_necessities, ingredient_colnames) # ingredient colnames defined above function
# ------------------------------------------------------------------------------- #
# Get a vector of all ingredient levels
beer_gathered$ing_names <- factor(beer_gathered$ing_names)
@@ -546,7 +556,7 @@ pick_ingredient_get_beer <- function (ingredient_want, df, grouper) {
beer_gathered$ing_names <- as.character(beer_gathered$ing_names)
# ------------------------------- Spread columns -------------------------------- #
spread_ingredients <- function(df) {
df_spread <- df %>%
mutate(
@@ -559,8 +569,10 @@ pick_ingredient_get_beer <- function (ingredient_want, df, grouper) {
return(df_spread)
}
beer_spread <- spread_ingredients(beer_gathered)
# ------------------------------------------------------------------------------- #
# ------------------------- Select only certain columns ------------------------- #
select_spread_cols <- function(df) {
to_keep_col_indices <- which(colnames(df) %in% to_keep_col_names)
to_keep_ingredient_indices <- which(colnames(df) %in% ingredient_levels)
@@ -574,15 +586,15 @@ pick_ingredient_get_beer <- function (ingredient_want, df, grouper) {
return(new_df)
}
beer_spread_selected <- select_spread_cols(beer_spread)
# ------------------------------------------------------------------------------- #
# Take out all rows that have no ingredients specified at all
inds_to_remove <- apply(beer_spread_selected[, first_ingredient_index:last_ingredient_index],
1, function(x) all(is.na(x)))
beer_spread_no_na <- beer_spread_selected[ !inds_to_remove, ]
# ----------------- Group ingredients by the grouper specified ------------------- #
get_ingredients_per_grouper <- function(df, grouper = grouper) {
df_grouped <- df %>%
ungroup() %>%
@@ -613,6 +625,7 @@ pick_ingredient_get_beer <- function (ingredient_want, df, grouper) {
return(per_grouper)
}
# ------------------------------------------------------------------------------- #
ingredients_per_grouper <- get_ingredients_per_grouper(beer_spread_selected, grouper)
return(ingredients_per_grouper)
@@ -662,16 +675,12 @@ beer_ingredients_join <- beer_ingredients_join %>%




Now we're left with something of a sparse matrix of all the ingredients compared to all the beers:
```{r}
kable(beer_ingredients_join[1:20, ])
```

<!-- Per `style_collapsed` -->
<!-- ```{r} -->
<!-- kable(ingredients_per_style_collapsed[1:20, ]) -->
<!-- ``` -->



### Back to clustering: cluster on only 5 styles
@@ -693,11 +702,11 @@ certain_styles_clustered <- cluster_it(df = bn_certain_styles,
to_scale = to_scale,
resp = response_vars,
n_centers = 5)
```


```{r}
kable(table(style = certain_styles_clustered$style_collapsed, cluster = certain_styles_clustered$cluster_assignment))
ggplot() +
geom_point(data = certain_styles_clustered,
@@ -784,17 +793,23 @@ kable(pop_hops_beer_stats)
beer_necessities_w_popular_hops <- beer_necessities_w_hops %>%
filter(hop_name %in% pop_hops_beer_stats$hop_name) %>%
droplevels()
```

Are there certain hops that are used more often in very high IBU or ABV beers?
It's hard to detect a pattern.
```{r, echo = TRUE}
ggplot(data = beer_necessities_w_popular_hops) +
geom_point(aes(abv, ibu, colour = hop_name)) +
ggtitle("Beers Containing most Popular Hops") +
labs(x = "ABV", y = "IBU", colour = "Hop Name") +
theme_minimal()
```

```{r, echo=TRUE}
ggplot(data = pop_hops_beer_stats) +
geom_point(aes(mean_abv, mean_ibu, colour = hop_name, size = n)) +
ggtitle("Most Popular Hops' Effect on Alcohol and Bitterness") +
labs(x = "Mean ABV per Hop Type", y = "Mean IBU per Hop Type", colour = "Hop Name",
size = "Number of Beers") +
theme_minimal()
```
@@ -804,7 +819,7 @@ ggplot(data = pop_hops_beer_stats) +

* Can ABV, IBU, and SRM be used in a neural net to predict `style` or `style_collapsed`?
* In the function, specify the dataframe and the outcome, either `style` or `style_collapsed`; the one not specified as `outcome` will be dropped
* The predictor columns will be everything not specified in the vector `predictor_vars`

```{r, warning=FALSE, echo=TRUE, eval=TRUE, message=FALSE}
@@ -879,7 +894,8 @@ nn_collapsed_out$most_important_vars
```


* What if we predict `style` instead of `style_collapsed`?

```{r, echo=TRUE}
@@ -890,13 +906,33 @@ nn_notcollapsed_out$nn_accuracy
nn_notcollapsed_out$most_important_vars
```


And now if we drop `glass`?
```{r, echo=TRUE}
p_vars_no_glass <- c("total_hops", "total_malt", "abv", "ibu", "srm")
nn_collapsed_out_no_glass <- run_neural_net(df = beer_ingredients_join, outcome = "style_collapsed",
predictor_vars = p_vars_no_glass)
nn_collapsed_out_no_glass$nn_accuracy
nn_collapsed_out_no_glass$most_important_vars
```




### Random forest with all ingredients

* We can use a random forest to get even more granular with ingredients
* The sparse ingredient dataframe was too complex for the multinomial neural net; however, a random forest can handle this many predictors

* Here we don't include `glass` as a predictor

```{r, echo=TRUE}
library(ranger)
@@ -928,16 +964,23 @@ bi_train <- bi_train %>%
bi_rf <- ranger(style_collapsed ~ ., data = bi_train, importance = "impurity")
```

OOB (out of bag) prediction error is around 58%
* This is calculated from tree samples constructed but not used in the training set; these trees effectively become part of the test set
```{r}
bi_rf
```
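The OOB rate printed above is also stored on the fitted object, so it can be pulled out programmatically (assuming the `bi_rf` object from the previous chunk):

```{r, echo=TRUE}
# ranger stores the OOB misclassification rate in `prediction.error`
bi_rf$prediction.error
```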


* Interestingly, ABV, IBU, and SRM are all much more important in the random forest than `total_hops` and `total_malt`
```{r, echo=TRUE}
importance(bi_rf)[1:10]
```


How does a CSRF (case-specific random forest) fare?
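`ranger` ships a `csrf()` helper for this; a call might look like the sketch below (`bi_test` is a hypothetical held-out split, and the parameter lists are illustrative rather than tuned):

```{r, eval=FALSE, echo=TRUE}
# Sketch only: csrf() grows small case-specific forests per test observation.
# `bi_test` is a hypothetical held-out split; parameter values are illustrative.
csrf_preds <- csrf(style_collapsed ~ .,
                   training_data = bi_train,
                   test_data     = bi_test,
                   params1 = list(num.trees = 50, mtry = 4),
                   params2 = list(num.trees = 5))
```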

```{r, echo=TRUE}