---
title: Data Science Musings on Beer
author:
  name: Amanda Dobbyn
output:
  html_notebook:
    toc: false
    theme: yeti
  pdf_document:
    keep_tex: true
    toc: false
  github_document:
    toc: true
# output:
#   html_document:
#     keep_md: true
#     toc: false
#     theme: yeti
#   github_document:
#     toc: true
---
```{r, eval=FALSE, echo=FALSE}
# If we need to close all connections
lapply( dbListConnections( dbDriver( drv = "MySQL")), dbDisconnect)
```
```{r setup, include=FALSE}
library(knitr)
library(broom)
# knitr::opts_chunk$set(cache=TRUE)
# knitr::opts_knit$set(root.dir=normalizePath('../'))
knitr::opts_chunk$set(echo = FALSE, warning = FALSE)
knitr::opts_chunk$set(fig.width=12, fig.height=8)
options(knitr.table.format = 'markdown')
```
Check out the [clustering Shiny app](https://amandadobbyn.shinyapps.io/clusterfun/)!
### Motivation and Overview
This is a first pass exploration of different aspects of beer. The data was collected via the [BreweryDB](http://www.brewerydb.com/developers) API. Special thanks to [Kris Kroski](https://kro.ski/) for data ideation and co-membership in the honourable workplace beer consortium.
The main question this analysis is meant to tackle is: Are beer styles actually indicative of shared attributes of the beers within that style? Or are style boundaries more or less arbitrary? I took two approaches to this: unsupervised clustering and supervised prediction.
Clusters defined by the algorithm were compared to the style "centers" as defined by the mean ABV, IBU, and SRM. On the prediction side, predictor variables include ABV (alcohol by volume), IBU (international bitterness units), and SRM ([a measure of color](http://www.twobeerdudes.com/beer/srm)), as well as ingredients like hops and malts. The outcome variable is the style that each beer was assigned.
This document starts off with an explanation of how I sourced beer data from BreweryDB, cleaned that data, and stuck the parts of it I wanted in a database. (These are just the highlights; the code actually executed in this document queries that database, specifically by sourcing the file `read_from_db.R`, also in this repo, rather than hitting the BreweryDB API. This is done for expediency's sake as the code below detailing how to actually get the beer data, run in full in `run_it.R`, takes some time to execute.)
It then moves into clustering (k-means) and prediction (neural net, random forest). Below is a more detailed overview of the general workflow.
### Workflow
**1. Get and Prepare**
We first hit the BreweryDB API to iteratively pull in all beers and their ingredients, along with other things we might want like breweries and glassware. Then we unnest the JSON responses, including all the ingredient columns, and dump it all into a MySQL database.
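A minimal sketch of that last step, assuming the pulled data lives in a data frame like `all_beer_raw`; the database name, table name, and credentials here are placeholders, not the actual schema:
```{r, eval=FALSE, echo=TRUE}
library(DBI)
library(RMySQL)

# Connect to a local MySQL database (names and credentials are placeholders)
con <- dbConnect(RMySQL::MySQL(), dbname = "beer", host = "localhost",
                 user = "user", password = "password")

# Write the unnested API responses to a table, then close the connection
dbWriteTable(con, "all_beer_raw", all_beer_raw, row.names = FALSE, overwrite = TRUE)
dbDisconnect(con)
```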
Next, we create a `style_collapsed` column to reduce the number of levels of our outcome variable. We do this by `grep`ing through each beer's style to determine if that style contains a keyword that qualifies it to be rolled into a collapsed style; if it does, it gets that keyword in a `style_collapsed` column.
Finally, we unnest the ingredients `hops` and `malts` into a wide, sparse dataframe. Individual ingredients are now columns, with each beer still in its own row; a cell gets a 1 if the ingredient is present and a 0 otherwise. This allows more granular inference into ingredients' effects on both style and bitterness (occasioning a short foray into hops).
**2. Short foray into hops**
A quick look at the most popular hops and an exploration of the relationship between hops and bitterness.
**3. Infer**
Cluster: we run unsupervised k-means clustering, partitioning the entire dataset into ten clusters. Next, we cluster a dataset composed of just five selected styles into five clusters.
We then attempt to predict either `style` or `style_collapsed` using a neural net and a random forest. The main predictors are ABV, IBU, SRM, total number of hops, and total number of malts. The glass a beer is served in is also considered.
### Thoughts
The question of what should be a predictor variable for style is a bit murky here. What should be fair game for predicting style and what shouldn't? Characteristics of a beer that are defined *by* its style would seem to be "cheating" in a way. The only "inputs" to a beer we have in our dataset are its ingredients, primarily hops and malts. While these certainly have an effect on its flavor profile, I consider them semi-cheating because if style is determined beforehand, it likely determines at least in part which ingredients are added. The main candidates in my mind are ABV, IBU, and SRM. These are "outputs" of a beer (in the sense that they can only be exactly determined once a beer is brewed) that meaningfully define it. While ABV is correlated with both IBU and SRM, the three are theoretically orthogonal to each other. A style-defined attribute like glass type is a bad candidate for a predictor variable because it is completely decoupled from the beer itself and determined entirely by the style the beer has been assigned to.
The answer thus far seems to be that the beer landscape is more of a spectrum than a collection of neatly differentiated styles. Beer-intrinsic attributes like bitterness aren't great predictors of style. The relative importance of different variables depends on the prediction method used. However, one style-defined attribute, the glass a beer is served in, increased the accuracy of prediction substantially.
Of course, other important aspects of the flavor, body, smell, etc. of the beers could not be considered because this data is not available from BreweryDB.
![](./taps.jpg)
***
```{r, echo=TRUE, message=FALSE}
source("./read_from_db.R")
```
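For reference, a minimal sketch of what a script like `read_from_db.R` might do; the database and table names here are assumptions, not necessarily what the real script uses:
```{r, eval=FALSE, echo=TRUE}
library(DBI)
library(RMySQL)
library(dplyr)

# Connect to the local MySQL database (names and credentials are placeholders)
con <- dbConnect(RMySQL::MySQL(), dbname = "beer", host = "localhost",
                 user = "user", password = "password")

# Read the cleaned beer table back out and keep it as a tibble
beer_necessities <- dbReadTable(con, "beer_necessities") %>% as_tibble()
dbDisconnect(con)
```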
## Get and Prepare Data
**GETting beer, the age-old dilemma**
* The BreweryDB API returns a certain number of results per page; if we want all the beers, we need to request every page
* So, we hit the BreweryDB API and ask for `1:number_of_pages`
* We can change `number_of_pages` to, e.g., 3 if we only want the first 3 pages
* If there's only one page (as is the case for the glassware endpoint), numberOfPages won't be returned, so in this case we set number_of_pages to 1
* The `addition` parameter can be an empty string if nothing else is needed
```{r, eval = FALSE, echo=TRUE}
base_url <- "http://api.brewerydb.com/v2"
key_preface <- "/?key="
paginated_request <- function(ep, addition, trace_progress = TRUE) {
full_request <- NULL
first_page <- fromJSON(paste0(base_url, "/", ep, "/", key_preface, key
, "&p=1"))
number_of_pages <- ifelse(!(is.null(first_page$numberOfPages)),
first_page$numberOfPages, 1)
for (page in 1:number_of_pages) {
this_request <- fromJSON(paste0(base_url, "/", ep, "/", key_preface, key
, "&p=", page, addition),
flatten = TRUE)
this_req_unnested <- unnest_it(this_request) # <- request unnested here
if(trace_progress == TRUE) {message(paste0("Page ", this_req_unnested$currentPage))} # if TRUE, print the page we're on
full_request <- bind_rows(full_request, this_req_unnested[["data"]])
}
return(full_request)
}
all_beer_raw <- paginated_request("beers", "&withIngredients=Y")
```
* The function below unnests the JSON; it's used inside `paginated_request()` above
+ Takes the column named `name` nested within a column in the data portion of the response
+ If the `name` column doesn't exist, it takes the first nested column
* We use something similar to unnest ingredients, like all of a beer's hops and malts, into long strings contained in `hops_name` and `malt_name`
```{r, eval=FALSE, echo=TRUE}
unnest_it <- function(df) {
unnested <- df
for(col in seq_along(df[["data"]])) {
if(! is.null(ncol(df[["data"]][[col]]))) {
if(! is.null(df[["data"]][[col]][["name"]])) {
unnested[["data"]][[col]] <- df[["data"]][[col]][["name"]]
} else {
unnested[["data"]][[col]] <- df[["data"]][[col]][[1]]
}
}
}
return(unnested)
}
```
**Collapse Styles**
* Save the most popular styles in `keywords`
* Loop through each keyword
* For each beer, `grep` through its style column to see if it contains any one of these keywords
* If it does, give it that keyword in a new column `style_collapsed`
* If a beer's style matches multiple keywords (e.g., American Double India Pale Ale matches Double India Pale Ale, India Pale Ale, and Pale Ale), its `style_collapsed` is the **last** of those matches to appear in `keywords`
* This is why keywords are intentionally ordered from most general to most specific
* So for an American Double India Pale Ale: since Double India Pale Ale appears in `keywords` after India Pale Ale and Pale Ale, it gets a `style_collapsed` of Double India Pale Ale
* If no keyword is contained in `style`, `style_collapsed` is just whatever's in `style`; in other words, it doesn't get collapsed into a bigger bucket
* This isn't a huge problem because we'll pare down to just the most popular styles later; however, we could think about creating a catchall "Other" level for `style_collapsed`
A list of our main styles to collapse to:
```{r, echo=TRUE}
keywords <- c("Lager", "Pale Ale", "India Pale Ale", "Double India Pale Ale", "India Pale Lager", "Hefeweizen", "Barrel-Aged","Wheat", "Pilsner", "Pilsener", "Amber", "Golden", "Blonde", "Brown", "Black", "Stout", "Porter", "Red", "Sour", "Kölsch", "Tripel", "Bitter", "Saison", "Strong Ale", "Barley Wine", "Dubbel", "Altbier")
keyword_df <- as_tibble(list(`Main Styles` = keywords))
kable(keyword_df)
```
```{r, eval=FALSE, echo=TRUE}
collapse_styles <- function(df, trace_progress = TRUE) {
df[["style_collapsed"]] <- vector(length = nrow(df))
for (beer in 1:nrow(df)) {
if (grepl(paste(keywords, collapse="|"), df$style[beer])) {
for (keyword in keywords) {
if(grepl(keyword, df$style[beer]) == TRUE) {
df$style_collapsed[beer] <- keyword
}
}
} else {
df$style_collapsed[beer] <- as.character(df$style[beer])
}
if(trace_progress == TRUE) {message(paste0("Collapsing this ", df$style[beer], " to: ", df$style_collapsed[beer]))}
}
return(df)
}
```
* Then we collapse further; right now we just combine all wheaty beers into Wheat and Pils-like beers into Pilsener (with two e's) by `fct_collapse`ing those levels
```{r, echo=TRUE, eval=FALSE}
collapse_further <- function(df) {
df[["style_collapsed"]] <- df[["style_collapsed"]] %>%
fct_collapse(
"Wheat" = c("Hefeweizen", "Wheat"),
"Pilsener" = c("Pilsner", "American-Style Pilsener") # pilsener == pilsner == pils
)
return(df)
}
```
**Split out Ingredients**
When we unnested ingredients, we just concatenated all of the ingredients for a given beer into a long string. If we want, we can split out the ingredients that were concatenated in `<ingredient>_name` with this `split_ingredients` function.
This takes a vector of `ingredients_to_split`, so e.g. `c("hops_name", "malt_name")` and creates one column for each type of ingredient (`hops_name_1`, `hops_name_2`, etc.). It's flexible enough to adapt if data in BreweryDB changes and a beer now has 15 hops where before the maximum number of hops a beer had was 10.
```{r, eval=FALSE, echo=TRUE}
split_ingredients <- function(df, ingredients_to_split) {
ncol_df <- ncol(df)
for (ingredient in ingredients_to_split) {
ingredient_split <- str_split(df[[ingredient]], ", ")
num_new_cols <- max(lengths(ingredient_split))
for (num in 1:num_new_cols) {
this_col <- ncol_df + 1
df[, this_col] <- NA
names(df)[this_col] <- paste0(ingredient, "_", num)
ncol_df <- ncol(df)
for (row in seq_along(ingredient_split)) {
if (!is.null(ingredient_split[[row]][num])) {
df[row, this_col] <- ingredient_split[[row]][num]
}
}
df[[names(df)[this_col]]] <- factor(df[[names(df)[this_col]]])
}
ncol_df <- ncol(df)
}
return(df)
}
```
Some quick summary stats on our main dataframe called `beer_necessities`:
```{r, echo=TRUE}
dim(beer_necessities)
str(beer_necessities)
```
\newpage
**Find the Most Popular Styles**
We find mean ABV, IBU, and SRM per collapsed style and arrange collapsed styles by the number of beers that fall into them. (This is of course dependent on how we collapse styles; if we lumped all Double IPAs in with IPAs then the category IPA would be much bigger than it is if we keep the two separate.)
```{r, eval=TRUE, echo=TRUE}
library(forcats)
# Pare down to only cases where style is not NA
beer_dat_pared <- beer_necessities[complete.cases(beer_necessities$style), ]
# Arrange beer dat by style popularity
style_popularity <- beer_dat_pared %>%
group_by(style) %>%
count() %>%
arrange(desc(n))
# Add a column that scales popularity
style_popularity <- bind_cols(style_popularity,
n_scaled = as.vector(scale(style_popularity$n)))
# Find styles that are above a z-score of 0
popular_styles <- style_popularity %>%
filter(n_scaled > 0)
# Pare dat down to only beers that fall into those styles
popular_beer_dat <- beer_dat_pared %>%
filter(
style %in% popular_styles$style
) %>%
droplevels() %>%
as_tibble()
```
How many rows do we have in our dataset of just beers that fall into the popular styles? (In the original dataset we had `r nrow(beer_necessities)`)
```{r, echo=TRUE}
nrow(popular_beer_dat)
```
Now we find what I'm calling the "style centers" for each of these most popular styles. The center is defined by the mean ABV, mean IBU, and mean SRM of all of the beers in that style.
Styles that appear here but were not among the keywords we collapsed to are popular styles whose names didn't contain any of those keywords.
```{r, echo=TRUE}
# Find the centers (mean abv, ibu, srm) of the most popular styles
style_centers <- popular_beer_dat %>%
group_by(style_collapsed) %>%
add_count() %>%
summarise(
mean_abv = mean(abv, na.rm = TRUE) %>% round(., digits = 2),
mean_ibu = mean(ibu, na.rm = TRUE) %>% round(., digits = 2),
mean_srm = mean(srm, na.rm = TRUE) %>% round(., digits = 2),
n = median(n, na.rm = TRUE) # Median here only for summarise. Should be just the same as n
) %>%
arrange(desc(n)) %>%
drop_na() %>%
droplevels()
# Give some nicer names
style_centers_rename <- style_centers %>%
rename(
`Collapsed Style` = style_collapsed,
`Mean ABV` = mean_abv,
`Mean IBU` = mean_ibu,
`Mean SRM` = mean_srm,
`Number of Beers` = n
)
```
Take a look at the table, ordered by number of beers in that style, descending.
```{r}
kable(style_centers_rename)
```
***
\newpage
### Ingredients
The lifecycle of ingredients thus far has been to unnest them from the raw JSON into a long string contained in `hops_name` and `malt_name`. Next each ingredient in each of those columns was split out into `hops_name_1`, `hops_name_2`, etc.
To get more granular with ingredients, we can further split out each individual ingredient into its own column. If a beer or style contains that ingredient, its row gets a 1 in that ingredient column and a 0 otherwise.
From this, we can find the total number of hops and malts per grouper.
* This function takes a dataframe and two other parameters set at the outset:
* `ingredient_want`: this can be `hops`, `malt`, or other ingredients like `yeast` if we pull that in
* `grouper`: can be a vector of one or more things to group by, like beer `id` or `style`
* Careful with using `name` as a grouper, as multiple beers share the same name (see the quick check just below)
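Before the function itself, a quick look at that caveat (a small aside using columns already in `beer_necessities`):
```{r, eval=FALSE, echo=TRUE}
# Beer names that appear more than once -- a reason to group by `id` rather than `name`
beer_necessities %>%
  count(name, sort = TRUE) %>%
  filter(n > 1)
```
The full function: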
```{r, eval=TRUE, echo=TRUE}
pick_ingredient_get_beer <- function (ingredient_want, df, grouper) {
# ----------------------- Setup --------------------------- #
# We've already split ingredient number names out from the concatenated string into columns like `malt_name_1`,
# `malt_name_2`, etc. We need to find the range of these columns; there will be a different number of malt
# columns than hops columns, for instance. The first one will be `<ingredient>_name_1` and from this we can find
# the index of this column in our dataframe. We get the name of last one with the `get_last_ing_name_col()`
# function. Then we save a vector of all the ingredient column names in `ingredient_colnames`. It will stay
# constant even if the indices change when we select out certain columns.
# First ingredient
first_ingredient_name <- paste(ingredient_want, "_name_1", sep="")
first_ingredient_index <- which(colnames(df)==first_ingredient_name)
# Get the last ingredient
get_last_ing_name_col <- function(df) {
for (col in names(df)) {
if (grepl(paste(ingredient_want, "_name_", sep = ""), col) == TRUE) {
name_last_ing_col <- col
}
}
return(name_last_ing_col)
}
# Last ingredient
last_ingredient_name <- get_last_ing_name_col(df)
last_ingredient_index <- which(colnames(df)==last_ingredient_name)
# Vector of all the ingredient column names
ingredient_colnames <- names(df)[first_ingredient_index:last_ingredient_index]
# Non-ingredient column names we want to keep
to_keep_col_names <- c("id", "cluster_assignment", "name", "abv", "ibu", "srm", "style", "style_collapsed")
# -------------------------------------------------------------------------------#
# Inside `gather_ingredients()` we take out superfluous column names that are not in `to_keep_col_names` or one
# of the ingredient columns, find what the new ingredient column indices are, since they'll have changed after
# we pared down and then gather all of the ingredient columns (e.g., `hops_name_1`) into one long column,
# `ing_keys` and all the actual ingredient names (e.g., Cascade) into `ing_names`.
# ----------------------------- Gather columns --------------------------------- #
gather_ingredients <- function(df, cols_to_gather) {
to_keep_indices <- which(colnames(df) %in% to_keep_col_names)
selected_df <- df[, c(to_keep_indices, first_ingredient_index:last_ingredient_index)]
new_ing_indices <- which(colnames(selected_df) %in% cols_to_gather) # indices will have changed since we pared down
df_gathered <- selected_df %>%
gather_(
key_col = "ing_keys",
value_col = "ing_names",
gather_cols = colnames(selected_df)[new_ing_indices]
) %>%
mutate(
count = 1
)
return(df_gathered)
}
beer_gathered <- gather_ingredients(df, ingredient_colnames) # ingredient colnames defined above function
# ------------------------------------------------------------------------------- #
# Next we get a vector of all ingredient levels and take out the one that's an empty string and
# use this vector of ingredient levels in `select_spread_cols()` below
# Get a vector of all ingredient levels
beer_gathered$ing_names <- factor(beer_gathered$ing_names)
ingredient_levels <- levels(beer_gathered$ing_names)
# Take out the level that's just an empty string
to_keep_levels <- !(c(1:length(ingredient_levels)) %in% which(ingredient_levels == ""))
ingredient_levels <- ingredient_levels[to_keep_levels]
beer_gathered$ing_names <- as.character(beer_gathered$ing_names)
# ----------------------------------------------------------------------------- #
# Then we spread the ingredient names: we take what was previously the `value` in our gathered dataframe, the
# actual ingredient names (Cascade, Centennial) and make that our `key`; it'll form the new column names. The
# new `value` is `count`; it'll populate the row cells. If a given row has a certain ingredient, it
# gets a 1 in the corresponding cell, an NA otherwise.
# We add a unique identifier for each row with `row`, which we'll drop later (see [Hadley's SO
# comment](https://stackoverflow.com/questions/25960394/unexpected-behavior-with-tidyr)).
# ------------------------------- Spread columns -------------------------------- #
spread_ingredients <- function(df) {
df_spread <- df %>%
mutate(
row = 1:nrow(df) # Add a unique identifier for each row which we'll need in order to spread; we'll drop this later
) %>%
spread(
key = ing_names,
value = count
)
return(df_spread)
}
beer_spread <- spread_ingredients(beer_gathered)
# ------------------------------------------------------------------------------- #
# ------------------------- Select only certain columns ------------------------- #
select_spread_cols <- function(df) {
to_keep_col_indices <- which(colnames(df) %in% to_keep_col_names)
to_keep_ingredient_indices <- which(colnames(df) %in% ingredient_levels)
to_keep_inds_all <- c(to_keep_col_indices, to_keep_ingredient_indices)
new_df <- df %>%
select_(
.dots = to_keep_inds_all
)
return(new_df)
}
beer_spread_selected <- select_spread_cols(beer_spread)
# ------------------------------------------------------------------------------- #
# Take out all rows that have no ingredients specified at all
inds_to_remove <- apply(beer_spread_selected[, first_ingredient_index:last_ingredient_index],
1, function(x) all(is.na(x)))
beer_spread_no_na <- beer_spread_selected[ !inds_to_remove, ]
# ----------------- Group ingredients by the grouper specified ------------------- #
# Then we do the final step and group by the groupers.
get_ingredients_per_grouper <- function(df, grouper = grouper) {
df_grouped <- df %>%
ungroup() %>%
group_by_(grouper)
not_for_summing <- which(colnames(df_grouped) %in% to_keep_col_names)
max_not_for_summing <- max(not_for_summing)
per_grouper <- df_grouped %>%
select(-c(abv, ibu, srm)) %>% # taking out temporarily
summarise_if(
is.numeric,
sum, na.rm = TRUE
) %>%
mutate(
total = rowSums(.[(max_not_for_summing + 1):ncol(.)], na.rm = TRUE)
)
# Send total to the second position
per_grouper <- per_grouper %>%
select(
id, total, everything()
)
# Replace total column with more descriptive name: total_<ingredient>
names(per_grouper)[which(names(per_grouper) == "total")] <- paste0("total_", ingredient_want)
return(per_grouper)
}
# ------------------------------------------------------------------------------- #
ingredients_per_grouper <- get_ingredients_per_grouper(beer_spread_selected, grouper)
return(ingredients_per_grouper)
}
```
Now run the function with `ingredient_want` as first hops, and then malt. Then join the resulting dataframes and remove/reorder some columns.
```{r, echo=TRUE, eval=TRUE}
# Run the entire function with ingredient_want set to hops, grouping by id
ingredients_per_beer_hops <- pick_ingredient_get_beer(ingredient_want = "hops",
beer_necessities,
grouper = c("id"))
# Same for malt
ingredients_per_beer_malt <- pick_ingredient_get_beer(ingredient_want = "malt",
beer_necessities,
grouper = c("id"))
# Join those on our original dataframe by id
beer_ingredients_join_first_ingredient <- left_join(beer_necessities, ingredients_per_beer_hops,
by = "id")
beer_ingredients_join_all <- left_join(beer_ingredients_join_first_ingredient, ingredients_per_beer_malt,
by = "id")
# Take out some unnecessary columns
unnecessary_cols <- c("styleId", "abv_scaled", "ibu_scaled", "srm_scaled",
"hops_id", "malt_id", "glasswareId", "style.categoryId")
beer_ingredients_join_all <- beer_ingredients_join_all[, (! names(beer_ingredients_join_all) %in% unnecessary_cols)]
# If we also want to take out any of the malt_name_1, malt_name_2, etc. columns we can do this with a grep
more_unnecessary <- c("hops_name_|malt_name_")
beer_ingredients_join_all <-
beer_ingredients_join_all[, (! grepl(more_unnecessary, names(beer_ingredients_join_all)) == TRUE)]
# Reorder columns a bit
beer_ingredients_join_all <- beer_ingredients_join_all %>%
select(
id, name, total_hops, total_malt, everything(), -description
)
# Keep only beers that fall into a style_collapsed bucket
# We're not filtering by levels in beer_necessities$style_collapsed because those levels contain more than what's in just the keywords of collapse_styles()
beer_ingredients_join <- beer_ingredients_join_all %>%
filter(
style_collapsed %in% levels(style_centers$style_collapsed)
) %>%
droplevels()
# And get a df that includes total_hops and total_malt but not all the other ingredient columns
beer_totals_all <- beer_ingredients_join_all %>%
select(
id, name, total_hops, total_malt, style, style_collapsed,
abv, ibu, srm, glass, hops_name, malt_name
)
# And just styles in style_collapsed
beer_totals <- beer_ingredients_join %>%
filter(
style_collapsed %in% levels(style_centers$style_collapsed)
) %>%
droplevels()
```
Now we're left with something of a sparse matrix of all the ingredients compared to all the beers. Scroll right to see the extent of the granularity this affords us.
```{r, echo=TRUE}
kable(beer_ingredients_join[1:20, ])
```
\newpage
***
Now that the munging is done, onto the main question: to what extent do natural clusters in beer align with style boundaries?
### Unsupervised Clustering
We k-means cluster beers based on certain numeric predictor variables. The data we'll use includes all beers as well as the total number of hops and malts in each beer.
**Prep**
We'll prep the data for clustering first.
Here we define a function that takes a dataframe, a set of predictors, a set of variables to scale, and a response variable.
We select only the response variable(s) and the variables to cluster on. If there are any missing values, we remove them. (NB: not very many beers have an SRM value, so we may not want to omit rows based on it.)
* Take out outliers: keep only beers with an ABV between 3 and 20 and an IBU less than 200
* Then cluster on just the predictors and compare to the response variable
```{r, echo=TRUE}
library(NbClust)
prep_clusters <- function(df, preds, to_scale, resp) {
df_for_clustering <- df %>%
select_(.dots = c(resp, preds)) %>%
na.omit() %>%
filter(
abv < 20 & abv > 3 # Only keep beers with ABV between 3 and 20 and an IBU less than 200
) %>%
filter(
ibu < 200
)
df_all_preds <- df_for_clustering %>%
select_(.dots = preds)
df_preds_scale <- df_all_preds %>%
select_(.dots = to_scale) %>%
rename(
abv_scaled = abv,
ibu_scaled = ibu,
srm_scaled = srm
) %>%
scale() %>%
as_tibble()
df_preds <- bind_cols(df_preds_scale, df_all_preds[, (!names(df_all_preds) %in% to_scale)])
df_outcome <- df_for_clustering %>%
select_(.dots = resp) %>%
na.omit()
cluster_prep_out <- list(df_for_clustering = df_for_clustering, preds = df_preds, outcome = df_outcome)
return(cluster_prep_out)
}
```
Prep the clusters. We'll cluster on the predictors ABV, IBU, SRM, total number of hops, and total number of malts.
```{r, echo=TRUE}
cluster_on <- c("abv", "ibu", "srm", "total_hops", "total_malt")
to_scale <- c("abv", "ibu", "srm", "total_hops", "total_malt")
response_vars <- c("name", "style", "style_collapsed")
cluster_prep <- prep_clusters(df = beer_totals,
preds = cluster_on,
to_scale = to_scale,
resp = response_vars)
```
After prepping, we're left with `r nrow(cluster_prep$preds)` beers to cluster on.
Before clustering, we can determine an optimal number of clusters by setting the minimum to 2 and max to 15 clusters.
From the resulting histogram (not run here for computational reasons), 10 seemed an optimal number of clusters.
```{r, echo=TRUE, eval=FALSE}
nb <- NbClust(cluster_prep$preds, distance = "euclidean",
min.nc = 2, max.nc = 15, method = "kmeans")
hist(nb$Best.nc[1,], breaks = max(na.omit(nb$Best.nc[1,])))
```
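As a rough cross-check (an aside, not part of the original analysis), we could also eyeball an elbow plot of the total within-cluster sum of squares using plain `kmeans()`:
```{r, eval=FALSE, echo=TRUE}
# Total within-cluster sum of squares for k = 2 through 15
wss <- sapply(2:15, function(k) {
  set.seed(9)
  kmeans(cluster_prep$preds, centers = k, nstart = 10)$tot.withinss
})
plot(2:15, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")
```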
**Cluster**
Now cluster on the prepped predictors with 10 centers.
```{r, echo=TRUE}
cluster_it <- function(df_preds, n_centers) {
set.seed(9)
clustered_df_out <- kmeans(x = df_preds$preds, centers = n_centers, trace = FALSE)
clustered_df <- as_tibble(data.frame(
cluster_assignment = factor(clustered_df_out$cluster),
df_preds$outcome, df_preds$preds,
df_preds$df_for_clustering %>% select(abv, ibu, srm)))
return(clustered_df)
}
clustered_beer <- cluster_it(df_preds = cluster_prep, n_centers = 10)
```
Head of the resulting clustered data. Cluster assignment column on the far left.
```{r, echo=TRUE}
kable(clustered_beer[1:20, ])
# How many rows do we have?
nrow(clustered_beer)
```
Join the clustered beer on `beer_ingredients_join`
```{r}
beer_ingredients_join_clustered <- left_join(beer_ingredients_join, clustered_beer,
by = "name")
```
A table of cluster counts broken down by style
```{r}
cluster_table_counts <- table(style = clustered_beer$style_collapsed, cluster = clustered_beer$cluster_assignment)
kable(cluster_table_counts)
```
Plot the clusters. There are 3 axes: ABV, IBU, and SRM, so we choose two at a time.
```{r, echo=TRUE}
clustered_beer_plot_abv_ibu <- ggplot(data = clustered_beer, aes(x = abv, y = ibu, colour = cluster_assignment)) +
geom_jitter() + theme_minimal() +
ggtitle("k-Means Clustering of Beer by ABV, IBU, SRM") +
labs(x = "ABV", y = "IBU") +
labs(colour = "Cluster Assignment")
clustered_beer_plot_abv_ibu
clustered_beer_plot_abv_srm <- ggplot(data = clustered_beer, aes(x = abv, y = srm, colour = cluster_assignment)) +
geom_jitter() + theme_minimal() +
ggtitle("k-Means Clustering of Beer by ABV, IBU, SRM") +
labs(x = "ABV", y = "SRM") +
labs(colour = "Cluster Assignment")
clustered_beer_plot_abv_srm
```
Now we can add in the style centers (means) for each `style_collapsed` and label it.
```{r, echo=TRUE}
library(ggrepel)
abv_ibu_clusters_vs_style_centers <- ggplot() +
geom_point(data = clustered_beer,
aes(x = abv, y = ibu, colour = cluster_assignment), alpha = 0.5) +
geom_point(data = style_centers,
aes(mean_abv, mean_ibu), colour = "black") +
geom_text_repel(data = style_centers, aes(mean_abv, mean_ibu, label = style_collapsed),
box.padding = unit(0.45, "lines"),
family = "Calibri",
label.size = 0.3) +
ggtitle("Popular Styles vs. k-Means Clustering of Beer by ABV, IBU, SRM") +
labs(x = "ABV", y = "IBU") +
labs(colour = "Cluster Assignment") +
theme_minimal()
abv_ibu_clusters_vs_style_centers
```
The clustering above used a smaller number of clusters (10) than there are `styles_collapsed`. That makes it difficult to determine whether a given style fits snugly into a cluster or not.
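For reference, a quick way to count how many collapsed styles are actually represented among the clustered beers (a small aside):
```{r, eval=FALSE, echo=TRUE}
# Number of distinct collapsed styles vs. the 10 clusters we asked for
n_distinct(clustered_beer$style_collapsed)
```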
**Cluster on just certain selected styles**
We'll take five distinct collapsed styles and re-run the clustering on only the beers that fall into those categories.
These styles were intentionally chosen because they are quite distinct: Blonde, IPA, Stout, Tripel, and Wheat. Arguably, of these five styles, Blondes and Wheats are the closest.
```{r, echo=TRUE}
styles_to_keep <- c("Blonde", "India Pale Ale", "Stout", "Tripel", "Wheat")
bt_certain_styles <- beer_totals %>%
filter(
style_collapsed %in% styles_to_keep
) %>%
droplevels()
cluster_on <- c("abv", "ibu", "srm", "total_hops", "total_malt")
to_scale <- c("abv", "ibu", "srm", "total_hops", "total_malt")
response_vars <- c("name", "style", "style_collapsed")
bt_cluster_prep <- prep_clusters(df = bt_certain_styles,
preds = cluster_on,
to_scale = to_scale,
resp = response_vars)
certain_styles_clustered <- cluster_it(df_preds = bt_cluster_prep, n_centers = 5)
style_centers_certain_styles <- style_centers %>%
filter(style_collapsed %in% styles_to_keep)
```
Table of style vs. cluster.
```{r, echo=TRUE}
kable(table(style = certain_styles_clustered$style_collapsed, cluster = certain_styles_clustered$cluster_assignment))
```
Now that we have a manageable number of styles, we can see how well each cluster fits each style. If the features we clustered on perfectly predicted style, each color (cluster) would be unique to a single facet of the plot. (E.g., the leftmost facet entirely blue, the second from left entirely green, etc.)
```{r, echo=TRUE}
by_style_plot <- ggplot() +
geom_point(data = certain_styles_clustered,
aes(x = abv, y = ibu,
colour = cluster_assignment), alpha = 0.5) +
facet_grid(. ~ style_collapsed) +
geom_point(data = style_centers_certain_styles,
aes(mean_abv, mean_ibu), colour = "black", shape = 5) +
ggtitle("Selected Styles Cluster Assignment") +
labs(x = "ABV", y = "IBU") +
labs(colour = "Cluster") +
theme_bw()
by_style_plot
```
<!-- ggplot() + -->
<!-- geom_point(data = certain_styles_clustered, -->
<!-- aes(x = abv, y = ibu, -->
<!-- shape = cluster_assignment, -->
<!-- colour = style_collapsed), alpha = 0.5) + -->
<!-- geom_point(data = style_centers_certain_styles, -->
<!-- aes(mean_abv, mean_ibu), colour = "black") + -->
<!-- geom_text_repel(data = style_centers_certain_styles, -->
<!-- aes(mean_abv, mean_ibu, label = style_collapsed), -->
<!-- box.padding = unit(0.45, "lines"), -->
<!-- family = "Calibri", -->
<!-- label.size = 0.3) + -->
<!-- ggtitle("Selected Styles (colors) matched with Cluster Assignments (shapes)") + -->
<!-- labs(x = "ABV", y = "IBU") + -->
<!-- labs(colour = "Style", shape = "Cluster Assignment") + -->
<!-- theme_bw() -->
<!-- ``` -->
## Random asides into hops
**Do more hops always mean more bitterness?**
* It would appear so, from this graph (considering only beers in the most popular styles) and this regression (beta = 2.394418)
```{r, echo=TRUE}
ggplot(data = beer_ingredients_join, aes(total_hops, ibu)) +
geom_point(aes(total_hops, ibu, colour = style_collapsed)) +
geom_smooth(method = lm, se = FALSE, colour = "black") +
ggtitle("Hops Per Beer vs. Bitterness") +
labs(x = "Number of Hops", y = "IBU", colour = "Style Collapsed") +
theme_minimal()
```
Regressing total number of hops on bitterness (IBU):
```{r, echo = TRUE}
hops_ibu_lm <- lm(ibu ~ total_hops, data = beer_ingredients_join)
broom::tidy(hops_ibu_lm)
```
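For a sense of overall fit (an aside, not in the original analysis), we could also pull the model-level summary with `broom::glance()`:
```{r, echo=TRUE, eval=FALSE}
# Model-level statistics (R-squared, etc.) for the hops-vs-IBU regression
broom::glance(hops_ibu_lm)
```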
This holds even with 5 or more hops.
```{r, echo=TRUE}
ggplot(data = beer_ingredients_join[which(beer_ingredients_join$total_hops >= 5), ], aes(total_hops, ibu)) +
geom_point(aes(total_hops, ibu, colour = style_collapsed)) +
geom_smooth(method = lm, se = FALSE, colour = "black") +
ggtitle("5+ Hops Per Beer vs. Bitterness") +
labs(x = "Number of Hops", y = "IBU", colour = "Style Collapsed") +
theme_minimal()
```
```{r, echo = TRUE}
five_plus_hops_ibu_lm <- lm(ibu ~ total_hops, data = beer_ingredients_join[which(beer_ingredients_join$total_hops >= 5), ])
broom::tidy(five_plus_hops_ibu_lm)
```
**Most popular hops**
```{r, echo=TRUE}
# Gather up all the hops columns into one called `hop_name`
beer_necessities_hops_gathered <- beer_necessities %>%
gather(
hop_key, hop_name, hops_name_1:hops_name_13
) %>% as_tibble()
# Filter to just those beers that have at least one hop
beer_necessities_w_hops <- beer_necessities_hops_gathered %>%
filter(!is.na(hop_name)) %>%
filter(!hop_name == "")
beer_necessities_w_hops$hop_name <- factor(beer_necessities_w_hops$hop_name)
# For all hops, find the number of beers they're in as well as those beers' mean IBU and ABV
hops_beer_stats <- beer_necessities_w_hops %>%
ungroup() %>%
group_by(hop_name) %>%
summarise(
mean_ibu = mean(ibu, na.rm = TRUE),
mean_abv = mean(abv, na.rm = TRUE),
n = n()
) %>%
arrange(desc(n))
# Pare to hops that are used in more than 50 beers
pop_hops_beer_stats <- hops_beer_stats[hops_beer_stats$n > 50, ]
pop_hops_display <- pop_hops_beer_stats %>%
rename(
`Hop Name` = hop_name,
`Mean IBU` = mean_ibu,
`Mean ABV` = mean_abv,
`Number Beers Containing this Hop` = n
)
kable(pop_hops_display)
# Keep just beers that contain these most popular hops
beer_necessities_w_popular_hops <- beer_necessities_w_hops %>%
filter(hop_name %in% pop_hops_beer_stats$hop_name) %>%
droplevels()
```
Are there certain hops that are used more often in very high IBU or ABV beers?
It's hard to detect a pattern.
```{r, echo = TRUE}
ggplot(data = beer_necessities_w_popular_hops) +
geom_point(aes(abv, ibu, colour = hop_name)) +
ggtitle("Beers Containing most Popular Hops") +
labs(x = "ABV", y = "IBU", colour = "Hop Name") +
theme_minimal()
```
```{r, echo=TRUE}
ggplot(data = pop_hops_beer_stats) +
geom_point(aes(mean_abv, mean_ibu, colour = hop_name, size = n)) +
ggtitle("Most Popular Hops' Effect on Alcohol and Bitterness") +
labs(x = "Mean ABV per Hop Type", y = "Mean IBU per Hop Type", colour = "Hop Name",
size = "Number of Beers") +
theme_minimal()
```
# Neural Net
* Can ABV, IBU, and SRM be used in a neural net to predict `style` or `style_collapsed`?
* In the function, specify the dataframe and the outcome, either `style` or `style_collapsed`; the one not specified as `outcome` will be dropped
* The predictor columns will be everything not specified in the vector `predictor_vars`
* The function returns the outcome variable selected, neural net output, variable importance, the prediction dataframe, predictions, and accuracy
```{r, warning=FALSE, echo=TRUE, eval=TRUE, message=FALSE}
library(nnet)
library(caret)
run_neural_net <- function(df, outcome, predictor_vars) {
out <- list(outcome = outcome)