vignettes/classification.Rmd
+17 -20
@@ -14,15 +14,12 @@ knitr::opts_chunk$set(
14
14
)
15
15
```
16
16
17
-
In this article, we'll use the stacks package to predict the island that penguins come from using a stacked ensemble on the `palmerpenguins` data. This vignette assumes that you're familiar with tidymodels "proper," as well as the basic grammar of the package, and have seen it implemented on numeric data; if this is not the case, check out the "Getting Started With stacks" vignette!
18
-
19
-
The package is closely integrated with the rest of the functionality in tidymodels—we'll load those packages as well, in addition to a few tidyverse packages to evaluate our results later on.
17
+
In this vignette, we'll tackle a multiclass classification problem using the stacks package. This vignette assumes that you're familiar with tidymodels "proper," as well as the basic grammar of the package, and have seen it implemented on numeric data; if this is not the case, check out the "Getting Started With stacks" vignette!
20
18
21
19
```{r setup, eval = FALSE}
22
20
library(tidymodels)
21
+
library(tidyverse)
23
22
library(stacks)
24
-
library(purrr)
25
-
library(dplyr)
26
23
```
27
24
28
25
```{r packages, include = FALSE}
@@ -35,31 +32,35 @@ library(yardstick)
35
32
library(stacks)
36
33
library(purrr)
37
34
library(dplyr)
35
+
library(tidyr)
38
36
```
39
37
40
-
We'll make use of the `palmerpenguins::penguins`data, giving measurements taken from three different species of penguins from three different antarctic islands! We'll be predicting penguins species using the rest of the predictors in the data.
38
+
Allison Horst's `palmerpenguins` package contains measurements taken from three different species of penguins on three different islands in Antarctica. The study was carried out across three years, and we might suspect that weather conditions play a role in penguin migration and, to some extent, morphology (e.g. body mass). In this article, we'll use the stacks package to predict the year in which these measurements were taken, using a stacked ensemble on the `palmerpenguins` data.
41
39
42
40
```{r, message = FALSE, warning = FALSE}
43
41
library(palmerpenguins)
44
42
data("penguins")
45
43
46
44
str(penguins)
47
45
48
-
penguins <- penguins[!is.na(penguins$sex),]
46
+
penguins <-
47
+
penguins %>%
48
+
drop_na(sex) %>%
49
+
mutate(year = as.factor(year))
49
50
```
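As a minimal sketch of what the added preprocessing step does, the same `drop_na()` and `mutate()` calls can be run on a small made-up table. The `toy` data frame and its values below are hypothetical stand-ins for `penguins`, not part of the vignette; the point is only that rows with a missing `sex` are dropped and that `year` becomes a factor, which a classification model needs as its outcome.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for `penguins` (not the real dataset)
toy <- tibble(
  sex  = c("male", NA, "female", "male"),
  year = c(2007L, 2008L, 2008L, 2009L)
)

toy_clean <-
  toy %>%
  drop_na(sex) %>%                 # remove rows where sex is missing
  mutate(year = as.factor(year))   # integer year -> factor outcome

levels(toy_clean$year)  # "2007" "2008" "2009"
nrow(toy_clean)         # 3
```

Leaving `year` as an integer would instead frame the problem as regression, which is presumably why the commit converts it before modeling.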
50
51
51
-
Let's plot the data to get a sense for how separable these three island groups are.
52
+
Let's plot the data to get a sense for how separable these three year groups are.
52
53
53
54
```{r, message = FALSE, warning = FALSE}
54
55
library(ggplot2)
55
56
56
57
ggplot(penguins) +
57
-
aes(x = bill_length_mm, y = bill_depth_mm, color = island) +
58
+
aes(x = bill_length_mm, y = bill_depth_mm, color = year) +
58
59
geom_point() +
59
-
labs(x = "Bill Length (mm)", y = "Bill Depth (mm)", col = "island")
60
+
labs(x = "Bill Length (mm)", y = "Bill Depth (mm)", col = "Year")
60
61
```
61
62
62
-
Just with these two predictors, it seems like we can already start to separate these islands decently well! Let's see how well the stacked ensemble can classify these penguins.
63
+
Just with these two predictors, it seems like this might be a tough problem to solve! Let's see how well the stacked ensemble can classify these penguins.
Note that we now use the ROC AUC metric rather than root mean squared error (as in the numeric response setting)—any yardstick metric with classification functionality would work here.
91
-
92
89
We also need to use the same control settings as in the numeric response setting:
93
90
94
91
```{r}
95
92
ctrl_grid <- control_stack_grid()
96
93
```
97
94
98
-
We'll define two different model definitions to try to predict island—a random forest and a neural network.
95
+
We'll define two different model definitions to try to predict year—a random forest and a neural network.
99
96
100
97
Starting out with a random forest:
101
98
@@ -177,7 +174,7 @@ Computing the ROC AUC for the model:
177
174
```{r, eval = FALSE}
178
175
yardstick::roc_auc(
179
176
penguins_pred,
180
-
truth = island,
177
+
truth = year,
181
178
contains(".pred_")
182
179
)
183
180
```
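To see how the multiclass `roc_auc()` call above behaves once `truth` is `year`, here is a small self-contained sketch. The `toy_pred` table and its probabilities are invented for illustration and only mimic the assumed shape of `penguins_pred`: one factor column for the truth plus one `.pred_*` probability column per class, in the same order as the factor levels.

```r
library(dplyr)
library(yardstick)

# Hypothetical predictions (made-up values, not output from the vignette)
toy_pred <- tibble(
  year       = factor(c("2007", "2008", "2009", "2007", "2008", "2009")),
  .pred_2007 = c(0.7, 0.2, 0.1, 0.5, 0.3, 0.2),
  .pred_2008 = c(0.2, 0.5, 0.3, 0.3, 0.4, 0.3),
  .pred_2009 = c(0.1, 0.3, 0.6, 0.2, 0.3, 0.5)
)

# The selected probability columns must line up with levels(toy_pred$year)
auc <- roc_auc(
  toy_pred,
  truth = year,
  contains(".pred_")
)

auc
```

With more than two classes, yardstick computes a multiclass generalization of the AUC by default, so the result is a single row rather than one value per year.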
@@ -187,7 +184,7 @@ Looks like our predictions were pretty strong! How do the stacks predictions per