Description
predict(type = "prob") and predict(type = "class") both produce a column named .pred_class if the outcome has a level named "class".
library(parsnip)
library(tibble)
x <- tibble(
  class = factor(sample(c("class", "class_1"), 100, replace = TRUE)),
  a = rnorm(100),
  b = rnorm(100)
)
mod <- logistic_reg() %>%
  set_mode(mode = "classification") %>%
  fit(class ~ a + b, data = x)
predict(mod, type = "class", new_data = x)
#> # A tibble: 100 × 1
#>    .pred_class
#>    <fct>
#>  1 class_1
#>  2 class_1
#>  3 class
#>  4 class_1
#>  5 class_1
#>  6 class
#>  7 class
#>  8 class
#>  9 class
#> 10 class
#> # … with 90 more rows
predict(mod, type = "prob", new_data = x)
#> # A tibble: 100 × 2
#>    .pred_class .pred_class_1
#>          <dbl>         <dbl>
#>  1       0.498         0.502
#>  2       0.475         0.525
#>  3       0.556         0.444
#>  4       0.457         0.543
#>  5       0.490         0.510
#>  6       0.520         0.480
#>  7       0.516         0.484
#>  8       0.525         0.475
#>  9       0.550         0.450
#> 10       0.562         0.438
#> # … with 90 more rows
Created on 2022-05-09 by the reprex package (v2.0.1)
Some packages downstream from parsnip join these two tibbles together, resulting in issues like tidymodels/stacks#125 and tidymodels/tune#487.
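To make the collision concrete, here is a minimal sketch of what happens when the two outputs are put side by side. It uses plain data frames and base cbind() as stand-ins; how tune and stacks actually combine the tibbles is not shown in this issue, so treat the binding step as an assumption. The values are copied from the reprex above for illustration only.

```r
# Stand-ins for the two prediction tibbles above (illustrative values).
class_preds <- data.frame(
  .pred_class = factor(c("class_1", "class"), levels = c("class", "class_1"))
)
prob_preds <- data.frame(
  .pred_class   = c(0.498, 0.556),
  .pred_class_1 = c(0.502, 0.444)
)

# Column names that appear in both outputs.
clash <- intersect(names(class_preds), names(prob_preds))
print(clash)

# cbind() keeps both columns under the same name, so any later
# lookup of `.pred_class` by name is ambiguous.
combined <- cbind(class_preds, prob_preds)
print(names(combined))
```

Binding functions that repair names instead would avoid the duplicate, but then code that expects a column called .pred_class would no longer find it under that name.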
@DavisVaughan and I spent some time on this this morning and concluded that erroring in predict(type = "prob") when an outcome level is named "class" is likely the best route here. Erroring in parsnip, before the predictions are generated, means that downstream packages (tune, stacks, possibly others) need not anticipate this edge case when joining predictions. It also gives us a chance to raise the same (informative) error any time this issue comes up.
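As a rough sketch of what such a check might look like: the helper name check_prob_levels() is hypothetical and not part of parsnip, and the error wording is only a placeholder. It assumes probability columns are named by prefixing each level with ".pred_", which matches the output above.

```r
# Hypothetical helper, not part of parsnip's API.
# Assumes probability columns are named ".pred_" + level.
check_prob_levels <- function(lvls) {
  # The column name that predict(type = "class") returns.
  reserved <- ".pred_class"
  prob_cols <- paste0(".pred_", lvls)
  bad <- lvls[prob_cols %in% reserved]
  if (length(bad) > 0) {
    stop(
      "The outcome has a level named \"", bad[1], "\", so its ",
      "class probability column would also be named `", reserved, "`. ",
      "Please rename that level before fitting.",
      call. = FALSE
    )
  }
  invisible(lvls)
}

# Passes silently for ordinary level names.
ok <- check_prob_levels(c("yes", "no"))

# Errors for the levels used in the reprex; capture the message.
msg <- tryCatch(
  check_prob_levels(c("class", "class_1")),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Running the check before any predictions are generated is what lets downstream packages skip the edge case entirely.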
This solution doesn't feel very satisfying. Some alternatives:
- changing the column name at predict(type = "prob") in this case, e.g. generating .pred_class___
- handling these edge cases later on, à la stacks#126 (handle "class"-like outcome names and levels)
These didn't sound very satisfying either. 🤷
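For comparison, the renaming alternative could be sketched as follows. The helper prob_col_names() and its arguments are hypothetical; only the ".pred_class___" name comes from the suggestion above.

```r
# Hypothetical helper: builds probability column names from outcome
# levels and appends a suffix to any name that clashes with the
# column returned by predict(type = "class").
prob_col_names <- function(lvls, reserved = ".pred_class", suffix = "___") {
  nms <- paste0(".pred_", lvls)
  clash <- nms %in% reserved
  nms[clash] <- paste0(nms[clash], suffix)
  nms
}

out <- prob_col_names(c("class", "class_1"))
print(out)
```

One drawback of this sketch is that a level literally named "class___" would collide with the renamed column, so the edge case is moved rather than removed.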