Description
Hi,
When using fct_reorder() in the presence of missing values, you often do not get the expected result.
For instance, in the following code, the "blue" level gets an NA summary and is therefore moved to the last position in the result. In larger datasets, where missing values happen everywhere, this results in fct_reorder() doing frustratingly nothing.
library(tidyverse)
df = tribble(
~color, ~a,
"purple", 1,
"purple", 2,
"blue", 3, #NA
"blue", 4,
"green", 5,
"green", 6
)
df$color = factor(df$color)
df$color %>% levels
#> [1] "blue" "green" "purple"
fct_reorder(df$color, df$a) %>% levels
#> [1] "purple" "blue" "green"
df$a[3]=NA
fct_reorder(df$color, df$a) %>% levels
#> [1] "purple" "green" "blue"
fct_reorder(df$color, df$a, na.rm=TRUE) %>% levels
#> [1] "purple" "blue" "green"
Created on 2022-08-10 by the reprex package (v2.0.1)
This is especially unexpected as the default function, median(), has na.rm=FALSE by default. Using other common summary functions like min() and max() has the same problem.
There is a mention of this in the documentation ("... Other arguments passed on to .fun. A common argument is na.rm = TRUE."), but I don't think this is explicit enough.
Could there be some kind of warning to suggest we add na.rm=TRUE? For instance if(any(is.na(summary))) warn("missing").
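To make the suggestion concrete, here is a minimal base-R sketch of what such a warning could look like. It is not the forcats implementation: the helper name reorder_with_warning() is hypothetical, and it assumes the reordering amounts to sorting levels by a per-level summary with NA summaries placed last, which matches the output of the reprex above.

```r
# Hypothetical helper (not part of forcats): reorder levels by a per-level
# summary and warn when the summary is NA for at least one level.
reorder_with_warning <- function(f, x, fun = median, ...) {
  f <- as.factor(f)
  stopifnot(length(f) == length(x))

  # One summary value per level; extra arguments such as na.rm go to `fun`
  summary <- tapply(x, f, fun, ...)

  missing_lvls <- names(summary)[is.na(summary)]
  if (length(missing_lvls) > 0) {
    warning(
      "Summary is NA for level(s): ", paste(missing_lvls, collapse = ", "),
      ". These levels are placed last; consider na.rm = TRUE.",
      call. = FALSE
    )
  }

  # order() puts NA last, which reproduces the behaviour reported above
  factor(f, levels = levels(f)[order(summary)])
}

color <- factor(c("purple", "purple", "blue", "blue", "green", "green"))
a <- c(1, 2, NA, 4, 5, 6)

levels(reorder_with_warning(color, a))                # warns about "blue"
levels(reorder_with_warning(color, a, na.rm = TRUE))  # no warning
```

Naming the affected levels in the message should make the problem easier to spot in large datasets than a generic "missing" warning.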
Otherwise, maybe this should be mentioned up in the description, for instance something like "Any missing value returned by the summary function for a level will cause this level to be sent to the end." (ok that's not well written but you get the point)
You might even want the user to explicitly opt in to na.rm=FALSE, and by default inject na.rm=TRUE into the summary function if na.rm is in formals(.fun). This is a bit invasive, I'll give you that, but I cannot see any real use case where na.rm=FALSE could be wanted.
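A rough sketch of that opt-out idea, in base R only. The wrapper name with_default_na_rm() is hypothetical. One caveat I ran into: formals() returns NULL for primitives such as min() and max(), so this sketch inspects formals(args(fun)) instead, which is an assumption about how the check would need to be written rather than anything forcats does.

```r
# Hypothetical wrapper (not part of forcats): returns a version of `fun` that
# uses na.rm = TRUE by default, but only when `fun` declares an na.rm argument
# and the caller has not set na.rm explicitly.
with_default_na_rm <- function(fun) {
  # args() also works for primitives, where formals() alone returns NULL
  has_na_rm <- "na.rm" %in% names(formals(args(fun)))

  function(x, ...) {
    extra <- list(...)
    if (has_na_rm && !"na.rm" %in% names(extra)) {
      extra$na.rm <- TRUE
    }
    do.call(fun, c(list(x), extra))
  }
}

color <- factor(c("purple", "purple", "blue", "blue", "green", "green"))
a <- c(1, 2, NA, 4, 5, 6)

# Per-level summaries: "blue" is no longer NA
tapply(a, color, with_default_na_rm(median))

# Explicit opt-out still yields NA for "blue"
tapply(a, color, with_default_na_rm(median), na.rm = FALSE)
```

Note that functions which only accept na.rm through ... (mean() is one example) would not be covered by a formals-based check, so they would keep their current behaviour.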