Description
Currently geom_density() directly uses density() to compute the kernel density estimate. By its nature, this estimate has an "extending property": the output density can imply that values outside the range of the input data are possible (with the default "gaussian" kernel). This is often a realistic assumption, but there are cases where it does not hold because the data is bounded (for example, when only positive values are possible). It would be a good idea to support kernel density estimation on this type of bounded data with a new bounds argument of geom_density().
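As a quick illustration of the "extending property", here is a small base R sketch. It only uses stats::density(); the variable names are illustrative and the sample is just an example of strictly positive data.

```r
set.seed(101)
x <- rexp(100)                 # every sampled value is positive
dens <- stats::density(x)      # default "gaussian" kernel

# density() evaluates the estimate on an equally spaced grid that
# extends past the data range, so part of the grid lies below zero.
step <- diff(dens$x[1:2])
mass_below_zero <- sum(dens$y[dens$x < 0]) * step

min(dens$x) < 0                # TRUE: the grid reaches into impossible values
mass_below_zero                # approximate probability mass assigned to x < 0
```

The mass below zero is a rough Riemann-sum approximation, which is enough to show that the unbounded estimate puts some probability on values that cannot occur.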
In my opinion, one of the methods that is easiest to implement, understand, and teach is the "reflection" method. There is one possible description on this page. Basically, boundary correction is done by computing the "standard" kernel density estimate first and then "reflecting" the tails that fall outside the desired interval back inside it. Densities inside and outside the desired interval are added together in a "symmetric fashion": d(x) = d_f(x) + d_f(l - (x-l)) + d_f(r + (r-x)) for x inside [l, r], where d_f is the standard density estimate of the input, d is the output density, and l and r are the left and right edges of the desired interval.
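The reflection formula above can be sketched in base R roughly as follows. This is not the code from the proposed ggplot2 change; reflect_density() is a hypothetical helper name, and the sketch assumes all input values already lie inside the bounds.

```r
# Hypothetical helper (not part of ggplot2 or stats):
# kernel density estimate with reflection at the bounds.
reflect_density <- function(x, bounds = c(-Inf, Inf), n = 512, ...) {
  l <- bounds[1]
  r <- bounds[2]
  # Assumption of this sketch: the data respects the bounds.
  stopifnot(all(x >= l & x <= r))

  # "Standard" unbounded estimate, d_f, on density()'s own grid.
  dens <- stats::density(x, n = n, ...)
  # Interpolate d_f and treat it as 0 outside the computed grid.
  d_f <- stats::approxfun(dens$x, dens$y, yleft = 0, yright = 0)

  # Output grid: the original grid clipped to [l, r].
  grid <- seq(max(min(dens$x), l), min(max(dens$x), r), length.out = n)

  # d(x) = d_f(x) + d_f(l - (x - l)) + d_f(r + (r - x)),
  # skipping the reflection term for an infinite bound.
  y <- d_f(grid)
  if (is.finite(l)) y <- y + d_f(l - (grid - l))
  if (is.finite(r)) y <- y + d_f(r + (r - grid))

  data.frame(x = grid, y = y)
}

set.seed(101)
out <- reflect_density(runif(100), bounds = c(0, 1))
range(out$x)  # the output grid stays inside [0, 1]
```

Because the tails are folded back rather than cut off, the corrected estimate should still integrate to approximately one over the bounded interval.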
I made some quick and dirty changes to ggplot2 for demonstration. stat_density() gets a bounds argument with a default value of c(-Inf, Inf). Here are some examples of the proposed functionality:
library(ggplot2)  # needs the modified build that adds `bounds`
library(tibble)
set.seed(101)

# Uniform sample: bounded on both sides
ggplot(tibble(x = runif(100)), aes(x)) +
  geom_density() +
  geom_density(bounds = c(0, 1), color = "blue") +
  stat_function(data = tibble(x = c(0, 1)), fun = dunif, color = "red")

# Exponential sample: bounded on the left only
ggplot(tibble(x = rexp(100)), aes(x)) +
  geom_density() +
  geom_density(bounds = c(0, Inf), color = "blue") +
  stat_function(data = tibble(x = c(0, 5)), fun = dexp, color = "red")