Skip to content

Weighted density #5254

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
May 20, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,8 @@
* `scale_{x/y}_discrete()` can now accept a `sec.axis`. It is recommended to
only use `dup_axis()` to set custom breaks or labels, as discrete variables
cannot be transformed (@teunbrand, #3171).
* `stat_density()` has the new computed variable: `wdensity`, which is
calculated as the density times the sum of weights (@teunbrand, #4176).

# ggplot2 3.5.1

Expand Down
14 changes: 10 additions & 4 deletions R/stat-density.R
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,8 @@
#' @eval rd_computed_vars(
#' density = "density estimate.",
#' count = "density * number of points - useful for stacked density plots.",
#' wdensity = "density * sum of weights. In absence of weights, the same as
#' `count`.",
#' scaled = "density estimate, scaled to maximum of 1.",
#' n = "number of points.",
#' ndensity = "alias for `scaled`, to mirror the syntax of [`stat_bin()`]."
Expand Down Expand Up @@ -113,17 +115,19 @@ StatDensity <- ggproto("StatDensity", Stat,
compute_density <- function(x, w, from, to, bw = "nrd0", adjust = 1,
kernel = "gaussian", n = 512,
bounds = c(-Inf, Inf)) {
nx <- length(x)
nx <- w_sum <- length(x)
if (is.null(w)) {
w <- rep(1 / nx, nx)
} else {
w <- w / sum(w)
w_sum <- sum(w)
w <- w / w_sum
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function fit_data_to_bounds() checks for division by zero but here you're not doing that. I think in weird corner cases with zero weights this could give you different results depending on whether a point is inside the bounds or not.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been trying to come up with examples that should be valid, but aren't, but I'm not having an easy time of it. If you censor a non-zero weight and are left with all zero weights, the weighted density should reflect that and show a flat line. Currently, it does this, but also passes on a warning from stats::density():

devtools::load_all("~/packages/ggplot2/")
#> ℹ Loading ggplot2
df <- data.frame(x = 1:3)

ggplot(df, aes(x)) +
  geom_density(aes(weight = c(1, 0, 0), y = after_stat(wdensity)),
               bounds = c(1.1, Inf))
#> Warning: Some data points are outside of `bounds`. Removing them.
#> Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel =
#> kernel, : sum(weights) != 1 -- will not get true density

Created on 2023-04-11 with reprex v2.0.2

I'm wondering if there is something to be improved here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. As far as I'm concerned you can handle it however you see fit, I just wanted to call out the inconsistency.

}

# Adjust data points and weights to all fit inside bounds
sample_data <- fit_data_to_bounds(bounds, x, w)
x <- sample_data$x
w <- sample_data$w
w_sum <- sample_data$w_sum * w_sum
nx <- length(x)

# if less than 2 points return data frame of NAs and a warning
Expand All @@ -135,6 +139,7 @@ compute_density <- function(x, w, from, to, bw = "nrd0", adjust = 1,
scaled = NA_real_,
ndensity = NA_real_,
count = NA_real_,
wdensity = NA_real_,
n = NA_integer_,
.size = 1
))
Expand All @@ -158,6 +163,7 @@ compute_density <- function(x, w, from, to, bw = "nrd0", adjust = 1,
scaled = dens$y / max(dens$y, na.rm = TRUE),
ndensity = dens$y / max(dens$y, na.rm = TRUE),
count = dens$y * nx,
wdensity = dens$y * w_sum,
n = nx,
.size = length(dens$x)
)
Expand All @@ -166,7 +172,7 @@ compute_density <- function(x, w, from, to, bw = "nrd0", adjust = 1,
# Check if all data points are inside bounds. If not, warn and remove them.
fit_data_to_bounds <- function(bounds, x, w) {
is_inside_bounds <- (bounds[1] <= x) & (x <= bounds[2])

w_sum <- 1
if (!all(is_inside_bounds)) {
cli::cli_warn("Some data points are outside of `bounds`. Removing them.")
x <- x[is_inside_bounds]
Expand All @@ -177,7 +183,7 @@ fit_data_to_bounds <- function(bounds, x, w) {
}
}

return(list(x = x, w = w))
return(list(x = x, w = w, w_sum = w_sum))
}

# Update density estimation to mitigate boundary effect at known `bounds`:
Expand Down
1 change: 1 addition & 0 deletions man/geom_density.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

18 changes: 13 additions & 5 deletions tests/testthat/test-stat-density.R
Original file line number Diff line number Diff line change
Expand Up @@ -17,20 +17,28 @@ test_that("stat_density actually computes density", {

test_that("stat_density can make weighted density estimation", {
df <- mtcars
df$weight <- mtcars$cyl / sum(mtcars$cyl)
df$weight <- mtcars$cyl

dens <- stats::density(df$mpg, weights = df$weight, bw = bw.nrd0(df$mpg))
dens <- stats::density(
df$mpg, weights = df$weight / sum(df$weight),
bw = bw.nrd0(df$mpg)
)
expected_density_fun <- stats::approxfun(data.frame(x = dens$x, y = dens$y))

plot <- ggplot(df, aes(mpg, weight = weight)) + stat_density()
actual_density_fun <- stats::approxfun(layer_data(plot)[, c("x", "y")])
plot <- layer_data(ggplot(df, aes(mpg, weight = weight)) + stat_density())
actual_density_fun <- stats::approxfun(plot[, c("x", "y")])

test_sample <- unique(df$mpg)
expect_equal(
expected_density_fun(test_sample),
actual_density_fun(test_sample),
tolerance = 1e-3
)

expect_equal(
plot$wdensity,
plot$density * sum(mtcars$cyl)
)
})

test_that("stat_density uses `bounds`", {
Expand Down Expand Up @@ -115,7 +123,7 @@ test_that("compute_density returns useful df and throws warning when <2 values",
expect_warning(dens <- compute_density(1, NULL, from = 0, to = 0))

expect_equal(nrow(dens), 1)
expect_equal(names(dens), c("x", "density", "scaled", "ndensity", "count", "n"))
expect_equal(names(dens), c("x", "density", "scaled", "ndensity", "count", "wdensity", "n"))
expect_type(dens$x, "double")
})

Expand Down
Loading