Skip to content

Weighted density #5254

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
May 20, 2024
Merged

Weighted density #5254

merged 4 commits into from
May 20, 2024

Conversation

teunbrand
Copy link
Collaborator

This PR aims to fix #4176.

It adds the wdensity computed variable, which is the density times the sum of weights.
If weights are absent, this reduces to the same as the count computed variable.

An example:

devtools::load_all("~/packages/ggplot2/")
#> ℹ Loading ggplot2

set.seed(42)
df <- data.frame(x = rnorm(100), y = abs(rnorm(100)), type = c(1, 5))

p <- ggplot(df, aes(x, colour = factor(type)))

# Weights are pointless here because they are by group
p + geom_density(aes(weight = type, y = after_stat(count)))

# Less pointless weights
p + geom_density(aes(weight = type, y = after_stat(wdensity)))

Created on 2023-03-30 with reprex v2.0.2

@thomasp85 thomasp85 requested a review from clauswilke April 11, 2023 07:31
@thomasp85
Copy link
Member

I'm fine with this, but looping in @clauswilke as he seemed to have more opinions on design in the issue thread

Copy link
Member

@clauswilke clauswilke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have two relatively minor comments (see code annotations) and one bigger one. The bigger one is: Have you checked how the weights interact with the density reflection when there are bounds imposed? Is there anything that needs to be done or are things behaving as expected. I'm not sure I remember what the density reflection does, so it may well be that everything is correct as is, just wanted to be certain.

if (is.null(w)) {
w <- rep(1 / nx, nx)
} else {
w <- w / sum(w)
w_sum <- sum(w)
w <- w / w_sum
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function fit_data_to_bounds() checks for division by zero but here you're not doing that. I think in weird corner cases with zero weights this could give you different results depending on whether a point is inside the bounds or not.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been trying to come up with examples that should be valid, but aren't, but I'm not having an easy time of it. If you censor a non-zero weight and are left with all zero weights, the weighted density should reflect that and show a flat line. Currently, it does this, but also passes on a warning from stats::density():

devtools::load_all("~/packages/ggplot2/")
#> ℹ Loading ggplot2
df <- data.frame(x = 1:3)

ggplot(df, aes(x)) +
  geom_density(aes(weight = c(1, 0, 0), y = after_stat(wdensity)),
               bounds = c(1.1, Inf))
#> Warning: Some data points are outside of `bounds`. Removing them.
#> Warning in density.default(x, weights = w, bw = bw, adjust = adjust, kernel =
#> kernel, : sum(weights) != 1 -- will not get true density

Created on 2023-04-11 with reprex v2.0.2

I'm wondering if there is something to be improved here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. As far as I'm concerned you can handle it however you see fit, I just wanted to call out the inconsistency.

@teunbrand
Copy link
Collaborator Author

Boundary inflection should work as intended. In the inflected layer (solid line), the two peaks should have equal height because the inflected left bit has half the weight that gets doubled through inflection. This works similar to regular after_stat(density) but the between-group difference in weight should be apparent by the reddish line running half as high as the blueish line.

devtools::load_all("~/packages/ggplot2/")
#> ℹ Loading ggplot2

df <- data.frame(x = c(1, 5, 1, 5),
                 group = c("A", "A", "B", "B"),
                 weight = c(1, 2, 2, 4))

ggplot(df, aes(x, colour = group)) +
  geom_density(aes(weight = weight, y = after_stat(wdensity)),
               bounds = c(1, Inf)) +
  geom_density(aes(weight = weight, y = after_stat(wdensity)),
               linetype = "dotted") +
  xlim(-2, 8)

Created on 2023-04-11 with reprex v2.0.2

Copy link
Member

@thomasp85 thomasp85 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Merge branch 'main' into density_wcount

# Conflicts:
#	R/stat-density.R
#	tests/testthat/test-stat-density.R
@teunbrand teunbrand merged commit 1df93c4 into tidyverse:main May 20, 2024
11 checks passed
@teunbrand teunbrand deleted the density_wcount branch May 20, 2024 09:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

In stat_density(), the calculated count aesthetic should take weights into account
3 participants