-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Add bounds
argument to geom_density()
#4013
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Examples of usage: library(ggplot2)
library(tibble)
set.seed(101)
ggplot(tibble(x = runif(100)), aes(x)) +
geom_density() +
geom_density(bounds = c(0, 1), color = "blue") +
stat_function(fun = dunif, color = "red") ggplot(tibble(x = rexp(100)), aes(x)) +
geom_density() +
geom_density(bounds = c(0, Inf), color = "blue") +
stat_function(fun = dexp, color = "red") Created on 2020-05-19 by the reprex package (v0.3.0) |
@clauswilke do you have time to take a look at this - I don't have any problems with the functionality but I'm quite rusty on density estimation code |
This is not pressing btw - it will not be part of the patch release |
Sure, assigned to myself. @echasnovski I will review within the next few weeks or so. If I don't, please ping me to remind me. |
|
Apologies, please ping again after May 10. I have two more weeks of teaching and it's using up nearly all my time. |
Sure, no worries :) |
@clauswilke, how about |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks reasonable. But please add some comments to document the logic, and also explain the logic in the exported documentation.
R/stat-density.r
Outdated
@@ -122,8 +129,15 @@ compute_density <- function(x, w, from, to, bw = "nrd0", adjust = 1, | |||
), n = 1)) | |||
} | |||
|
|||
dens <- stats::density(x, weights = w, bw = bw, adjust = adjust, | |||
kernel = kernel, n = n, from = from, to = to) | |||
if (all(is.infinite(bounds))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add some comments here explaining the thought process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
R/stat-density.r
Outdated
@@ -15,6 +15,8 @@ | |||
#' not line-up, and hence you won't be able to stack density values. | |||
#' This parameter only matters if you are displaying multiple densities in | |||
#' one plot or if you are manually adjusting the scale limits. | |||
#' @param bounds Known lower and upper bounds for estimated data. Default | |||
#' `c(-Inf, Inf)` means that there are no (finite) bounds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be some explanation somewhere in the documentation of what happens when the bounds are finite. And what happens if data points fall outside the set bounds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was more of an open question to package authors. What do you think is the best behavior here? It mostly depends on the implied severity of bounds argument.
Currently no check is done. It estimates density and then tries to "reflect" it at edges. There are some extreme cases (like when bounds and data range not intersect at all), which I decided to tackle after the decision about bounds severity.
I can suggest at least three options:
- Give a warning when at least one point is outside of bounds. Proceed as nothing happened. Implies that bounds is essentially for visualization purpose.
- Remove points outside of bounds before or after estimating density. This implies that bounds is a data generation assumption and tries to fix it.
- Give an error if at least one point is outside. This also implies data generation but stops early. I personally like this the most.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documented about what is essentially done if any of bounds
is finite.
The severity of bounds
effect is still a question to you.
With this PR, would the |
As I understand it, Proposed |
@clauswilke are you interested in finishing this off with @echasnovski ? |
I have already forgotten the details of it :( |
Same here. Forgotten the details, but I think it's reasonable to revive. |
8b9734d
to
9c5eeef
Compare
I addressed previous comments. All other work seems to be better done after consensus about how severe |
@clauswilke I'll let you decide on the final points for this PR |
@echasnovski Sorry, I missed your update from May 12. Please always make sure to tag me, as I don't currently have the time to read every ggplot2 notification that comes my way but I try to pay attention to cases where I'm tagged. Regarding bounds, I'm not generally in favor of any approach that errors out when it doesn't have to. Sometimes users want to do weird stuff for valid reasons, and we shouldn't prevent them from doing so. I think issuing a warning but otherwise performing a reasonable action is the way to go. So maybe remove points outside of bounds and warn that this has happened? This would be similar to the treatment of limits in scales functions. |
Hi, @clauswilke! I updated this to remove data points outside of I imagine, some tests should be done, but I didn't find any present substantial tests for either |
@echasnovski Yes, some tests would be good. Just because there is no test currently for Also test that no adjustments are made when no boundaries are set. You probably also have to merge in the current main branch, specifically to get the tests working again. There is a failure currently that I don't think is caused by your code. @thomasp85 I'll be traveling through the remainder of July, with limited internet access. I may not be able to respond to developments in this PR. You're welcome to take over if necessary. I'll be back in August. |
d239de8
to
5a625f7
Compare
@clauswilke, this showed to be not trivial. The reason is that Other than that, I added snapshot tests for different Also I didn't manage to incorporate weights into density estimation. Here is what I got: ggplot(mtcars, aes(mpg, weight = cyl)) + stat_density()
Warning message:
The following aesthetics were dropped during statistical transformation: weight
ℹ This can happen when ggplot fails to infer the correct grouping structure in the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical variable into a factor?
Not entirely sure what you mean here. No boundary correction with default |
tests/testthat/test-stat-density.R
Outdated
expect_snapshot(make_density(c(-Inf, Inf))) | ||
expect_snapshot(make_density(c(mpg_min, Inf))) | ||
expect_snapshot(make_density(c(-Inf, mpg_max))) | ||
expect_snapshot(make_density(c(mpg_min, mpg_max))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason that you're using snapshots here? Couldn't you just test numerically that density is approximately twice higher at finite boundary? It doesn't have to be that accurate. You could just test that it's > 1.9x and < 2.1x, for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@clauswilke, Snapshots seem to provide more full testing of the feature. This is not only about being twice as higher at finite boundary, but about a specific way of dealing with boundary correction (via density reflection). I can make snapshots smaller by setting smaller n
.
All in all, I'd keep them. Will do as you ask, of course. So should I not use snapshots?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like snapshots because if they break at some point in the future nobody will know what the correct result should have been. If you want to test the full reflection, then the right way to do this is to calculate the reflection again in the test and compare numerically. Yes, it may seem silly to run the same calculation in two locations and compare they're the same, but it actually is a good check for the correctness of the implementation in the main library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated test to not use snapshots.
I think your tests are good. I made some specific comments regarding one. I'm not a big fan of snapshot tests and believe they should be avoided whenever possible. Regarding weights, did you check whether the computation was correct? I see no reason that it shouldn't. The warning occurs because I missed to add the following line in my recent PR, could you add it to Line 177 in a979ffd
|
Yep, you are right. |
Yes, please keep |
|
This looks good to me now. Only one last thing: Please add a news item to the top of: https://github.com/tidyverse/ggplot2/blob/main/NEWS.md |
@clauswilke, done. |
@thomasp85 This looks good to me. Do you want to give it a quick look also before we merge? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@echasnovski Thanks! |
@clauswilke, @thomasp85, thank you. The pleasure is mine to finally give back to the R package for visualization. |
This is a PR for #3387 and serves as a place for discussion and feedback about possible change.
The goal is to add
bounds
argument togeom_density()
andstat_density()
, which will allow users to specify known constraints on data to obtain more "realistic" density estimation. Change to computed output ofstats::density()
is made using relatively simple "reflection method of boundary correction" roughly taken from "Simple boundary correction for kernel density estimation” by M.C. Jones, 1993 (you can look at free pdf-file of presentation based on this article, in particular, page 7).Currently this PR serves two purposes:
If agreed that this is a desirable change, at least the following should also be made (which I am happy to do):
bounds
interval, or simply it doesn't represent a valid interval.bounds
usage.