Skip to content

Let check_nondata_cols allow Vector class (#3835) #3837

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Mar 8, 2020
Merged

Let check_nondata_cols allow Vector class (#3835) #3837

merged 5 commits into from
Mar 8, 2020

Conversation

teunbrand
Copy link
Collaborator

This PR comes from the discussion over at #3835.
The change in code does only one thing: it add inherits(x, "Vector") to the allowed data columns check.
This would allow a large number of classes that exhibit vector-like behaviour from Bioconductor to be valid data columns.

I was experimenting around with this, and realised that this PR is necessary but not sufficient to let the S4 classes reach the scale transforming stage of plot building. I'd like to reach that stage of plot building since the scales are extendable and I can work with the S4 classes from there onwards.

What I would additionally like is to drop/replace the as_tibble() in as_gg_data_frame() since it also throws errors with S4 classes. Relevant code here:

ggplot2/R/utilities.r

Lines 379 to 393 in 53edc51

# Check inputs with tibble but allow column vectors (see #2609 and #2374)
as_gg_data_frame <- function(x) {
x <- lapply(x, validate_column_vec)
new_data_frame(tibble::as_tibble(x))
}
validate_column_vec <- function(x) {
if (is_column_vec(x)) {
dim(x) <- NULL
}
x
}
is_column_vec <- function(x) {
dims <- dim(x)
length(dims) == 2L && dims[[2]] == 1L
}

I ran the unit tests with the code changed from new_data_frame(tibble::as_tibble(x)) to new_data_frame(x) and this both didn't seem to break any tests and seemed sufficient for the S4 classes to reach the scales. Disclaimer: I couldn't install the sf package, so these tests were skipped. Relevant issues to consider should be #2609 and #2374.

Thanks for reading and considering!

@clauswilke
Copy link
Member

The as_tibble() call has bothered me for a while, because we're outsourcing our data validation to an external package, which may have a different agenda. If a check is needed there, I think it should also be based on the check_nondata_cols() function or a variation thereof. (I think in the code base, the as_tibble() call was there before check_nondata_cols() was added.)

However, it would be good to get input from @thomasp85 and/or @yutannihilation. There may be something I'm missing.

@teunbrand
Copy link
Collaborator Author

I've meanwhile managed to also run the tests with the sf package, which don't seem to be affected.

@yutannihilation
Copy link
Member

If a check is needed there, I think it should also be based on the check_nondata_cols() function or a variation thereof. (I think in the code base, the as_tibble() call was there before check_nondata_cols() was added.)

Sounds good to me! I've been wondering how we can replace as_tibble() for a long time ( #3018 (comment))...

@teunbrand
Copy link
Collaborator Author

teunbrand commented Feb 29, 2020

I might not be particularly aware what function the as_tibble call is. The check that tibble does that hinders the S4 classes is:

print(tibble:::is_1d_or_2d) # is a check in tibble:::check_valid_cols
#> function (x) 
#> {
#>     (is_vector(x) && !needs_dim(x)) || is.data.frame(x) || is.matrix(x)
#> }
#> <bytecode: 0x4272488>
#> <environment: namespace:tibble>

I tried to explore what behaviour would be different if we'd remove the as_tibble, and found two cases that would act different.

# heads up: this data.frame does not print prettily
df <- ggplot2:::new_data_frame(list(x = matrix(1:9, 3)))

ggplot(df, aes(col(x), row(x), fill = x)) +
  geom_tile()

The above would throw an error with as_tibble() and yields a normal plot without it (since the validate_column_vec() function already linearises the matrix).

Also the following case:

df <- data.frame(x = 1)
# Manual assignment because data.frame constructor automatically converts to POSIXct 
df$y <- as.POSIXlt(Sys.time(), "GMT")

ggplot(df, aes(x, y)) +
  geom_point()

Returns this error with as_tibble():

Error: Column y is a date/time and must be stored as POSIXct, not POSIXlt.

And this error without as_tibble():

Error: Invalid input: time_trans works with objects of class POSIXct only

I think the matrix case is actually kind of convenient and the error messages about POSIXlt are both indicating what class is causing trouble. Would it be acceptable to remove the as_tibble() call in as_gg_data.frame()?

@clauswilke
Copy link
Member

Let's assess the need for as_tibble() systematically. I can find only two cases where as_gg_data_frame() is used. First in Geom$use_defaults():

ggplot2/R/geom-.r

Lines 120 to 124 in 660aad2

if (empty(data)) {
data <- as_gg_data_frame(missing_eval)
} else {
data[names(missing_eval)] <- missing_eval
}

Second in Layer$compute_aesthetics():

ggplot2/R/layer.r

Lines 265 to 268 in 660aad2

evaled <- lapply(evaled, unname)
evaled <- as_gg_data_frame(evaled)
evaled <- add_group(evaled)
evaled

I think the as_tibble() check isn't needed in either. In Geom$use_defaults(), the check is run on the default aesthetic values of geoms, and I think it's reasonable to assume that they are valid quantities. Also, the else branch in the same code doesn't run the check. In Layer$compute_aesthetics(), we already have the check for non-data columns, so I don't think we need to check again:

ggplot2/R/layer.r

Lines 237 to 246 in 660aad2

# Check aesthetic values
nondata_cols <- check_nondata_cols(evaled)
if (length(nondata_cols) > 0) {
msg <- paste0(
"Aesthetics must be valid data columns. Problematic aesthetic(s): ",
paste0(vapply(nondata_cols, function(x) {paste0(x, " = ", as_label(aesthetics[[x]]))}, character(1)), collapse = ", "),
". \nDid you mistype the name of a data column or forget to add after_stat()?"
)
abort(msg)
}

Am I missing a use of this function?

@thomasp85 Do you want to chime in? You wrote the as_tibble() line originally.

@clauswilke
Copy link
Member

While we're at it, I also found this, which is essentially equivalent to a call to as_gg_data_frame():

ggplot2/R/facet-.r

Lines 420 to 423 in 660aad2

eval_facets <- function(facets, data, env = globalenv()) {
vars <- compact(lapply(facets, eval_facet, data, env = env))
new_data_frame(tibble::as_tibble(vars))
}

@teunbrand
Copy link
Collaborator Author

Thank you very much Claus for assessing as_tibble across the entirety of ggplot. I've run the unit tests without the as_tibble in facet-.r too and it appears from my point of view that it wouldn't break anything.

I'm going to go ahead with removing the as_tibble() from as_gg_data_frame() in this PR; it doesn't really make sense to merge this when the S4 classes will throw errors a few lines of code later. It would be really cool for me if this would be merged, but if there are decent reasons to keep as_tibble() that can't easily be overcome, I'll understand.

@clauswilke
Copy link
Member

Yes, this looks good to me. I think the code in facet-.r needs to be handled separately, since we'd probably want an input check there.

@clauswilke
Copy link
Member

@thomasp85 Unless you have strong objections, I'd say let's go ahead with this. It seems the worst case scenario is a less informative error message in some strange corner case, and if that arises we can deal with it then.

@yutannihilation
Copy link
Member

Don't we need a NEWS bullet for this?

@clauswilke
Copy link
Member

I guess a news item would be good, so people are more broadly aware that this now works.

@teunbrand Could you add a 1-2 sentence news item in NEWS.md?

@teunbrand
Copy link
Collaborator Author

Sure thing. I didn't know how to do that without merging in the current master first, but I hope it is correct. Thanks everyone for the help!

@clauswilke
Copy link
Member

Merging in the current master is correct. However, I'd like to ask you to rephrase the news item. Usually we try to not focus on the technical details but instead write something that makes sense to the user. So in this case, I think it should be something like "ggplot2 now accepts data columns of type Vector, which are widely used in the Bioconductor project." The as_tibble() is a technical detail that I wouldn't put into the changelog at all.

@teunbrand
Copy link
Collaborator Author

Yes that indeed sounds more informative. Thanks!

@clauswilke clauswilke merged commit db58364 into tidyverse:master Mar 8, 2020
@clauswilke
Copy link
Member

Thanks!

@teunbrand teunbrand deleted the lenient_nondata_cols branch March 8, 2020 18:54
sthagen added a commit to sthagen/tidyverse-ggplot2 that referenced this pull request Mar 9, 2020
Let check_nondata_cols allow Vector class (tidyverse#3835) (tidyverse#3837)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants