Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Cannot print a data.table with column of type rvar from posterior package #5981

Open
SteveBronder opened this issue Mar 5, 2024 · 6 comments
Labels
non-atomic column e.g. list columns, S4 vector columns

Comments

@SteveBronder
Copy link

SteveBronder commented Mar 5, 2024

Summary

When printing a data.table that contains an rvar type from the posterior package an empty data.table is printed instead. See the mrp below for a full case. Looking into it more it appears dim.data.table returns a value of zero for the rows where there should be a nonzero value. dim returning a zero causes line 55 of print.data.table to go to the empty data.table message.

I tried looking into the code for dim:

data.table/src/wrappers.c

Lines 101 to 121 in 46ee525

SEXP dim(SEXP x)
{
// fast implementation of dim.data.table
if (TYPEOF(x) != VECSXP) {
error(_("dim.data.table expects a data.table as input (which is a list), but seems to be of type %s"),
type2char(TYPEOF(x)));
}
SEXP ans = PROTECT(allocVector(INTSXP, 2));
if(length(x) == 0) {
INTEGER(ans)[0] = 0;
INTEGER(ans)[1] = 0;
}
else {
INTEGER(ans)[0] = length(VECTOR_ELT(x, 0));
INTEGER(ans)[1] = length(x);
}
UNPROTECT(1);
return ans;
}

The data.table has a length(x) equal to 1 so the last branch is chosen.

Calling dim.data.frame with the data.table gives the correct output below of (4, 1). I think this is because .row_names_info used in dim.data.frame is just asking for the length of the row names via .Internal(shortRowNames(x, type)).

Minimal reproducible example

library(data.table)
library(posterior)
n <- 4   # length of output vector
x_rvar <- rvar(array(rnorm(n*n, mean = 1, sd = 1), dim = c(n, n)))
x_rvar
# rvar<4000>[4] mean ± sd:
#   [1] 0.98 ± 1.00  1.00 ± 1.02  1.00 ± 0.99  0.99 ± 0.99 

# does not work :(
ex_dt = data.table(ex = x_rvar)
ex_dt
# Empty data.table (0 rows and 1 cols): ex

ex_df = data.frame(ex = x_rvar)

# dim for data.frame has wrong rows
dim(ex_dt)
# [1] 0 1
dim(ex_df)
# [1] 4 1

dim.data.frame(ex_dt)
# [1] 4 1

# But `ex` does exist?
ex_dt[, ex]
# rvar<4>[4] mean ± sd:
# [1] 0.025 ± 0.58  0.759 ± 0.69  1.292 ± 1.14  1.601 ± 0.53

ex_dt[, mean(ex)]
[1] 0.02508674 0.75898825 1.29194188 1.60076280

# printing with print.data.frame works
print.data.frame(ex_dt)
# ex
# 1 0.025 ± 0.58
# 2 0.759 ± 0.69
# 3 1.292 ± 1.14
# 4 1.601 ± 0.53

dput(ex_dt)
# structure(list(ex = 
#  structure(list(), draws = structure(
#   c(-0.228122437024597, -0.665473091749799, 0.433053985537569, 0.56088850510324,  
#      1.57671830971838,  -0.104841641013198, 0.817921294855514, 0.746155054023318, 
#      0.778298953156937,  2.20224060437554, -0.078692684532738, 2.26592066385111,
#      1.21605524439755,   1.59716805846671,  2.34552924058062,  1.24429863711927), 
#      dim = c(4L, 4L), dimnames = list(c("1", "2", "3", "4"), NULL)), 
#      nchains = 1L, cache = <environment>, class = c("rvar",  "vctrs_vctr"))),
#   row.names = c(NA, -4L), class = c("data.table", "data.frame"), 
#   .internal.selfref = <pointer: 0x55cc4ae60f60>)

rownames(ex_dt)
# [1] "1" "2" "3" "4"

Looking over the open/closed issues for data.table I could not find anything similar. I don't know enough about R's internals to know why length(VECTOR_ELT(x, 0)) is giving a value of zero, though I found the source for VECTOR_ELThere

# Output of sessionInfo()

R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Rocky Linux 8.9 (Green Obsidian)

Matrix products: default
BLAS/LAPACK: FlexiBLAS /MNT/SW/NIX/STORE/LKQ7SR37F820WRZCLZ0VC9N6FG5ZB3GD-OPENBLAS-0.3.26/LIB/LIBOPENBLAS.SO;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] posterior_1.5.0   data.table_1.15.2 cmdstanr_0.7.1    testthat_3.2.1   

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5          cli_3.6.2            knitr_1.42           rlang_1.1.3          xfun_0.39            processx_3.8.3       pkgload_1.3.4        generics_0.1.3       tensorA_0.36.2.1    
[10] jsonlite_1.8.8       glue_1.7.0           backports_1.4.1      rprojroot_2.0.3      distributional_0.4.0 ps_1.7.6             brio_1.1.3           fansi_1.0.6          tibble_3.2.1        
[19] abind_1.4-5          lifecycle_1.0.4      compiler_4.3.2       dplyr_1.1.2          waldo_0.5.2          pkgconfig_2.0.3      rstudioapi_0.14      R6_2.5.1             tidyselect_1.2.0    
[28] utf8_1.2.4           pillar_1.9.0         magrittr_2.0.3       checkmate_2.3.1      withr_3.0.0          tools_4.3.2          matrixStats_1.2.0    desc_1.4.2    
@MichaelChirico
Copy link
Member

At root the issue is we don't do length() S3 dispatch.

Apparently (#1433) we wrote our own dim in the first place for performance reasons with [[.data.frame.

Maybe we should implement [[.data.table instead to just wrap VECTOR_ELT(x, i-1), then length() can dispatch correctly. Otherwise we need to do length() dispatch ourselves from C, right?

SteveBronder added a commit to SteveBronder/data.table that referenced this issue Mar 6, 2024
@SteveBronder
Copy link
Author

I have a PR at the link below to do the length dispatch in C. It's been a long time since I've written with R's C API so it may not be that nice. I also ran into another bug where rownames(toprint) was only giving back a length 1 vector so line 80 of print.data.table was failing

  if (isTRUE(row.names)) {
   # rownames was NULL and the rhs was length 4 so this threw an error
    rownames(toprint) = paste0(format(rn,right=TRUE,scientific=FALSE),":")
  } else {
    rownames(toprint) = rep.int("", nrow(toprint))
  }

Since vectors of rvar types have a dimension format_col.default was choosing the first branch and just returning "<multi-column>". imo I think that if condition should be kicked to the second to the last of the if else statements so that package writers have an opportunity to write the proper overwrite for format

https://github.com/Rdatatable/data.table/compare/master...SteveBronder:data.table:fix/print-rvar?expand=1

@SteveBronder
Copy link
Author

Actually @MichaelChirico I think we should close this PR / issue. See the discussion in the comments here. Essentially this is another case of #4415 (specifically the comment here).

At first I thought this was a data.table issue but now I think it's rvar not conforming to data.table

@MichaelChirico
Copy link
Member

Thanks for the reminder. Indeed I'd checked if rvar is S4, but it's just S3 which I thought should be easier to accommodate in general. And indeed I wouldn't expect generic computations on rvar columns in data.table to work.

But storing non-atomic columns in a data.table is still very useful for some high-level use cases, and something as basic as print() not working is surprising to me. If we can get a fix working that's fairly general and not terribly involved/inefficient, I think we should go for it.

@SteveBronder
Copy link
Author

Thanks for the reminder. Indeed I'd checked if rvar is S4, but it's just S3 which I thought should be easier to accommodate in general. And indeed I wouldn't expect generic computations on rvar columns in data.table to work.

Yes I just was messing with rbindlist.c and I think this is a bigger project than I want to take on atm.

But storing non-atomic columns in a data.table is still very useful for some high-level use cases, and something as basic as print() not working is surprising to me. If we can get a fix working that's fairly general and not terribly involved/inefficient, I think we should go for it.

Would you like me to open up a PR for the fix branch I posted above?

@MichaelChirico
Copy link
Member

Would you like me to open up a PR for the fix branch I posted above?

that'd be great!

@MichaelChirico MichaelChirico added the non-atomic column e.g. list columns, S4 vector columns label Apr 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
non-atomic column e.g. list columns, S4 vector columns
Projects
None yet
Development

No branches or pull requests

2 participants