Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eliminating copies in reference assignment #4767

Open
kdkavanagh opened this issue Oct 20, 2020 · 3 comments
Open

Eliminating copies in reference assignment #4767

kdkavanagh opened this issue Oct 20, 2020 · 3 comments
Labels

Comments

@kdkavanagh
Copy link

Curious what steps I can take to eliminate these copies made due to high refcounts for objects that shouldnt(?) have any external references. Only real information I can find on this subject is Here

x = 100
df[,`:=`(
    newCol1=shift( sequence( rle(verifiedTs==F)$lengths )*( verifiedTs==F ), fill=0L),
    newCol2= customRcppFn(colA, x)
  ), verbose=T]
Detected that j uses these columns: verifiedTs, colA
Assigning to all 129989456 rows
RHS_list_of_columns == true
RHS for item 1 has been duplicated because NAMED==7 MAYBE_SHARED==1, but then is being plonked. length(values)==2; length(cols)==2)
RHS for item 2 has been duplicated because NAMED==7 MAYBE_SHARED==1, but then is being plonked. length(values)==2; length(cols)==2)
@jangorecki
Copy link
Member

Thanks for question.
For future please try to include reproducible example.
Personally I would play a little bit by computing {newCol1=...; print(address(newCol1)); ...} and then print address(ans$newCol1) to double check the verbose message.
I am not specialist here, but there may be not much space for us to eliminate those copies. If we cannot be absolutely sure about refs then we must copy. What if user's shift function return existing vector from his environment? not doing copy we are putting his vector on risk of being modified when df will be modified later on.

@kdkavanagh
Copy link
Author

kdkavanagh commented Oct 20, 2020

Thank Jan, reprex below. Explanation makes sense - Short of data.table being able to optimize this correctly automatically, I'm curious if there's anything I can do manually to hint to dt that this copy need-not be made.

df = data.table(x=1:5, verifiedTs=sample(c(T,F), 5, replace=T))
df[,`:=`(
    colA=shift(sequence(rle(verifiedTs==F)$lengths)*(verifiedTs==F), fill=0L)
), verbose=T]

edit, session info:

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.7 (Maipo)

Matrix products: default
BLAS:   /usr/lib64/libblas.so.3.4.2
LAPACK: /usr/lib64/liblapack.so.3.4.2

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] microbenchmark_1.4-7 arrow_1.0.1.20201012 ggforce_0.3.2        viridis_0.5.1        viridisLite_0.3.0    ggridges_0.5.2      
 [7] forcats_0.5.0        logger_0.1           reticulate_1.14-9001 knitr_1.30           stringr_1.4.0        optparse_1.6.6      
[13] jsonlite_1.7.1       ggplot2_3.3.2        memoise_1.1.0        glue_1.4.2           jumpr_1.0.0          data.table_1.13.2   
[19] bit64_4.0.5          bit_4.0.4                     

loaded via a namespace (and not attached):
 [1] pkgload_1.1.0      foreach_1.5.1      assertthat_0.2.1   askpass_1.1        remotes_2.2.0      yaml_2.2.1        
 [7] sessioninfo_1.1.1  pillar_1.4.6       backports_1.1.10   lattice_0.20-41    digest_0.6.26      RColorBrewer_1.1-2
[13] polyclip_1.10-0    colorspace_1.4-1   Matrix_1.2-18      plyr_1.8.6         pkgconfig_2.0.3    devtools_2.3.2    
[19] purrr_0.3.4        scales_1.1.1       processx_3.4.4     tweenr_1.0.1       getopt_1.20.3      tibble_3.0.4      
[25] openssl_1.4.3      generics_0.0.2     farver_2.0.3       usethis_1.6.3      ellipsis_0.3.1     withr_2.3.0       
[31] cli_2.1.0          magrittr_1.5       crayon_1.3.4       ps_1.4.0           fs_1.5.0           fansi_0.4.1       
[37] MASS_7.3-53        pkgbuild_1.1.0     tools_3.6.1        prettyunits_1.1.1  RcppCCTZ_0.2.9     lifecycle_0.2.0   
[43] munsell_0.5.0      callr_3.5.1        compiler_3.6.1     RcppMsgPack_0.2.3  rlang_0.4.8        grid_3.6.1        
[49] iterators_1.0.13   rstudioapi_0.11    rappdirs_0.3.1     testthat_2.3.2     gtable_0.3.0       codetools_0.2-16  
[55] curl_4.3           R6_2.4.1           gridExtra_2.3      zoo_1.8-8          dplyr_1.0.2        rprojroot_1.3-2   
[61] desc_1.2.0         stringi_1.5.3      Rcpp_1.0.5         vctrs_0.3.4        nanotime_0.3.2     tidyselect_1.1.0  
[67] xfun_0.18  

@jangorecki
Copy link
Member

There are no any hints like this. That would be a FR which I am not sure if we want to have in scope of package. It open doors for similar issues like setDT, so using it when as.data.table is more appropriate. WDYT @mattdowle

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants