[[ by group takes forever (24 hours +) with v1.13.0 vs 4 seconds with v1.12.8 #4646

Closed
fabiocs8 opened this issue Jul 27, 2020 · 7 comments · Fixed by #4655
@fabiocs8

I have a script that takes only 4 seconds to process in v1.12.8, and more than 24 hours when using v1.13.0. All other variables are the same (computer, R and RStudio versions, dataset, etc.).

All relevant and detailed information about this issue is contained in this Stack Overflow question:

https://stackoverflow.com/questions/63105711/why-data-table-unnesting-time-grows-with-the-square-of-number-of-rows-for-a-spec
I am not a programmer, so I am not able to go through the code. Below you will find the sessionInfo() regarding the environment AFTER changing back to v1.12.8.

Please let me know whether I can be of any further help.

BTW, congratulations on this amazing game-changer of a package.

Regards,

Fabio.

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252 LC_MONETARY=Portuguese_Brazil.1252
[4] LC_NUMERIC=C LC_TIME=Portuguese_Brazil.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] microbenchmark_1.4-7 data.table_1.12.8 lubridate_1.7.9 stringi_1.4.6 runner_0.3.7 e1071_1.7-3
[7] ggplot2_3.3.2 stringr_1.4.0 magrittr_1.5

loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 pillar_1.4.6 compiler_4.0.2 class_7.3-17 tools_4.0.2 digest_0.6.25 packrat_0.5.0 evaluate_0.14
[9] lifecycle_0.2.0 tibble_3.0.3 gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.7 rstudioapi_0.11 yaml_2.2.1 xfun_0.16
[17] withr_2.2.0 dplyr_1.0.0 knitr_1.29 generics_0.0.2 vctrs_0.3.2 grid_4.0.2 tidyselect_1.1.0 glue_1.4.1
[25] R6_2.4.1 rmarkdown_2.3 purrr_0.3.4 scales_1.1.1 ellipsis_0.3.1 htmltools_0.5.0 colorspace_1.4-1 tinytex_0.25
[33] munsell_0.5.0 crayon_1.3.4

@mattdowle
Member

mattdowle commented Jul 27, 2020

Many thanks for the great report; I've downloaded your RDS from S.O.

There was a change to [[ by group in this release, so we'll have to look into that; see bug fix 5 in the news:

GForce is deactivated for [[ on non-atomic input, part of #4159. Thanks @hongyuanjia and @ColeMiller1 for helping debug an issue in dev with the original fix before release, #4612.

The verbose output in the S.O. post (thanks) shows the following:

Making each group and running j (GForce FALSE) ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.

That is more a clue for us to fix in this case than an action for the user.
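
To make the message concrete, here is a minimal sketch (toy data, not the reporter's RDS) of the two j forms it distinguishes; the setNames variant is a hypothetical stand-in for any call that builds names inside j:

library(data.table)
dt = data.table(g = rep(1:2, each = 3L), x = 1:6)

# names supplied directly in j = list(...) are detected once, removed,
# and re-attached after grouping, so this stays on the fast path:
dt[, .(s = sum(x)), by = g]

# names built inside another call are recreated for every group; this
# pattern should print the message quoted above:
dt[, setNames(list(sum(x)), "s"), by = g, verbose = TRUE]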

@mattdowle mattdowle added this to the 1.13.1 milestone Jul 27, 2020
@mattdowle mattdowle changed the title Version 1.13.0 takes forever (24 hours +) instead of 4 seconds when using version 1.12.8 [[ by group takes forever (24 hours +) with v1.13.0 vs 4 seconds with v1.12.8 Jul 27, 2020
@MichaelChirico
Member

I had the same initial thought about the cause, but SO also reported the same slowdown for an unlist command.

Anyway, there's repro data now, which makes all the difference. Thanks @fabiocs8!

@fabiocs8
Author

fabiocs8 commented Jul 28, 2020 via email

@fabiocs8
Author

fabiocs8 commented Jul 28, 2020 via email

@ColeMiller1
Contributor

ColeMiller1 commented Jul 28, 2020

Thanks for doing all the work of re-installing 1.12.8 to provide additional information. This type of info is great!

The most interesting thing in 1.13.0 is that there is a major performance difference between dt[, v[[1L]], by = grps] and dt[seq_len(.N), v[[1L]], by = grps] for your RDS. I could only reproduce the performance degradation using the RDS, not the simulated data. Here is some of the verbose output in 1.13.0:

##dt[seq_len(.N), ...]
collecting discontiguous groups took 0.000s for 5000 groups 

##dt[, ...]
memcpy contiguous groups took 50.502s for 5000 groups 

It does appear to be a regression from 1.12.8. If I deactivate the GForce optimization using parentheses (e.g., (val[[1L]])), I get very different results:

##dt[seq_len(.N), (...), ...]
collecting discontiguous groups took 0.004s for 5000 groups 

##dt[, (...), ...]
memcpy contiguous groups took 0.000s for 5000 groups 

So the slowdown appears to be related to memcpy. For now, a workaround in 1.13.0 may be item_res[seq_len(.N), lance[[1]], by = item_id].
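
For reference, the timing lines quoted above come from data.table's verbose mode, which can be enabled per call or session-wide (dt, value, and id as in the simulated data below):

# per call:
dt[, value[[1L]], by = id, verbose = TRUE]
# or for the whole session:
options(datatable.verbose = TRUE)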

In the meantime, here's a way to reproduce the issue with simulated data:

library(data.table) #1.13.0

set.seed(123L)
n = 500L
n_nested = 40L
dt = data.table(id = seq_len(n),
                value = replicate(n, data.table(val1 = sample(n_nested)), simplify = FALSE))

bench::mark(
    dt[seq_len(.N), value[[1L]], by = id]
    , dt[, value[[1L]], by = id]
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression                                 min   median `itr/sec` mem_alloc
#>   <bch:expr>                            <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 dt[seq_len(.N), value[[1L]], by = id]   2.56ms   3.05ms    287.      2.24MB
#> 2 dt[, value[[1L]], by = id]            219.41ms 250.98ms      3.98   51.91MB
#> # ... with 1 more variable: `gc/sec` <dbl>

Using profmem, the seq_len(.N) approach makes 29 allocations, while the NULL i version makes around 250,000 allocations, many of them 208 bytes each.

library(profmem)
profmem( dt[seq_len(.N), value[[1L]], by = id])
profmem( dt[, value[[1L]], by = id]) ## SLOW!

@fabiocs8
Author

I noticed other related weird behaviors. All the commands below were run under v1.12.8 on the full database:

1) Problems when using item[, lance[[1]], by = unnest_names, verbose = T]; see below:

> microbenchmark(
    item_int <- item[, lance[[1]], by = unnest_names, verbose = T]
    , times = 1L)

Detected that j uses these columns: lance
Finding groups using forderv ... forder.c received 872581 rows and 11 columns
0.170s elapsed (0.200s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'lance[[1]]'
GForce optimized j to 'g[[(lance, 1)'
Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.002
gforce assign high and low took 0.002
gforce eval took 0.035
0.130s elapsed (0.140s cpu)
Unit: seconds
                                                           expr      min       lq     mean   median       uq      max neval
 item_int <- item[, lance[[1]], by = unnest_names, verbose = T] 1.175517 1.175517 1.175517 1.175517 1.175517 1.175517     1

> item_int
Error in `[.data.table`(x, i, , ) :
  Column 12 ['V1'] is a data.frame or data.table; malformed data.table.

Note that 1) data.table did not issue any error or warning, and 2) the output has the same number of rows as the input (872,581), which does not make sense (the output should have 16,070,070 rows).

2) item_int <- item[, unlist(lance, recursive = F), by = unnest_names] runs smoothly and correctly:

> microbenchmark(
    item_int <- item[, unlist(lance, recursive = F), by = unnest_names]
    , times = 5L)

Detected that j uses these columns: lance
Finding groups using forderv ... forder.c received 872581 rows and 11 columns
0.140s elapsed (0.120s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'unlist(lance, recursive = F)'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... dogroups: growing from 872581 to 18513352 rows
Wrote less rows (16070070) than allocated (18513352).

memcpy contiguous groups took 0.074s for 872581 groups
eval(j) took 1.400s for 872581 calls
8.200s elapsed (6.940s cpu)
Unit: seconds
                                                                                expr      min       lq     mean median       uq      max neval
 item_int <- item[, unlist(lance, recursive = F), by = unnest_names, verbose = TRUE] 5.777739 8.352493 8.204699 8.3721 8.698475 9.822689     5

Unit: seconds
                                                                expr      min       lq    mean   median       uq      max neval
 item_int <- item[, unlist(lance, recursive = F), by = unnest_names] 8.022987 8.245593 8.93737 8.914378 8.982508 10.52138     5

3) item_int <- item[, rbindlist(lance), by = unnest_names] runs correctly, but takes a long time... does that make sense?

> microbenchmark(
    item_int <- item[, rbindlist(lance), by = unnest_names, verbose = TRUE]
    , times = 1L)

Detected that j uses these columns: lance
Finding groups using forderv ... forder.c received 872581 rows and 11 columns
0.300s elapsed (0.280s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'rbindlist(lance)'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
dogroups: growing from 872581 to 18513352 rows
Wrote less rows (16070070) than allocated (18513352).

memcpy contiguous groups took 0.113s for 872581 groups
eval(j) took 98.270s for 872581 calls
00:01:43 elapsed (00:01:39 cpu)
Unit: seconds
                                                                    expr      min       lq     mean   median       uq      max neval
 item_int <- item[, rbindlist(lance), by = unnest_names, verbose = TRUE] 103.8269 103.8269 103.8269 103.8269 103.8269 103.8269     1

Regards,

Fabio.
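
A one-shot alternative to item 3 above would avoid calling rbindlist once per group by binding all the nested tables in a single call and repeating the id columns to match. This is a sketch under the assumptions that unnest_names is a character vector of id column names and every element of lance is a data.table; it is untested against the real data:

library(data.table)
# rows contributed by each nested table
nr = vapply(item$lance, nrow, integer(1L))
# repeat the id columns nr times each, then bind the nested tables once;
# rbindlist preserves input order, so the rows line up with the repeats
item_int = cbind(item[rep(seq_len(.N), nr), ..unnest_names],
                 rbindlist(item$lance))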

@ColeMiller1
Contributor

ColeMiller1 commented Jul 29, 2020

Additional research: #4164 is the change where this started.

## merged commit before #4164
## remotes::install_github("Rdatatable/data.table@793f8545c363d222de18ac892bc7abb80154e724") # good

##  expression                                 min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result                    memory               time         gc             
##  <bch:expr>                            <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>                    <list>               <list>       <list>         
##1 dt[seq_len(.N), value[[1L]], by = id]   3.68ms   4.06ms      232.    2.26MB     5.11    91     2      392ms <data.table[,2] [20,000 ~ <Rprofmem[,3] [260 ~ <bch:tm [93~ <tibble [93 x ~
##2 dt[, value[[1L]], by = id]              2.06ms   2.29ms      430.  401.24KB     4.16   207     2      481ms <data.table[,2] [20,000 ~ <Rprofmem[,3] [17 x~ <bch:tm [20~ <tibble [209 x~

## final commit with merged #4164 
## remotes::install_github("Rdatatable/data.table@4aadde8f5a51cd7c8f3889964e7280432ec65bbc") # bad - merged commit
##  expression                                 min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result                   memory                  time        gc            
##  <bch:expr>                            <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>                   <list>                  <list>      <list>        
##1 dt[seq_len(.N), value[[1L]], by = id]   1.25ms   1.39ms    572.      2.26MB    12.0    286     6      500ms <data.table[,2] [20,000~ <Rprofmem[,3] [260 x 3~ <bch:tm [2~ <tibble [286 ~
##2 dt[, value[[1L]], by = id]            173.01ms 188.94ms      5.29   52.03MB     5.29     3     3      567ms <data.table[,2] [20,000~ <Rprofmem[,3] [251,669~ <bch:tm [3~ <tibble [3 x ~

The last thing I looked into: if byval is ordered (i.e., length(o__) == 0L), memrecycle is triggered. So while ideally memrecycle would be made more performant on lists, another way would be to take the contiguous memcpy path only when length(o__) == 0L && !any_col_is_list (i.e., when there is a list column, use subsetVectorRaw instead). Alternatively we could look more into an unnest function (#4156).
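
At the R level that guard is just a check for list columns; a minimal sketch (the function name is hypothetical, not data.table API):

# TRUE if any column is a list column (VECSXP at the C level)
any_col_is_list = function(dt) any(vapply(dt, is.list, logical(1L)))
any_col_is_list(data.table::data.table(a = 1:2, b = list(1, 2)))  # TRUE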

Code choosing between the discontiguous and contiguous paths:

data.table/src/dogroups.c

Lines 161 to 179 in db61844

if (LENGTH(order)==0) {
  const int rownum = grpn ? istarts[i]-1 : -1;
  for (int j=0; j<grpn; ++j) iI[j] = rownum+j+1;
  if (rownum>=0) {
    for (int j=0; j<length(SDall); ++j)
      memrecycle(VECTOR_ELT(SDall,j), R_NilValue, 0, grpn, VECTOR_ELT(dt, INTEGER(dtcols)[j]-1), rownum, grpn, j+1, "Internal error assigning to SDall");
    for (int j=0; j<length(xSD); ++j)
      memrecycle(VECTOR_ELT(xSD,j), R_NilValue, 0, 1, VECTOR_ELT(dt, INTEGER(xjiscols)[j]-1), rownum, 1, j+1, "Internal error assigning to xSD");
  }
  if (verbose) { tblock[0] += clock()-tstart; nblock[0]++; }
} else {
  for (int k=0; k<grpn; ++k) iI[k] = iorder[ istarts[i]-1 + k ];
  for (int j=0; j<length(SDall); ++j) {
    // this is the main non-contiguous gather, and is parallel (within-column) for non-SEXP
    subsetVectorRaw(VECTOR_ELT(SDall,j), VECTOR_ELT(dt,INTEGER(dtcols)[j]-1), I, /*anyNA=*/false);
  }
  if (verbose) { tblock[1] += clock()-tstart; nblock[1]++; }
  // The two blocks have separate timing statements to make sure which is running
}

Performance difference:

library(data.table) #1.13.0

set.seed(123L)
n = 500L
n_nested = 40L
dt = data.table(ordered_id = seq_len(n),
                unordered_id = sample(n),
                value = replicate(n, data.table(val1 = sample(n_nested)), simplify = FALSE))

bench::mark(
    dt[, value[[1L]], by = ordered_id]
    , dt[, value[[1L]], by = unordered_id]
    , check = FALSE
    , time_unit = "ms" 
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression                              min median `itr/sec` mem_alloc
#>   <bch:expr>                            <dbl>  <dbl>     <dbl> <bch:byt>
#> 1 dt[, value[[1L]], by = ordered_id]   93.5   124.        8.25    53.7MB
#> 2 dt[, value[[1L]], by = unordered_id]  0.634   1.69    629.     409.2KB
#> # ... with 1 more variable: `gc/sec` <dbl>

I will work on a PR this weekend for the !any_col_is_list condition to fall back to subsetVectorRaw, unless I hear there's a better route to fix the root cause.
