optimize the `left_str %in right_str & other-cond` queries #4799

shrektan · 2020-11-04T08:58:47Z

On Windows (but my example should be able to be reproducible on other OS), DT seems optimized the left_str %in right_str query so that it's very fast even when the left_str and right_str are in different encodings. However, this is not true when left_str %in right_str query is used with other conditions, I mean, like left_str %in right_str & A>=B.

Would be good (if it's straightforward and not too difficult) to support this. A real-case is that I found a script executed very slow... and after one-hour's debugging, I finally located the root cause... As you can imagine (once you read the below example), it's very difficult and confusing to understand the problem in the 1st place.

Thanks.

library(data.table)
n <- 1e6
utf8 <- "fa\u00e7ile"
latin1 <- iconv(utf8, from = "UTF-8", to = "latin1")
text <- sample(latin1, n, TRUE)
tbl <- data.table(LATIN1 = text, UTF8 = enc2utf8(text), NO = seq_len(n))

# all very fast
system.time({
  tbl[LATIN1 %in% latin1]
})
#>    user  system elapsed 
#>    0.03    0.00    0.03
system.time({
  tbl[UTF8 %in% latin1]
})
#>    user  system elapsed 
#>    0.05    0.00    0.03
system.time({
  tbl[LATIN1 %in% utf8]
})
#>    user  system elapsed 
#>       0       0       0
system.time({
  tbl[UTF8 %in% utf8]
})
#>    user  system elapsed 
#>    0.02    0.00    0.02

# With another expession, it's very slow, is it possible to improve?
system.time({
  tbl[LATIN1 %in% latin1 & NO >= 50 & NO <= 100]
})
#>    user  system elapsed 
#>    5.34    0.00    5.36
system.time({
  tbl[UTF8 %in% latin1 & NO >= 50 & NO <= 100]
})
#>    user  system elapsed 
#>    5.31    0.03    5.35
system.time({
  tbl[LATIN1 %in% utf8 & NO >= 50 & NO <= 100]
})
#>    user  system elapsed 
#>   10.51    0.00   10.54
system.time({
  tbl[UTF8 %in% utf8 & NO >= 50 & NO <= 100]
})
#>    user  system elapsed 
#>    0.04    0.00    0.03

# interestingly, use `==` is not slow

system.time({
  tbl[LATIN1 %in% latin1 & NO == 100]
})
#>    user  system elapsed 
#>    0.01    0.00    0.02
system.time({
  tbl[UTF8 %in% latin1 & NO == 100]
})
#>    user  system elapsed 
#>       0       0       0
system.time({
  tbl[LATIN1 %in% utf8 & NO == 100]
})
#>    user  system elapsed 
#>       0       0       0
system.time({
  tbl[UTF8 %in% utf8 & NO == 100]
})
#>    user  system elapsed 
#>       0       0       0

^{Created on 2020-11-04 by the reprex package (v0.3.0)}

Session info

devtools::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                         
#>  version  R version 4.0.2 (2020-06-22)  
#>  os       Windows 10 x64                
#>  system   x86_64, mingw32               
#>  ui       RTerm                         
#>  language en                            
#>  collate  Chinese (Simplified)_China.936
#>  ctype    Chinese (Simplified)_China.936
#>  tz       Asia/Taipei                   
#>  date     2020-11-04                    
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version    date       lib source        
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.0.0)
#>  backports     1.1.7      2020-05-13 [1] CRAN (R 4.0.0)
#>  callr         3.4.3      2020-03-28 [1] CRAN (R 4.0.0)
#>  cli           2.0.2      2020-02-28 [1] CRAN (R 4.0.0)
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 4.0.0)
#>  data.table  * 1.13.0     2020-07-24 [1] CRAN (R 4.0.2)
#>  desc          1.2.0      2018-05-01 [1] CRAN (R 4.0.0)
#>  devtools      2.3.0      2020-04-10 [1] CRAN (R 4.0.0)
#>  digest        0.6.25     2020-02-23 [1] CRAN (R 4.0.0)
#>  ellipsis      0.3.1      2020-05-15 [1] CRAN (R 4.0.2)
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 4.0.0)
#>  fansi         0.4.1      2020-01-08 [1] CRAN (R 4.0.0)
#>  fs            1.5.0      2020-07-31 [1] CRAN (R 4.0.2)
#>  glue          1.4.2      2020-08-27 [1] CRAN (R 4.0.2)
#>  highr         0.8        2019-03-20 [1] CRAN (R 4.0.0)
#>  htmltools     0.5.0      2020-06-16 [1] CRAN (R 4.0.0)
#>  knitr         1.29       2020-06-23 [1] CRAN (R 4.0.0)
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 4.0.0)
#>  memoise       1.1.0      2017-04-21 [1] CRAN (R 4.0.0)
#>  pkgbuild      1.0.7      2020-04-25 [1] CRAN (R 4.0.0)
#>  pkgload       1.0.2      2018-10-29 [1] CRAN (R 4.0.0)
#>  prettyunits   1.1.1      2020-01-24 [1] CRAN (R 4.0.0)
#>  processx      3.4.3      2020-07-05 [1] CRAN (R 4.0.2)
#>  ps            1.3.4      2020-08-11 [1] CRAN (R 4.0.2)
#>  R6            2.4.1      2019-11-12 [1] CRAN (R 4.0.0)
#>  remotes       2.2.0      2020-07-21 [1] CRAN (R 4.0.2)
#>  rlang         0.4.7      2020-07-09 [1] CRAN (R 4.0.2)
#>  rmarkdown     2.4        2020-09-30 [1] CRAN (R 4.0.2)
#>  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 4.0.0)
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.0.0)
#>  stringi       1.4.6      2020-02-17 [1] CRAN (R 4.0.0)
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.0.0)
#>  testthat      2.3.2.9000 2020-05-09 [1] local         
#>  usethis       1.6.1      2020-04-29 [1] CRAN (R 4.0.0)
#>  withr         2.2.0      2020-04-20 [1] CRAN (R 4.0.0)
#>  xfun          0.18       2020-09-29 [1] CRAN (R 4.0.2)
#>  yaml          2.2.1      2020-02-01 [1] CRAN (R 4.0.0)
#> 
#> [1] D:/app/R_lib/4.0
#> [2] D:/app/R-4.0.0/library

The text was updated successfully, but these errors were encountered:

jangorecki · 2020-11-04T09:27:49Z

Another expressions are >= / <=. It would help to clarify if it is the same for ==.
Does verbose say anything special? I recall we have a threshold in optimization of multiple subset conditions in i to prevent blowing out memory.

shrektan · 2020-11-04T09:39:10Z

No, == is very fast (I've included in the end of my example). Only >= or <= is slow.

jangorecki · 2020-11-04T10:18:17Z

No idea then. I also don't see any verbose output here. I was referring to this

data.table/R/data.table.R

Lines 2965 to 2969 in 70b6b13

    
           if(length(i) > 1L && prod(vapply_1i(i, length)) > 1e4){ 
        
             ## CJ would result in more than 1e4 rows. This would be inefficient, especially memory-wise #2635 
        
             if (verbose) {cat("Subsetting optimization disabled because the cross-product of RHS values exceeds 1e4, causing memory problems.\n");flush.console()} 
        
             return(NULL) 
        
           }

shrektan · 2020-11-04T14:37:36Z

Sorry, I forgot to paste the verbose output. data.table actually does optimization (using index) for %in% & == but not %in% & >=:

library(data.table)
n <- 1e6
utf8 <- "fa\u00e7ile"
latin1 <- iconv(utf8, from = "UTF-8", to = "latin1")
text <- sample(latin1, n, TRUE)
tbl <- data.table(LATIN1 = text, UTF8 = enc2utf8(text), NO = seq_len(n))

invisible(tbl[LATIN1 %in% utf8 & NO == 50, verbose=TRUE])
#> Creating new index 'NO__LATIN1'
#> Creating index NO__LATIN1 done in ... forder.c received 1000000 rows and 3 columns
#> forder took 0.039 sec
#> 0.062s elapsed (0.089s cpu) 
#> Optimized subsetting with index 'NO__LATIN1'
#> forder.c received 1 rows and 2 columns
#> forder took 0 sec
#> x is already ordered by these columns, no need to call reorder
#> Coercing double column i.NO (which contains no fractions) to type integer to match type of x.NO.
#> Assigning to all 1 rows
#> RHS_list_of_columns == false
#> RHS for item 1 has been duplicated because NAMED==2 MAYBE_SHARED==1, but then is being plonked. length(values)==1; length(cols)==1)
#> Assigning to all 1 rows
#> RHS_list_of_columns == false
#> RHS for item 1 has been duplicated because NAMED==2 MAYBE_SHARED==1, but then is being plonked. length(values)==1; length(cols)==1)
#> i.LATIN1 has same type (character) as x.LATIN1. No coercion needed.
#> on= matches existing index, using index
#> Starting bmerge ...
#> forder.c received 1 rows and 2 columns
#> bmerge done in 0.000s elapsed (0.000s cpu) 
#> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.001s cpu)

invisible(tbl[LATIN1 %in% utf8 & NO >= 50, verbose=TRUE])
# as you can see, this, verbose gives nothing.

^{Created on 2020-11-04 by the reprex package (v0.3.0)}

shrektan · 2020-11-04T16:54:51Z

Note, I've also filed an issue on https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17965 . In my opinion, even without the optimization, the base R %in% should not be that much slower for all of such cases, especially when the longer string is UTF-8 encoding and the shorter string is only a scalar.

franknarf1 · 2020-11-16T21:10:59Z

data.table actually does optimization (using index) for %in% & == but not %in% & >=

Fwiw, >= is still marked as incomplete here #1068

jangorecki added the performance label Nov 4, 2020

jangorecki changed the title ~~Is it possible to optimize the left_str %in right_str & other-cond query case?~~ optimize the left_str %in right_str & other-cond queries Nov 4, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

optimize the `left_str %in right_str & other-cond` queries #4799

optimize the `left_str %in right_str & other-cond` queries #4799

shrektan commented Nov 4, 2020 •

edited

Loading

jangorecki commented Nov 4, 2020 •

edited

Loading

shrektan commented Nov 4, 2020

jangorecki commented Nov 4, 2020

shrektan commented Nov 4, 2020

shrektan commented Nov 4, 2020

franknarf1 commented Nov 16, 2020

optimize the left_str %in right_str & other-cond queries #4799

optimize the left_str %in right_str & other-cond queries #4799

Comments

shrektan commented Nov 4, 2020 • edited Loading

jangorecki commented Nov 4, 2020 • edited Loading

shrektan commented Nov 4, 2020

jangorecki commented Nov 4, 2020

shrektan commented Nov 4, 2020

shrektan commented Nov 4, 2020

franknarf1 commented Nov 16, 2020

optimize the `left_str %in right_str & other-cond` queries #4799

optimize the `left_str %in right_str & other-cond` queries #4799

shrektan commented Nov 4, 2020 •

edited

Loading

jangorecki commented Nov 4, 2020 •

edited

Loading