Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize the left_str %in right_str & other-cond queries #4799

Open
shrektan opened this issue Nov 4, 2020 · 6 comments
Open

optimize the left_str %in right_str & other-cond queries #4799

shrektan opened this issue Nov 4, 2020 · 6 comments

Comments

@shrektan
Copy link
Member

shrektan commented Nov 4, 2020

On Windows (but my example should be able to be reproducible on other OS), DT seems optimized the left_str %in right_str query so that it's very fast even when the left_str and right_str are in different encodings. However, this is not true when left_str %in right_str query is used with other conditions, I mean, like left_str %in right_str & A>=B.

Would be good (if it's straightforward and not too difficult) to support this. A real-case is that I found a script executed very slow... and after one-hour's debugging, I finally located the root cause... As you can imagine (once you read the below example), it's very difficult and confusing to understand the problem in the 1st place.

Thanks.

library(data.table)
n <- 1e6
utf8 <- "fa\u00e7ile"
latin1 <- iconv(utf8, from = "UTF-8", to = "latin1")
text <- sample(latin1, n, TRUE)
tbl <- data.table(LATIN1 = text, UTF8 = enc2utf8(text), NO = seq_len(n))

# all very fast
system.time({
  tbl[LATIN1 %in% latin1]
})
#>    user  system elapsed 
#>    0.03    0.00    0.03
system.time({
  tbl[UTF8 %in% latin1]
})
#>    user  system elapsed 
#>    0.05    0.00    0.03
system.time({
  tbl[LATIN1 %in% utf8]
})
#>    user  system elapsed 
#>       0       0       0
system.time({
  tbl[UTF8 %in% utf8]
})
#>    user  system elapsed 
#>    0.02    0.00    0.02

# With another expession, it's very slow, is it possible to improve?
system.time({
  tbl[LATIN1 %in% latin1 & NO >= 50 & NO <= 100]
})
#>    user  system elapsed 
#>    5.34    0.00    5.36
system.time({
  tbl[UTF8 %in% latin1 & NO >= 50 & NO <= 100]
})
#>    user  system elapsed 
#>    5.31    0.03    5.35
system.time({
  tbl[LATIN1 %in% utf8 & NO >= 50 & NO <= 100]
})
#>    user  system elapsed 
#>   10.51    0.00   10.54
system.time({
  tbl[UTF8 %in% utf8 & NO >= 50 & NO <= 100]
})
#>    user  system elapsed 
#>    0.04    0.00    0.03

# interestingly, use `==` is not slow

system.time({
  tbl[LATIN1 %in% latin1 & NO == 100]
})
#>    user  system elapsed 
#>    0.01    0.00    0.02
system.time({
  tbl[UTF8 %in% latin1 & NO == 100]
})
#>    user  system elapsed 
#>       0       0       0
system.time({
  tbl[LATIN1 %in% utf8 & NO == 100]
})
#>    user  system elapsed 
#>       0       0       0
system.time({
  tbl[UTF8 %in% utf8 & NO == 100]
})
#>    user  system elapsed 
#>       0       0       0

Created on 2020-11-04 by the reprex package (v0.3.0)

Session info
devtools::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                         
#>  version  R version 4.0.2 (2020-06-22)  
#>  os       Windows 10 x64                
#>  system   x86_64, mingw32               
#>  ui       RTerm                         
#>  language en                            
#>  collate  Chinese (Simplified)_China.936
#>  ctype    Chinese (Simplified)_China.936
#>  tz       Asia/Taipei                   
#>  date     2020-11-04                    
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version    date       lib source        
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.0.0)
#>  backports     1.1.7      2020-05-13 [1] CRAN (R 4.0.0)
#>  callr         3.4.3      2020-03-28 [1] CRAN (R 4.0.0)
#>  cli           2.0.2      2020-02-28 [1] CRAN (R 4.0.0)
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 4.0.0)
#>  data.table  * 1.13.0     2020-07-24 [1] CRAN (R 4.0.2)
#>  desc          1.2.0      2018-05-01 [1] CRAN (R 4.0.0)
#>  devtools      2.3.0      2020-04-10 [1] CRAN (R 4.0.0)
#>  digest        0.6.25     2020-02-23 [1] CRAN (R 4.0.0)
#>  ellipsis      0.3.1      2020-05-15 [1] CRAN (R 4.0.2)
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 4.0.0)
#>  fansi         0.4.1      2020-01-08 [1] CRAN (R 4.0.0)
#>  fs            1.5.0      2020-07-31 [1] CRAN (R 4.0.2)
#>  glue          1.4.2      2020-08-27 [1] CRAN (R 4.0.2)
#>  highr         0.8        2019-03-20 [1] CRAN (R 4.0.0)
#>  htmltools     0.5.0      2020-06-16 [1] CRAN (R 4.0.0)
#>  knitr         1.29       2020-06-23 [1] CRAN (R 4.0.0)
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 4.0.0)
#>  memoise       1.1.0      2017-04-21 [1] CRAN (R 4.0.0)
#>  pkgbuild      1.0.7      2020-04-25 [1] CRAN (R 4.0.0)
#>  pkgload       1.0.2      2018-10-29 [1] CRAN (R 4.0.0)
#>  prettyunits   1.1.1      2020-01-24 [1] CRAN (R 4.0.0)
#>  processx      3.4.3      2020-07-05 [1] CRAN (R 4.0.2)
#>  ps            1.3.4      2020-08-11 [1] CRAN (R 4.0.2)
#>  R6            2.4.1      2019-11-12 [1] CRAN (R 4.0.0)
#>  remotes       2.2.0      2020-07-21 [1] CRAN (R 4.0.2)
#>  rlang         0.4.7      2020-07-09 [1] CRAN (R 4.0.2)
#>  rmarkdown     2.4        2020-09-30 [1] CRAN (R 4.0.2)
#>  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 4.0.0)
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.0.0)
#>  stringi       1.4.6      2020-02-17 [1] CRAN (R 4.0.0)
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.0.0)
#>  testthat      2.3.2.9000 2020-05-09 [1] local         
#>  usethis       1.6.1      2020-04-29 [1] CRAN (R 4.0.0)
#>  withr         2.2.0      2020-04-20 [1] CRAN (R 4.0.0)
#>  xfun          0.18       2020-09-29 [1] CRAN (R 4.0.2)
#>  yaml          2.2.1      2020-02-01 [1] CRAN (R 4.0.0)
#> 
#> [1] D:/app/R_lib/4.0
#> [2] D:/app/R-4.0.0/library
@jangorecki
Copy link
Member

jangorecki commented Nov 4, 2020

Another expressions are >= / <=. It would help to clarify if it is the same for ==.
Does verbose say anything special? I recall we have a threshold in optimization of multiple subset conditions in i to prevent blowing out memory.

@jangorecki jangorecki changed the title Is it possible to optimize the left_str %in right_str & other-cond query case? optimize the left_str %in right_str & other-cond queries Nov 4, 2020
@shrektan
Copy link
Member Author

shrektan commented Nov 4, 2020

No, == is very fast (I've included in the end of my example). Only >= or <= is slow.

@jangorecki
Copy link
Member

No idea then. I also don't see any verbose output here. I was referring to this

data.table/R/data.table.R

Lines 2965 to 2969 in 70b6b13

if(length(i) > 1L && prod(vapply_1i(i, length)) > 1e4){
## CJ would result in more than 1e4 rows. This would be inefficient, especially memory-wise #2635
if (verbose) {cat("Subsetting optimization disabled because the cross-product of RHS values exceeds 1e4, causing memory problems.\n");flush.console()}
return(NULL)
}

@shrektan
Copy link
Member Author

shrektan commented Nov 4, 2020

Sorry, I forgot to paste the verbose output. data.table actually does optimization (using index) for %in% & == but not %in% & >=:

library(data.table)
n <- 1e6
utf8 <- "fa\u00e7ile"
latin1 <- iconv(utf8, from = "UTF-8", to = "latin1")
text <- sample(latin1, n, TRUE)
tbl <- data.table(LATIN1 = text, UTF8 = enc2utf8(text), NO = seq_len(n))

invisible(tbl[LATIN1 %in% utf8 & NO == 50, verbose=TRUE])
#> Creating new index 'NO__LATIN1'
#> Creating index NO__LATIN1 done in ... forder.c received 1000000 rows and 3 columns
#> forder took 0.039 sec
#> 0.062s elapsed (0.089s cpu) 
#> Optimized subsetting with index 'NO__LATIN1'
#> forder.c received 1 rows and 2 columns
#> forder took 0 sec
#> x is already ordered by these columns, no need to call reorder
#> Coercing double column i.NO (which contains no fractions) to type integer to match type of x.NO.
#> Assigning to all 1 rows
#> RHS_list_of_columns == false
#> RHS for item 1 has been duplicated because NAMED==2 MAYBE_SHARED==1, but then is being plonked. length(values)==1; length(cols)==1)
#> Assigning to all 1 rows
#> RHS_list_of_columns == false
#> RHS for item 1 has been duplicated because NAMED==2 MAYBE_SHARED==1, but then is being plonked. length(values)==1; length(cols)==1)
#> i.LATIN1 has same type (character) as x.LATIN1. No coercion needed.
#> on= matches existing index, using index
#> Starting bmerge ...
#> forder.c received 1 rows and 2 columns
#> bmerge done in 0.000s elapsed (0.000s cpu) 
#> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.001s cpu)

invisible(tbl[LATIN1 %in% utf8 & NO >= 50, verbose=TRUE])
# as you can see, this, verbose gives nothing.

Created on 2020-11-04 by the reprex package (v0.3.0)

@shrektan
Copy link
Member Author

shrektan commented Nov 4, 2020

Note, I've also filed an issue on https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17965 . In my opinion, even without the optimization, the base R %in% should not be that much slower for all of such cases, especially when the longer string is UTF-8 encoding and the shorter string is only a scalar.

@franknarf1
Copy link
Contributor

data.table actually does optimization (using index) for %in% & == but not %in% & >=

Fwiw, >= is still marked as incomplete here #1068

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants