optimize the left_str %in right_str & other-cond queries #4799
Comments
Another expressions are |
left_str %in right_str & other-cond
query case?left_str %in right_str & other-cond
queries
No, |
No idea then. I also don't see any verbose output here. I was referring to lines 2965 to 2969 in 70b6b13.
Sorry, I forgot to paste the verbose output. data.table does apply the optimization (using an index) in some cases, as the verbose output of the first query below shows:
library(data.table)
n <- 1e6
utf8 <- "fa\u00e7ile"
latin1 <- iconv(utf8, from = "UTF-8", to = "latin1")
text <- sample(latin1, n, TRUE)
tbl <- data.table(LATIN1 = text, UTF8 = enc2utf8(text), NO = seq_len(n))
invisible(tbl[LATIN1 %in% utf8 & NO == 50, verbose=TRUE])
#> Creating new index 'NO__LATIN1'
#> Creating index NO__LATIN1 done in ... forder.c received 1000000 rows and 3 columns
#> forder took 0.039 sec
#> 0.062s elapsed (0.089s cpu)
#> Optimized subsetting with index 'NO__LATIN1'
#> forder.c received 1 rows and 2 columns
#> forder took 0 sec
#> x is already ordered by these columns, no need to call reorder
#> Coercing double column i.NO (which contains no fractions) to type integer to match type of x.NO.
#> Assigning to all 1 rows
#> RHS_list_of_columns == false
#> RHS for item 1 has been duplicated because NAMED==2 MAYBE_SHARED==1, but then is being plonked. length(values)==1; length(cols)==1)
#> Assigning to all 1 rows
#> RHS_list_of_columns == false
#> RHS for item 1 has been duplicated because NAMED==2 MAYBE_SHARED==1, but then is being plonked. length(values)==1; length(cols)==1)
#> i.LATIN1 has same type (character) as x.LATIN1. No coercion needed.
#> on= matches existing index, using index
#> Starting bmerge ...
#> forder.c received 1 rows and 2 columns
#> bmerge done in 0.000s elapsed (0.000s cpu)
#> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.001s cpu)
invisible(tbl[LATIN1 %in% utf8 & NO >= 50, verbose=TRUE])
# As you can see, verbose prints nothing for this query.
Created on 2020-11-04 by the reprex package (v0.3.0)
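As a possible stopgap until the combined form is optimized, the two conditions can be chained so that the string condition runs on its own, which is the form reported as fast in this thread. This is only a sketch: the smaller table and the names combined and chained are made up for illustration, and whether the chained form is actually faster on a given setup should be checked with system.time().

```r
library(data.table)

n <- 1e5
utf8 <- "fa\u00e7ile"
latin1 <- iconv(utf8, from = "UTF-8", to = "latin1")
tbl <- data.table(LATIN1 = rep(latin1, n), NO = seq_len(n))

# Combined condition: the form reported as slow in this thread.
combined <- tbl[LATIN1 %in% utf8 & NO >= 50]

# Chained form: filter on the string column alone first,
# then apply the numeric condition to the (smaller) result.
chained <- tbl[LATIN1 %in% utf8][NO >= 50]

# Both forms are expected to select the same rows.
identical(combined$NO, chained$NO)
```

Chaining creates an intermediate table, so it trades some memory for letting the first subset run by itself.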
Note, I've also filed an issue at https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17965. In my opinion, even without the optimization, the base R
Fwiw, |
On Windows (but my example should be reproducible on other OSes), data.table seems to optimize the left_str %in right_str query so that it is very fast even when left_str and right_str are in different encodings. However, this is not the case when the left_str %in right_str query is combined with other conditions, for example left_str %in right_str & A >= B.
It would be good (if it's straightforward and not too difficult) to support this. A real case: I found a script that ran very slowly, and after an hour of debugging I finally located the root cause. As you can imagine (once you read the example below), the problem is very difficult and confusing to understand at first.
Thanks.
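Another way to sidestep the mixed-encoding comparison, sketched here under the assumption that the slowdown comes from the two sides being in different encodings, is to convert the column to UTF-8 once and filter on the converted column. The column name UTF8 and the object res are illustrative only.

```r
library(data.table)

n <- 1e5
utf8 <- "fa\u00e7ile"
latin1 <- iconv(utf8, from = "UTF-8", to = "latin1")
tbl <- data.table(LATIN1 = rep(latin1, n), NO = seq_len(n))

# Convert once, by reference, so both sides of %in% share an encoding.
tbl[, UTF8 := enc2utf8(LATIN1)]

# The combined condition now compares UTF-8 with UTF-8.
res <- tbl[UTF8 %in% utf8 & NO >= 50]
nrow(res)
```

The one-off conversion costs a pass over the column, which may pay off when the same column is filtered repeatedly.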
Created on 2020-11-04 by the reprex package (v0.3.0)