Speed up logical indexing with multiple conditions #4105
You are not really doing a single index, but rather a single logical map (TRUE/FALSE); an index would be when you call `which()`.
In Python (datatable) we implemented short-circuiting semantics for `&`.
I think this is #2472 redux.
@st-pasha's idea of short-circuiting was my original thought. On further reflection, NAs could cause some confusion, since lazy evaluation of `&` would need to decide what to do when an operand is `NA`.
@ben519 The "NA" means "not available", aka a "missing" value. In other words, it is some kind of value, we just don't know which. Thus, it's a random variable, a Schrödinger's cat; not Galactus-devourer-of-worlds. As it happens, if the other operand of `&` is already `FALSE`, the hidden value of the `NA` doesn't matter: the result is `FALSE` either way (`FALSE & NA == FALSE`), so short-circuiting on `FALSE` remains correct.
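To make that concrete, here is a small base-R check (my own illustration, not from the thread) showing that short-circuiting on `FALSE` cannot change which rows a filter selects:

```r
# Three-valued logic of `&` in R: NA only survives when the other
# operand cannot decide the result on its own.
stopifnot(
  identical(FALSE & NA, FALSE),  # FALSE decides the result alone
  identical(TRUE  & NA, NA),     # TRUE does not: NA must propagate
  identical(NA    & NA, NA)
)

# Consequence for filtering: which() drops both FALSE and NA, so a
# short-circuit that stops at the first FALSE selects the same rows.
x <- c(TRUE, FALSE, NA)
y <- c(TRUE, TRUE, TRUE)
stopifnot(identical(which(x & y), 1L))
```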
A related StackOverflow post: https://stackoverflow.com/questions/58557831/performance-benefits-of-chaining-over-anding-when-filtering-a-data-table/58559134#58559134
To me it makes sense for base R to have an n-ary operator `&`(...); that would allow the short-circuit to happen at C level (currently short-circuiting happens, but only pairwise). I don't think the current language structure would allow `a & b & c` to be re-routed to this (evaluation is not "lazy enough", in the sense that the expression will always be parsed as two nested binary calls when it could be optimized to one), but if such an option existed we could potentially use NSE to do so ourselves.
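A rough sketch of that NSE idea, using a hypothetical helper `and_n()` (the name and interface are my invention, not an existing API). Note that each condition is still evaluated in full here, so only the combining step is narrowed; a C-level implementation could evaluate only the surviving rows:

```r
# Hypothetical n-ary AND: collapses a & b & c into one pass that
# narrows an integer index instead of reducing pairwise.
and_n <- function(...) {
  conds <- as.list(substitute(list(...)))[-1L]  # unevaluated conditions
  env <- parent.frame()
  ind <- which(eval(conds[[1L]], env))          # which() drops FALSE and NA
  for (cond in conds[-1L]) {
    if (!length(ind)) break                     # nothing left: stop early
    v <- eval(cond, env)
    ind <- ind[which(v[ind])]                   # keep only surviving rows
  }
  ind
}

a <- c(TRUE, TRUE, FALSE, TRUE)
b <- c(TRUE, FALSE, TRUE, TRUE)
d <- c(TRUE, TRUE, TRUE, NA)
stopifnot(identical(and_n(a, b, d), which(a & b & d)))
```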
I would rather add that short-circuit support to #3663; then we also don't have to handle it separately here.
I pushed forward #3663 and added short-circuit support. Benchmark:
library(data.table)
fwhich = data.table:::fwhich
set.seed(108)
N = 1e8L # 35 GB mem required, 20 threads used
foo = data.table(
x = as.character(runif(n = N)),
y = as.character(runif(n = N)),
z = as.character(runif(n = N))
)
invisible({foo[c(1L,N)]; foo[c(1L,N)]; foo[c(1L,N)]}) # warmup
## easy
system.time(foo[like(x, "123")][like(y, "123")][like(z, "123")])
# user system elapsed
# 34.491 2.200 36.693
system.time(foo[like(x, "123") & like(y, "123") & like(z, "123")])
# user system elapsed
#102.829 6.408 109.241
system.time(foo[fwhich(like(x, "123") & like(y, "123") & like(z, "123"))])
# user system elapsed
# 33.188 1.156 34.346
## hard
system.time(foo[like(x, "*")][like(y, "*")][like(z, "*")])
# user system elapsed
# 82.554 9.927 92.484
system.time(foo[like(x, "*") & like(y, "*") & like(z, "*")])
# user system elapsed
# 41.066 4.920 45.988
system.time(foo[fwhich(like(x, "*") & like(y, "*") & like(z, "*"))])
# user system elapsed
# 74.307 8.320 82.324
## !easy
system.time(foo[!like(x, "123")][!like(y, "123")][!like(z, "123")])
# user system elapsed
#151.403 15.151 166.561
system.time(foo[!like(x, "123") & !like(y, "123") & !like(z, "123")])
# user system elapsed
#109.646 9.911 119.562
system.time(foo[fwhich(!like(x, "123") & !like(y, "123") & !like(z, "123"))])
# user system elapsed
#142.395 13.652 155.758
## !hard
system.time(foo[!like(x, "*")][!like(y, "*")][!like(z, "*")])
# user system elapsed
# 10.400 0.812 11.213
system.time(foo[!like(x, "*") & !like(y, "*") & !like(z, "*")])
# user system elapsed
# 34.604 2.864 37.470
system.time(foo[fwhich(!like(x, "*") & !like(y, "*") & !like(z, "*"))])
# user system elapsed
# 10.689 0.264 10.953
Interesting results. Any idea why `fwhich` is slower than plain `&` in the `hard` and `!easy` cases?
Very good question. These are the cases where most (or all) of the rows are returned by each filter. The ordinary way materializes 3 logical vectors of length `nrow` (reducing with `&` creates 2 extra logical vectors). `fwhich`, on the other hand, does not benefit much from short-circuiting because the number of rows returned is close (or equal) to `nrow`. Your approach suffers in exactly the same way. Because `fwhich(like(...))` uses the R C API `eval` of `grep` (`grep` is just what the `like` function calls), it has to return 1-based indices (rather than the 0-based indices a pure C call could use). These indices are needed so they can mix with operators that are well optimized and don't go through R's `eval`, allowing both kinds to be combined in one filter expression.
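The selectivity effect can be sketched in plain R (a toy illustration of the idea, not `fwhich` itself): narrowing an integer index only saves work when the first condition is selective; when nearly every row survives, the index is about as long as the data, so later passes still touch almost everything:

```r
set.seed(1)
n <- 1e5L
x <- runif(n); y <- runif(n)

# Selective first filter: the index shrinks to ~1% immediately,
# so the second comparison touches ~1e3 elements.
ind <- which(x < 0.01)
ind <- ind[y[ind] < 0.5]

# Non-selective first filter: ~99% of rows survive, so the index
# is nearly as long as the data and little work is saved.
ind2 <- which(x < 0.99)
ind2 <- ind2[y[ind2] < 0.5]

stopifnot(identical(ind,  which(x < 0.01 & y < 0.5)),
          identical(ind2, which(x < 0.99 & y < 0.5)))
```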
Can't say I fully understand, but I decided to test this for myself using Rcpp and, sure enough, lazy evaluation is (well, can be) slower than eager evaluation.
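A crude base-R analogue of that observation (my own sketch, not the Rcpp code the comment refers to): element-by-element short-circuiting with `&&` does fewer comparisons, but the per-element overhead of a non-vectorized loop can easily outweigh what eager vectorized evaluation costs:

```r
set.seed(7)
n <- 1e5L
a <- runif(n); b <- runif(n)

# Eager: both vectors are compared in full, then combined.
eager <- (a < 0.5) & (b < 0.5)

# "Lazy": per-element short-circuit via &&, skipping b when a fails;
# the interpreted loop typically dominates the saved comparisons.
lazy <- vapply(seq_len(n), function(i) a[i] < 0.5 && b[i] < 0.5, logical(1L))

stopifnot(identical(eager, lazy))
```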
This issue actually asks for a general optimization of compound AND filters; the example below uses `==` and `%in%` rather than `like`:
library(data.table)
options(datatable.auto.index=FALSE)
fwhich = data.table:::fwhich
sample_all = function(unq_n, size) {
stopifnot(unq_n <= size)
unq_sub = seq_len(unq_n)
sample(c(unq_sub, sample(unq_sub, size=max(size-unq_n, 0), replace=TRUE)))
}
set.seed(108)
N = 1e8L # timings for 1e8L, 20th, 6 GB mem required
foo = data.table(
x = sample_all(N/10L, N),
y = sample_all(N/10L, N),
z = sample_all(N/10L, N)
)
invisible({foo[c(1L,N)]; foo[c(1L,N)]; foo[c(1L,N)]}) # warmup
if (N==1e6L) { #foo[.N/2L]
s1 = 80332L; s2 = 8563L; s3 = 63039L
} else if (N==1e8L) {
s1 = 7065182L; s2 = 7013931L; s3 = 8875689L
}
mid = as.integer(seq(as.integer(N*0.05), as.integer(N*0.95))) # mid ~0.9 obs
## easy
system.time(foo[x==s1][y==s2][z==s3])
# user system elapsed
# 0.429 0.157 0.463
system.time(foo[x==s1 & y==s2 & z==s3])
# user system elapsed
# 1.051 0.765 1.734
system.time(foo[fwhich(x==s1 & y==s2 & z==s3)])
# user system elapsed
# 0.071 0.000 0.071
## hard
system.time(foo[x%in%mid][y%in%mid][z%in%mid])
# user system elapsed
# 36.101 7.376 38.265
system.time(foo[x%in%mid & y%in%mid & z%in%mid])
# user system elapsed
# 25.043 4.376 28.430
#system.time(foo[fwhich(x%in%mid & y%in%mid & z%in%mid)])
# TOO SLOW!
## !easy
system.time(foo[x!=s1][y!=s2][z!=s3])
# user system elapsed
# 4.023 5.463 3.308
system.time(foo[x!=s1 & y!=s2 & z!=s3])
# user system elapsed
# 2.054 2.591 2.551
system.time(foo[fwhich(x!=s1 & y!=s2 & z!=s3)])
# user system elapsed
# 1.736 1.674 1.144
## !hard
system.time(foo[!x%in%mid][!y%in%mid][!z%in%mid])
# user system elapsed
# 35.832 7.684 39.046
system.time(foo[!x%in%mid & !y%in%mid & !z%in%mid])
# user system elapsed
# 25.540 5.131 29.200
system.time(foo[fwhich(!x%in%mid & !y%in%mid & !z%in%mid)])
# user system elapsed
# 1.728 1.651 1.192
Code is in the https://github.com/Rdatatable/data.table/tree/fwhich branch.
We could include a counter on the number of times the condition is met, and use it to decide which strategy to apply.
Better to just address this single slow case and eventually have it plugged in, so users don't need to know to call `fwhich` themselves.
I like not having users need to know to call `fwhich`. FWIW, this is what I think the automatic optimization should aim for.
Haven't yet got a precise idea for that, but there are some potential ways I would explore. Other ideas are very welcome.
I like the idea of setting a threshold. There may be some benefit in having this interface visible to users as well.
Regarding point 1, it may look something like this in R:
library(data.table)
dt = data.table(a = sample(50, 1e6, TRUE),
b = sample(50, 1e6, TRUE),
c = sample(50, 1e6, TRUE))
lgl_ind = dt[['a']] == 5L
if (sum(lgl_ind) <= 0.1 * nrow(dt)) {
  # first condition is selective: switch to integer indices so the
  # remaining conditions only touch surviving rows
  ind = which(lgl_ind)
  for (col in c('b','c')) {
    ind = ind[dt[[col]][ind] == 5L]
  }
  res = dt[ind]
} else {
  # first condition is not selective: fall back to plain logical AND
  lgl_ind = lgl_ind & dt[['b']] == 5L & dt[['c']] == 5L
  res = dt[lgl_ind]
}
res
Have a look at this example
As shown, I get a big speedup when I use successive logical conditions rather than AND'ing them together into a single index. This implies that data.table checks each condition for every row rather than evaluating conditions lazily. Is this by design? This seems like an opportunity for a nice speedup.
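A minimal reproduction of the comparison being described (column names, sizes and constants are assumptions for illustration; auto-indexing is disabled so it doesn't interfere):

```r
library(data.table)
options(datatable.auto.index = FALSE)  # keep auto-indexing out of the comparison

set.seed(42)
dt <- data.table(a = sample(100L, 1e7L, TRUE),
                 b = sample(100L, 1e7L, TRUE))

# One compound condition: both logical vectors are materialized in full.
print(system.time(r_and <- dt[a == 1L & b == 1L]))

# Successive filters: the second comparison only sees the ~1% of rows
# that passed the first.
print(system.time(r_chain <- dt[a == 1L][b == 1L]))

stopifnot(isTRUE(all.equal(r_and, r_chain)))
```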