
GForce optimisation could be more smart #3815

Open
jangorecki opened this issue Sep 3, 2019 · 5 comments
Labels
GForce (issues relating to optimized grouping calculations), performance, top request (one of our most-requested issues)

Comments

@jangorecki
Member

jangorecki commented Sep 3, 2019

d = data.table(x=1:4, y=1:2)
options(datatable.verbose=TRUE)
d[, j=.(min(x)), by=y]                       ## GForce TRUE
d[, j=.(min(x), mean(x)/min(x)), by=y]       ## GForce FALSE
d[, j={.(min(x))}, by=y]                     ## GForce TRUE
d[, j={x<-x; .(min(x))}, by=y]               ## GForce FALSE
d[, j={mn<-min(x); .(mn, mean(x)/mn)}, by=y] ## GForce FALSE
options(datatable.verbose=FALSE)

All of those cases could be optimised.
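
For illustration (my addition, not part of the original report; the column names mn, mu and ratio are just placeholders), the second case can already be made GForce-eligible by hand: compute the plain aggregates in one grouped step and derive the compound expression in a chained step.

d[, .(mn = min(x), mu = mean(x)), by=y   ## both aggregates are GForce-optimised
  ][, .(y, mn, ratio = mu/mn)]           ## per-group arithmetic in a chained step

Ideally the optimiser would perform an equivalent rewrite internally, so that the single-step forms above get the same treatment.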

@jangorecki jangorecki added performance GForce issues relating to optimized grouping calculations (GForce) labels Sep 3, 2019
@jangorecki jangorecki added this to the 1.13.0 milestone Sep 20, 2019
@mattdowle mattdowle modified the milestones: 1.12.7, 1.12.9 Dec 8, 2019
@mattdowle mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020
@jangorecki jangorecki modified the milestones: 1.14.3, 1.14.5 Jul 19, 2022
@ben-schwen
Member

ben-schwen commented Apr 15, 2023

I think another case of this also appears in the db-benchmark (now resurrected by duckdblabs at https://github.com/duckdblabs/db-benchmark) as Q2 of the advanced group-by questions.

The query DT[, .(range_v1_v2=max(v1, na.rm=TRUE)-min(v2, na.rm=TRUE)), by=id3] could be rewritten using chaining, as in the gforced expression below.

@jangorecki, is rewriting the query something that is viable in the benchmarks, or would it lead them in the wrong direction?

With current master this gives a good speedup:

library(data.table)
n = 1e7
DT=data.table(v1=rnorm(n), v2=rnorm(n), id3=sample(1e4, n, TRUE))
bench::mark(
  now = DT[, .(range_v1_v2=max(v1, na.rm=TRUE)-min(v2, na.rm=TRUE)), by=id3],
  gforced = DT[, .(max(v1, na.rm=TRUE), min(v2, na.rm=TRUE)), id3][, .(range_v1_v2=V1-V2), by=id3]
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 now           316ms    328ms      3.05    40.3MB     0   
#> 2 gforced       126ms    135ms      6.57     268MB     8.22

@jangorecki
Member Author

I would keep it as it is; the benchmark also compares query syntax.

@ColeMiller1
Contributor

@ben-schwen somewhat related to your note, #4412 has some aspects of this but is more focused on evaluating logical expressions. From the PR (note the three GForce-optimized functions combined via |):

bench::mark(
  new_hav = dt[,
               j = .SD,
               by = x,
               having = .N < 2L | sum(y) > 11 | median(y) < 0.7
               ],
  use_I = dt[dt[, .I[.N < 2L | sum(y) > 11 | median(y) < 0.7], by = x]$V1],
  time_unit = 's'
)

##   expression    min median `itr/sec` mem_alloc
##   <bch:expr>  <dbl>  <dbl>     <dbl> <bch:byt>
## 1 new_hav    0.0102 0.0112    81.4      6.97MB
## 2 use_I      1.21   1.21       0.829    3.21MB

I believe I did some research and got your query to work with GForce using methods similar to those in the PR. I'd have to review my other branches, but it is doable to optimize this without changing the way the query looks. As your benchmark notes, this route is much more memory intensive but generally faster.

As an aside, should this issue be closed? I am not sure how to optimize {min_x = min(x); .(min_x, min_x/mean(x))}, and there are other very similar issues about optimizing expressions for GForce.

@jangorecki jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023
@jangorecki jangorecki removed this from the 1.16.0 milestone Nov 6, 2023
@clerousset

Can I add the following case to this topic?

d = data.table(x=1:4, y=1:2)
options(datatable.verbose=TRUE)
  d[, head(x, 1), by=y]        ## GForce TRUE
  d[, utils::head(x, 1), by=y] ## GForce FALSE

As a simple user, we are sometimes told in R style guidelines (maybe wrongly) to always prefix functions with their package, even for base packages like utils.

(behaviour checked for data.table 6f43a96)

@MichaelChirico
Member

> Can I add the following case to this topic?
>
> d = data.table(x=1:4, y=1:2)
> options(datatable.verbose=TRUE)
>   d[, head(x, 1), by=y]        ## GForce TRUE
>   d[, utils::head(x, 1), by=y] ## GForce FALSE
>
> As a simple user, we are sometimes told in R style guidelines (maybe wrongly) to always prefix functions with their package, even for base packages like utils.
>
> (behaviour checked for data.table 6f43a96)

That will never be supported. Authors might need to invoke the "real" head to get S3 dispatch, for example; GForce's head completely skips method dispatch, of course.

My snarky opinion is that the authors telling you to "always qualify the namespace" should stick to writing C++ if that's what they want so badly 🙃
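
To make the dispatch point concrete, here is a minimal sketch (my own example, not from this thread; the revclass class and its head method are invented purely for illustration) of code that genuinely relies on head()'s S3 dispatch, which GForce's internal head would bypass:

head.revclass <- function(x, n = 6L, ...) rev(unclass(x))[seq_len(n)]  ## invented S3 method
z <- structure(1:4, class = "revclass")
utils::head(z, 1)  ## dispatches to head.revclass and returns 4
## GForce replaces head() in j with its internal routine, which works on the
## plain column values and never consults methods like the one above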
