Faster arrange() #364

Closed
markfairbanks opened this issue Jun 22, 2022 · 1 comment · Fixed by #374

Comments


markfairbanks commented Jun 22, 2022

arrange() could use data.table's setorder() instead of order() to sort faster.

The benefit would be larger in a longer pipe chain where an earlier step already makes an implicit copy, since we wouldn't need to call copy() explicitly.
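
A minimal sketch of why the copy matters, assuming only data.table itself (the table and variable names here are made up for illustration): setorder() reorders by reference, so a standalone sort has to copy the input first, while a step that already returns a new table can be sorted in place.

```r
library(data.table)

df <- data.table(a = c("b", "a", "c"), b = c(2, 1, 3))

# setorder() sorts by reference, so calling it on the user's table would
# modify that table. A standalone sort therefore needs an explicit copy().
sorted <- setorder(copy(df), a, b)

# A step such as df[, .(a, b)] already returns a new table (the implicit
# copy), so it can be sorted in place with no extra copy().
selected_sorted <- setorder(df[, .(a, b)], a, b)

# In both cases the original row order of df is left alone.
print(df$a)
print(sorted$a)
print(selected_sorted$a)
```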

Note: When arrange() is given a transmute-style expression rather than a bare column name, we would still need to use order().
Ex: arrange(df, x - 1)
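
One way this split could look is sketched below. `arrange_sketch()` is a hypothetical helper written for this comment, not tidytable's actual implementation; it assumes ascending order only and does not handle desc() or an empty argument list. Bare column names go through setorderv() on a copy, and anything else falls back to order().

```r
library(data.table)

# Hypothetical helper for illustration only (not tidytable's arrange()).
arrange_sketch <- function(.df, ...) {
  # Capture the unevaluated sorting expressions
  dots <- as.list(substitute(list(...)))[-1]

  # TRUE only when every argument is a bare symbol such as `a` or `b`
  all_bare_columns <- all(vapply(dots, is.name, logical(1)))

  if (all_bare_columns) {
    # Fast path: sort a copy by reference using the column names
    cols <- vapply(dots, as.character, character(1))
    out <- copy(.df)
    setorderv(out, cols)
    out
  } else {
    # Fallback: expressions such as `x - 1` are not columns, so compute the
    # row order with order() evaluated against the table's columns
    order_call <- as.call(c(list(quote(order)), dots))
    row_order <- eval(order_call, envir = .df, enclos = parent.frame())
    .df[row_order]
  }
}

df <- data.table(x = c(3, 1, 2), y = c("c", "a", "b"))
by_column <- arrange_sketch(df, x)          # takes the setorderv() path
by_expression <- arrange_sketch(df, x - 1)  # takes the order() path
```

The copy() in the fast path is what a real implementation could skip when an earlier step in the chain has already produced a new table.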

library(stringi)
library(data.table)

rand_strings <- stri_rand_strings(100, 4)

bench::press(
  data_size = c(10000, 1000000, 10000000),
  {
    df <- data.table(a = sample(rand_strings, data_size, TRUE),
                     b = round(runif(data_size), 4))
    bench::mark(
      order = df[order(a, b)],
      setorder = setorder(copy(df), a, b),
      check = FALSE, iterations = 30
    )
  }
)
#> Running with:
#>   data_size
#> 1     10000
#> 2   1000000
#> 3  10000000
#> # A tibble: 6 × 7
#>   expression data_size      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>     <dbl> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 order          10000   1.06ms   1.38ms    703.      2.01MB     0   
#> 2 setorder       10000   1.01ms   1.23ms    800.    371.95KB     0   
#> 3 order        1000000  56.75ms  66.49ms     15.0    19.09MB     9.97
#> 4 setorder     1000000  45.15ms   51.5ms     19.0    27.67MB    34.9 
#> 5 order       10000000 676.79ms 690.23ms      1.43  190.75MB     4.91
#> 6 setorder    10000000 560.98ms 648.71ms      1.55  276.58MB     3.88

# With implicit copy (from filter/slice/select/summarize)
bench::press(
  data_size = c(10000, 1000000, 10000000),
  {
    df <- data.table(a = sample(rand_strings, data_size, TRUE),
                     b = round(runif(data_size), 4))
    bench::mark(
      order = df[, .(a, b)][order(a, b)],
      setorder = setorder(df[, .(a, b)], a, b),
      check = FALSE, iterations = 30
    )
  }
)
#> Running with:
#>   data_size
#> 1     10000
#> 2   1000000
#> 3  10000000
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 6 × 7
#>   expression data_size      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>     <dbl> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 order          10000   1.65ms   1.97ms    488.     756.3KB     0   
#> 2 setorder       10000   1.35ms   1.58ms    622.     488.4KB    21.4 
#> 3 order        1000000   73.8ms  80.88ms     11.7     49.7MB     1.30
#> 4 setorder     1000000   59.8ms  67.33ms     14.7       43MB     2.26
#> 5 order       10000000 766.43ms 914.88ms      1.09     496MB     1.71
#> 6 setorder    10000000 628.64ms 733.18ms      1.35   429.2MB     1.67

Created on 2022-06-22 by the reprex package (v2.0.1)


markfairbanks commented Jun 22, 2022

It might be a good idea to wait until #210 is fixed, though.

Edit: Needs #366
