Rewrite ngram implementation with an eye for cpp11 performance #1

Merged
merged 1 commit into from
Nov 11, 2022

DavisVaughan
Contributor

@DavisVaughan DavisVaughan commented Nov 11, 2022

I ended up doing a full rewrite so you can see how I fill the output vector directly on each ngram iteration rather than creating and returning a temporary vector each time, and so you can see the style I use for C++.

It is probably "good enough" for now, but will get better when r-lib/cpp11#299 is fixed.

Knowing when to unwind_protect() and when to avoid expensive (but safe) cpp11 wrappers does make this pretty challenging to get right.
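
To illustrate the difference between returning a temporary vector per pass and filling the output directly, here is a minimal sketch in plain standard C++. It is not the code from this PR: it uses `std::vector<std::string>` instead of cpp11 or R types, the names `ngram_count`, `ngram_single`, `ngram_temp`, and `ngram_fill` are hypothetical, and the output order (all n-grams of size `n_min` first, then the next size, up to `n`) is an assumption. It also assumes `n_min >= 1`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using tokens = std::vector<std::string>;

// Total number of n-grams of sizes n_min..n for a document with `len` tokens.
// Assumes n_min >= 1 (hypothetical helper, not part of any library).
std::size_t ngram_count(std::size_t len, std::size_t n, std::size_t n_min) {
  std::size_t count = 0;
  for (std::size_t k = n_min; k <= n; ++k) {
    if (k <= len) count += len - k + 1;
  }
  return count;
}

// Temporary-vector style: build and return a fresh vector for one n-gram size.
tokens ngram_single(const tokens& x, std::size_t k, const std::string& delim) {
  tokens out;
  for (std::size_t i = 0; i + k <= x.size(); ++i) {
    std::string gram = x[i];
    for (std::size_t j = 1; j < k; ++j) {
      gram += delim;
      gram += x[i + j];
    }
    out.push_back(gram);
  }
  return out;
}

// The caller then copies each temporary into the final result.
tokens ngram_temp(const tokens& x, std::size_t n, std::size_t n_min,
                  const std::string& delim) {
  tokens out;
  for (std::size_t k = n_min; k <= n; ++k) {
    tokens piece = ngram_single(x, k, delim);
    out.insert(out.end(), piece.begin(), piece.end());
  }
  return out;
}

// Direct-fill style: size the output once, then write each n-gram into its slot.
tokens ngram_fill(const tokens& x, std::size_t n, std::size_t n_min,
                  const std::string& delim) {
  tokens out(ngram_count(x.size(), n, n_min));
  std::size_t loc = 0;  // next free slot in `out`
  for (std::size_t k = n_min; k <= n; ++k) {
    for (std::size_t i = 0; i + k <= x.size(); ++i) {
      std::string& gram = out[loc++];
      gram = x[i];
      for (std::size_t j = 1; j < k; ++j) {
        gram += delim;
        gram += x[i + j];
      }
    }
  }
  assert(loc == out.size());  // every preallocated slot was written
  return out;
}
```

For the tokens `a`, `b`, `c` with `n = 2`, `n_min = 1`, and `delim = "_"`, both `ngram_temp` and `ngram_fill` return `a`, `b`, `c`, `a_b`, `b_c`; the second version just avoids the intermediate vectors and the copy. In a real cpp11 implementation the writes would go into an R character vector, which is where the `unwind_protect()` considerations above come in.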

With CRAN cpp11 (0.4.3):

library(friends)

friends_data <- friends::friends |>
  dplyr::slice(1:10000) |>
  dplyr::pull(text) |>
  stringr::str_split(" ")

bench::mark(
  new = cpp11ngram:::cpp11_ngram(friends_data, n = 2L, n_min = 1, delim = "_"),
  old = textrecipes:::rcpp_ngram(friends_data, n = 2L, n_min = 1, delim = "_"),
  iterations = 50
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 new          53.2ms   59.6ms      16.5    1.43MB    0.337
#> 2 old          30.5ms   34.2ms      29.2    1.84MB    1.87

With r-lib/cpp11#299, where the nested unwind_protect() bug is fixed:

bench::mark(
  new = cpp11ngram:::cpp11_ngram(friends_data, n = 2L, n_min = 1, delim = "_"),
  old = textrecipes:::rcpp_ngram(friends_data, n = 2L, n_min = 1, delim = "_"),
  iterations = 50
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 new          34.1ms   37.9ms      26.4    1.43MB     1.69
#> 2 old          32.7ms   36.5ms      27.3    1.84MB     3.04

@EmilHvitfeldt
Owner

Thank you loads!

@EmilHvitfeldt EmilHvitfeldt merged commit 0b1997c into EmilHvitfeldt:main Nov 11, 2022