This repository has been archived by the owner on Feb 11, 2024. It is now read-only.

Optimization #20

Open · chainsawriot opened this issue Nov 15, 2023 · 11 comments

@chainsawriot (Contributor)

require(quanteda)
require(quanteda.proximity)
toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("war")))

It takes almost 4 seconds for only 59 documents.

@chainsawriot (Contributor, Author)

To my surprise, the slowest part is .get_min (not .cal_dist):

return(purrr::map_dbl(poss, .get_min, x = res) + count_from)
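
For context, here is a minimal reconstruction of what these two helpers plausibly compute, inferred only from the call pattern above; the function names come from the package, but the bodies are my assumptions, not the actual internals:

# Hypothetical reconstruction, inferred from the call pattern above.
# .cal_dist: distance from every token position to one keyword hit.
.cal_dist <- function(target_idx, poss) {
    abs(poss - target_idx)
}

# .get_min: at a single position, the minimum distance over all keyword hits.
# Looping over poss and indexing into a list element by element is the
# likely reason this path dominates the profile.
.get_min <- function(pos, x) {
    min(sapply(x, function(d) d[pos]))
}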

@schochastics (Member)

(Disclaimer: Not familiar with quanteda)

It also seems to depend on the keyword used:

R> bench::mark(
       war = tokens_proximity(toks, c("war")),
       senate = tokens_proximity(toks, c("Senate")),
       check = FALSE
   )
Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# A tibble: 2 × 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory    
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>    
1 war           2.63s    2.63s     0.380    6.96MB     12.9     1    34      2.63s <NULL> <Rprofmem>
2 senate     594.03ms 594.03ms     1.68     4.22MB     11.8     1     7   594.03ms <NULL> <Rprofmem>
# ℹ 2 more variables: time <list>, gc <list>

@schochastics (Member) commented Nov 16, 2023

@chainsawriot A suggestion for a speedup:

# current version
f <- function() {
    res <- lapply(target_idx, .cal_dist, poss = poss)
    purrr::map_dbl(poss, .get_min, x = res) + count_from
}
# new version
g <- function() {
    res <- sapply(target_idx, .cal_dist, poss = poss)
    apply(res, 1, min) + count_from
}

bench::mark(
    f(),
    g(),
    iterations = 100
)
# A tibble: 2 × 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory    
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>    
1 f()         62.71ms  64.06ms      15.6    48.2KB     9.57    62    38      3.97s <dbl>  <Rprofmem>
2 g()          1.07ms   1.15ms     852.    132.5KB    17.4     98     2   114.99ms <dbl>  <Rprofmem>
# ℹ 2 more variables: time <list>, gc <list>

Happy to make a PR if this would work for you.

@chainsawriot (Contributor, Author)

@schochastics Yes, because the calculation of proximity is skipped when the keyword is not found in the text. Try a word that occurs in all documents, e.g. "a"; it takes 12 seconds on my machine.
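
In other words, something like this early exit (a sketch with a hypothetical function name; the default value mirrors the rep(nt[i] + count_from, nt[i]) branch quoted later in this thread):

# Sketch with a hypothetical name; not the package's actual code.
proximity_one_doc <- function(toks_doc, keyword, count_from = 1) {
    target_idx <- which(toks_doc == keyword)
    if (length(target_idx) == 0) {
        # No hit: skip the distance computation and return a constant default.
        return(rep(length(toks_doc) + count_from, length(toks_doc)))
    }
    poss <- seq_along(toks_doc)
    res <- sapply(target_idx, function(t) abs(poss - t))
    apply(res, 1, min) + count_from
}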

@schochastics (Member)

With #22 and #23, we have probably reached what is possible:

require(quanteda)
require(quanteda.proximity)

toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("a")))
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                         <bch:t> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "tokens_proximity(toks, c(\"a\"))"    94ms  106ms      7.28     150MB     47.3

Rprof()
tmp <- tokens_proximity(toks, c("a"))
Rprof(NULL)
summaryRprof()
#> $by.self
#>               self.time self.pct total.time total.pct
#> "<Anonymous>"      0.02    33.33       0.06    100.00
#> "as.vector"        0.02    33.33       0.02     33.33
#> "FUN"              0.02    33.33       0.02     33.33
#>
#> $by.total
#>                           total.time total.pct self.time self.pct
#> "<Anonymous>"                   0.06    100.00      0.02    33.33
#> ".f"                            0.06    100.00      0.00     0.00
#> "block_exec"                    0.06    100.00      0.00     0.00
#> "call_block"                    0.06    100.00      0.00     0.00
#> "call_with_cleanup"             0.06    100.00      0.00     0.00
#> "do.call"                       0.06    100.00      0.00     0.00
#> "doTryCatch"                    0.06    100.00      0.00     0.00
#> "eng_r"                         0.06    100.00      0.00     0.00
#> "eval_with_user_handlers"       0.06    100.00      0.00     0.00
#> "eval"                          0.06    100.00      0.00     0.00
#> "evaluate_call"                 0.06    100.00      0.00     0.00
#> "evaluate::evaluate"            0.06    100.00      0.00     0.00
#> "evaluate"                      0.06    100.00      0.00     0.00
#> "get_proximity"                 0.06    100.00      0.00     0.00
#> "handle_error"                  0.06    100.00      0.00     0.00
#> "handle"                        0.06    100.00      0.00     0.00
#> "in_dir"                        0.06    100.00      0.00     0.00
#> "in_input_dir"                  0.06    100.00      0.00     0.00
#> "knitr::knit"                   0.06    100.00      0.00     0.00
#> "map_"                          0.06    100.00      0.00     0.00
#> "process_file"                  0.06    100.00      0.00     0.00
#> "process_group.block"           0.06    100.00      0.00     0.00
#> "process_group"                 0.06    100.00      0.00     0.00
#> "purrr::map"                    0.06    100.00      0.00     0.00
#> "rmarkdown::render"             0.06    100.00      0.00     0.00
#> "saveRDS"                       0.06    100.00      0.00     0.00
#> "timing_fn"                     0.06    100.00      0.00     0.00
#> "tokens_proximity"              0.06    100.00      0.00     0.00
#> "try"                           0.06    100.00      0.00     0.00
#> "tryCatch"                      0.06    100.00      0.00     0.00
#> "tryCatchList"                  0.06    100.00      0.00     0.00
#> "tryCatchOne"                   0.06    100.00      0.00     0.00
#> "with_indexed_errors"           0.06    100.00      0.00     0.00
#> "withCallingHandlers"           0.06    100.00      0.00     0.00
#> "withVisible"                   0.06    100.00      0.00     0.00
#> "as.vector"                     0.02     33.33      0.02    33.33
#> "FUN"                           0.02     33.33      0.02    33.33
#> "as.data.frame.matrix"          0.02     33.33      0.00     0.00
#> "as.data.frame"                 0.02     33.33      0.00     0.00
#> "lapply"                        0.02     33.33      0.00     0.00
#> "sapply"                        0.02     33.33      0.00     0.00
#>
#> $sample.interval
#> [1] 0.02
#>
#> $sampling.time
#> [1] 0.06

Created on 2023-11-16 with reprex v2.0.2

@chainsawriot changed the title from "Kind of slow" to "Optimization" on Nov 16, 2023
@schochastics (Member)

OK, this is a fun exercise. Here is the most optimized C++ I could come up with:

#include <Rcpp.h>
#include <algorithm>
// [[Rcpp::export]]
Rcpp::NumericVector rowMins_optim(const Rcpp::NumericMatrix& mat) {
    int nRows = mat.nrow();
    int nCols = mat.ncol();
    Rcpp::NumericVector mins(nRows);

    for(int i = 0; i < nRows; ++i) {
        double minValue = mat(i, 0);

        for(int j = 1; j < nCols; ++j) {
            minValue = std::min(minValue, mat(i, j));
        }

        mins[i] = minValue;
    }
    return mins;
}
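
(To try the snippet interactively, it can be compiled on the fly; rowmins.cpp is a hypothetical file name.)

# Compile the C++ above and export rowMins_optim() into the session.
Rcpp::sourceCpp("rowmins.cpp")
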
m <- matrix(sample(1:20, 50000, replace = TRUE), ncol = 5)
res <- bench::mark(
    apply = apply(m, 1, min),
    docall = do.call(pmin, as.data.frame(m)),
    Rcpp_optim = rowMins_optim(m)
)
res
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 apply        4.75ms   4.88ms      203.     391KB     55.7
#> 2 docall     184.79µs 190.39µs     5024.     511KB     41.4
#> 3 Rcpp_optim  36.49µs  40.31µs    24519.     471KB    221.

That would be quite a speedup, but I am not sure it justifies making the package a compiled one?
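
For what it's worth, matrixStats::rowMins() is an existing compiled row minimum, so it could be a middle ground: an extra dependency instead of compiled code in this package. A sketch of how it would slot into the benchmark above:

# Sketch: matrixStats as a middle ground (extra dependency, no compilation here).
library(matrixStats)
m <- matrix(sample(1:20, 50000, replace = TRUE), ncol = 5)
bench::mark(
    docall = do.call(pmin, as.data.frame(m)),
    matrixStats = rowMins(m)
)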

@chainsawriot (Contributor, Author)

@schochastics Let's make the package a compiled one. It's all but inevitable these days.

(I am cracking dfm() and I think for that one the only viable solution is C++.)

@schochastics (Member)

OK, I can prepare a PR making the necessary changes.

@chainsawriot (Contributor, Author) commented Nov 16, 2023

Another bottleneck: the memory footprint is also unacceptable.

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
require(quanteda.proximity)
#> Loading required package: quanteda.proximity
data_corpus_inaugural %>% tokens() %>% tokens_tolower() %>% tokens_compound(data_dictionary_LSD2015, concatenator = " ") -> toks
toks %>% tokens_proximity(c("ameri*")) -> tokp
bench::mark(dfm(tokp), dfm(toks), check = FALSE)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dfm(tokp)   557.4ms  557.4ms      1.79    1.11GB     46.6
#> 2 dfm(toks)    13.3ms   13.9ms     49.9     11.1MB     16.0

Created on 2023-11-16 with reprex v2.0.2

Arguably dfm(tokp) is a slightly more difficult task than dfm(toks), and the original dfm() is multi-threaded (with RcppParallel (3.x) or TBB (4.x) in C++ code). But the difference is too great.

chainsawriot added a commit that referenced this issue Nov 16, 2023
chainsawriot added a commit that referenced this issue Nov 16, 2023
* dfm2 ref #20

* Make dfm2 default

* Remove the repeated line
chainsawriot pushed a commit that referenced this issue Nov 17, 2023
* reimplement row min in C

* added registration
@schochastics (Member)

With #26:

require(quanteda)
require(quanteda.proximity)
toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("a")))
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                         <bch:t> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "tokens_proximity(toks, c(\"a\"))"  32.8ms 35.4ms      23.5    91.6MB     80.2

Created on 2023-11-17 with reprex v2.0.2

Also, for future reference, see my blog and a discussion on Mastodon.

@chainsawriot (Contributor, Author)

> With #26:
>
> require(quanteda)
> require(quanteda.proximity)
> toks <- data_corpus_inaugural %>% tokens()
> bench::mark(tokens_proximity(toks, c("a")))
> #> # A tibble: 1 × 6
> #>   expression                             min median `itr/sec` mem_alloc `gc/sec`
> #>   <bch:expr>                         <bch:t> <bch:>     <dbl> <bch:byt>    <dbl>
> #> 1 "tokens_proximity(toks, c(\"a\"))"  32.8ms 35.4ms      23.5    91.6MB     80.2
>
> Created on 2023-11-17 with reprex v2.0.2
>
> Also, for future reference, see my blog and a discussion on Mastodon.

With #44:

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
require(quanteda.proximity)
#> Loading required package: quanteda.proximity
toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("a")))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                         <bch:t> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "tokens_proximity(toks, c(\"a\"))"  56.9ms 58.8ms      12.9    98.9MB     47.9

Created on 2023-11-21 with reprex v2.0.2

The running time has doubled, but I think there is still room for optimization.

for (i in seq_along(x)) {
    if (dn[i] %in% idx$docname) {
        poss <- seq_len(nt[i])
        matched_rows <- idx$docname == dn[i]
        res <- mapply(cal_func, from = idx$from[matched_rows], to = idx$to[matched_rows], MoreArgs = list("poss" = poss))
        if (get_min) {
            output[[i]] <- row_mins_c(res) + count_from
        } else {
            output[[i]] <- res
        }
    } else {
        output[[i]] <- rep(nt[i] + count_from, nt[i])
    }
}

These lines are pure R.
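
One candidate optimization, sketched below and untested against the package: split the index by docname once, instead of rescanning idx$docname on every iteration.

# Sketch, untested: hoist the per-document lookup out of the loop.
idx_by_doc <- split(idx[c("from", "to")], idx$docname)
for (i in seq_along(x)) {
    hits <- idx_by_doc[[dn[i]]]
    if (!is.null(hits)) {
        poss <- seq_len(nt[i])
        res <- mapply(cal_func, from = hits$from, to = hits$to,
                      MoreArgs = list(poss = poss))
        output[[i]] <- if (get_min) row_mins_c(res) + count_from else res
    } else {
        output[[i]] <- rep(nt[i] + count_from, nt[i])
    }
}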
