Optimization #20

chainsawriot · 2023-11-15T16:41:59Z

require(quanteda)
require(quanteda.proximity)
toks <- data_corpus_inaugural %>% tokens
bench::mark(tokens_proximity(toks, c("war")))

Takes almost 4s for only 59 documents.

chainsawriot · 2023-11-16T09:39:00Z

To my surprise, the slowest part is get_min (not .cal_dist)

quanteda.proximity/R/get_dist.R

Line 17 in 34377dc

return(purrr::map_dbl(poss, .get_min, x = res) + count_from)

schochastics · 2023-11-16T09:45:20Z

(Disclaimer: Not familiar with quanteda)

It also depends on the Keyword used?

R> bench::mark(
       war = tokens_proximity(toks, c("war")),
       senate = tokens_proximity(toks, c("Senate")), check=FALSE
   )
Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# A tibble: 2 × 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory    
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>    
1 war           2.63s    2.63s     0.380    6.96MB     12.9     1    34      2.63s <NULL> <Rprofmem>
2 senate     594.03ms 594.03ms     1.68     4.22MB     11.8     1     7   594.03ms <NULL> <Rprofmem>
# ℹ 2 more variables: time <list>, gc <list>

schochastics · 2023-11-16T10:16:34Z

@chainsawriot suggestion for speedup:

#current version
f <- function(){
     res <- lapply(target_idx, .cal_dist, poss = poss)
     purrr::map_dbl(poss, .get_min, x = res) + count_from
}
#new version
g <- function(){
    res <- sapply(target_idx, .cal_dist, poss = poss)
    apply(res,1,min)+count_from
}
 
bench::mark(
    f(),
    g(),iterations = 100
)
# A tibble: 2 × 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory    
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>    
1 f()         62.71ms  64.06ms      15.6    48.2KB     9.57    62    38      3.97s <dbl>  <Rprofmem>
2 g()          1.07ms   1.15ms     852.    132.5KB    17.4     98     2   114.99ms <dbl>  <Rprofmem>
# ℹ 2 more variables: time <list>, gc <list>

Happy to make a PR if this would work for you

chainsawriot · 2023-11-16T10:24:39Z

@schochastics Yes. Because the calculation of proximity is skipped when the keyword is not found in the text. Try a word that occurs in all articles, e.g. "a". It takes 12s on my machine.

schochastics · 2023-11-16T12:28:38Z

with #22 and #23, we probably reached what is possible

require(quanteda)
require(quanteda.proximity)

toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("a")))
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                         <bch:t> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "tokens_proximity(toks, c(\"a\"))"    94ms  106ms      7.28     150MB     47.3

Rprof()
tmp <- tokens_proximity(toks, c("a"))
Rprof(NULL)
summaryRprof()
#> $by.self
#>               self.time self.pct total.time total.pct
#> "<Anonymous>"      0.02    33.33       0.06    100.00
#> "as.vector"        0.02    33.33       0.02     33.33
#> "FUN"              0.02    33.33       0.02     33.33
#>
#> $by.total
#>                           total.time total.pct self.time self.pct
#> "<Anonymous>"                   0.06    100.00      0.02    33.33
#> ".f"                            0.06    100.00      0.00     0.00
#> "block_exec"                    0.06    100.00      0.00     0.00
#> "call_block"                    0.06    100.00      0.00     0.00
#> "call_with_cleanup"             0.06    100.00      0.00     0.00
#> "do.call"                       0.06    100.00      0.00     0.00
#> "doTryCatch"                    0.06    100.00      0.00     0.00
#> "eng_r"                         0.06    100.00      0.00     0.00
#> "eval_with_user_handlers"       0.06    100.00      0.00     0.00
#> "eval"                          0.06    100.00      0.00     0.00
#> "evaluate_call"                 0.06    100.00      0.00     0.00
#> "evaluate::evaluate"            0.06    100.00      0.00     0.00
#> "evaluate"                      0.06    100.00      0.00     0.00
#> "get_proximity"                 0.06    100.00      0.00     0.00
#> "handle_error"                  0.06    100.00      0.00     0.00
#> "handle"                        0.06    100.00      0.00     0.00
#> "in_dir"                        0.06    100.00      0.00     0.00
#> "in_input_dir"                  0.06    100.00      0.00     0.00
#> "knitr::knit"                   0.06    100.00      0.00     0.00
#> "map_"                          0.06    100.00      0.00     0.00
#> "process_file"                  0.06    100.00      0.00     0.00
#> "process_group.block"           0.06    100.00      0.00     0.00
#> "process_group"                 0.06    100.00      0.00     0.00
#> "purrr::map"                    0.06    100.00      0.00     0.00
#> "rmarkdown::render"             0.06    100.00      0.00     0.00
#> "saveRDS"                       0.06    100.00      0.00     0.00
#> "timing_fn"                     0.06    100.00      0.00     0.00
#> "tokens_proximity"              0.06    100.00      0.00     0.00
#> "try"                           0.06    100.00      0.00     0.00
#> "tryCatch"                      0.06    100.00      0.00     0.00
#> "tryCatchList"                  0.06    100.00      0.00     0.00
#> "tryCatchOne"                   0.06    100.00      0.00     0.00
#> "with_indexed_errors"           0.06    100.00      0.00     0.00
#> "withCallingHandlers"           0.06    100.00      0.00     0.00
#> "withVisible"                   0.06    100.00      0.00     0.00
#> "as.vector"                     0.02     33.33      0.02    33.33
#> "FUN"                           0.02     33.33      0.02    33.33
#> "as.data.frame.matrix"          0.02     33.33      0.00     0.00
#> "as.data.frame"                 0.02     33.33      0.00     0.00
#> "lapply"                        0.02     33.33      0.00     0.00
#> "sapply"                        0.02     33.33      0.00     0.00
#>
#> $sample.interval
#> [1] 0.02
#>
#> $sampling.time
#> [1] 0.06

^{Created on 2023-11-16 with reprex v2.0.2}

schochastics · 2023-11-16T13:41:37Z

Ok this is a fun exercise. The most optimized C++ I could come up with.

#include <Rcpp.h>
#include <algorithm>
// [[Rcpp::export]]
Rcpp::NumericVector rowMins_optim(const Rcpp::NumericMatrix& mat) {
    int nRows = mat.nrow();
    int nCols = mat.ncol();
    Rcpp::NumericVector mins(nRows);

    for(int i = 0; i < nRows; ++i) {
        double minValue = mat(i, 0);

        for(int j = 1; j < nCols; ++j) {
            minValue = std::min(minValue, mat(i, j));
        }

        mins[i] = minValue;
    }
    return mins;
}

m <- matrix(sample(1:20, 50000, replace = TRUE), ncol = 5)
res <- bench::mark(
    apply = apply(m, 1, min),
    docall = do.call(pmin, as.data.frame(m)),
    Rcpp_optim = rowMins_optim(m)
)
res
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 apply        4.75ms   4.88ms      203.     391KB     55.7
#> 2 docall     184.79µs 190.39µs     5024.     511KB     41.4
#> 3 Rcpp_optim  36.49µs  40.31µs    24519.     471KB    221.

Would be quite a speedup but not sure if this justifies making the pkg a compiled one?

chainsawriot · 2023-11-16T13:57:39Z

@schochastics Let's make the pkg a compiled one. It's like inevitable these days.

(I am cracking dfm() and I think for that one the only viable solution is C++.)

schochastics · 2023-11-16T14:07:10Z

Ok I can prepare a PR making the necessary changes

chainsawriot · 2023-11-16T14:12:22Z

Another bottleneck; the memory footprint is also kind of unacceptable.

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
require(quanteda.proximity)
#> Loading required package: quanteda.proximity
data_corpus_inaugural %>% tokens %>% tokens_tolower %>% tokens_compound(data_dictionary_LSD2015, concatenator = " ") -> toks
toks %>% tokens_proximity(c("ameri*")) -> tokp
bench::mark(dfm(tokp), dfm(toks), check = FALSE)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dfm(tokp)   557.4ms  557.4ms      1.79    1.11GB     46.6
#> 2 dfm(toks)    13.3ms   13.9ms     49.9     11.1MB     16.0

^{Created on 2023-11-16 with reprex v2.0.2}

Arguably dfm(tokp) is a slightly more difficult task than dfm(toks)~~; and the original dfm() is muli-threaded (with RcppParallel (3.x) or TBB (4.x) in C++ code)~~. But the difference is too great.

* dfm2 ref #20 * Make dfm2 default * Remove the repeated line

* reimplement row min in C * added registration

schochastics · 2023-11-17T09:03:39Z

with #26

require(quanteda)
require(quanteda.proximity)
toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("a")))
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                         <bch:t> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "tokens_proximity(toks, c(\"a\"))"  32.8ms 35.4ms      23.5    91.6MB     80.2

^{Created on 2023-11-17 with reprex v2.0.2}

also, for future reference see my blog and a discussion on mastodon

chainsawriot · 2023-11-21T21:03:28Z

with #26

require(quanteda)
require(quanteda.proximity)
toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("a")))
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                         <bch:t> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "tokens_proximity(toks, c(\"a\"))"  32.8ms 35.4ms      23.5    91.6MB     80.2

Created on 2023-11-17 with reprex v2.0.2

also, for future reference see my blog and a discussion on mastodon

With #44

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
require(quanteda.proximity)
#> Loading required package: quanteda.proximity
toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("a")))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                         <bch:t> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "tokens_proximity(toks, c(\"a\"))"  56.9ms 58.8ms      12.9    98.9MB     47.9

^{Created on 2023-11-21 with reprex v2.0.2}

The running time is 2x. But I think there is room for optz.

quanteda.proximity/R/get_dist.R

Lines 25 to 38 in 6272667

    
           for (i in seq_along(x)) { 
        
               if (dn[i] %in% idx$docname) { 
        
                   poss <- seq_len(nt[i]) 
        
                   matched_rows <- idx$docname == dn[i] 
        
                   res <- mapply(cal_func, from = idx$from[matched_rows], to = idx$to[matched_rows], MoreArgs = list("poss" = poss)) 
        
                   if (get_min) { 
        
                       output[[i]] <- row_mins_c(res) + count_from 
        
                   } else { 
        
                       output[[i]] <- res 
        
                   } 
        
               } else { 
        
                   output[[i]] <- rep(nt[i] + count_from, nt[i]) 
        
               } 
        
           }

These lines are pure R.

chainsawriot changed the title ~~Kind of slow~~ Optimization Nov 16, 2023

chainsawriot added a commit that referenced this issue Nov 16, 2023

dfm2 ref #20

11771b9

chainsawriot added a commit that referenced this issue Nov 16, 2023

Speed up dfm.tokens_with_proximity ref #20 (#25)

bc06964

* dfm2 ref #20 * Make dfm2 default * Remove the repeated line

chainsawriot pushed a commit that referenced this issue Nov 17, 2023

reimplement row min in C ref #20 (#26)

45b0dbd

* reimplement row min in C * added registration

chainsawriot mentioned this issue Nov 21, 2023

Use quanteda::index() ref #38 #44

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimization #20

Optimization #20

chainsawriot commented Nov 15, 2023

chainsawriot commented Nov 16, 2023

schochastics commented Nov 16, 2023

schochastics commented Nov 16, 2023 •

edited

Loading

chainsawriot commented Nov 16, 2023

schochastics commented Nov 16, 2023

schochastics commented Nov 16, 2023

chainsawriot commented Nov 16, 2023

schochastics commented Nov 16, 2023

chainsawriot commented Nov 16, 2023 •

edited

Loading

schochastics commented Nov 17, 2023

chainsawriot commented Nov 21, 2023

Optimization #20

Optimization #20

Comments

chainsawriot commented Nov 15, 2023

chainsawriot commented Nov 16, 2023

schochastics commented Nov 16, 2023

schochastics commented Nov 16, 2023 • edited Loading

chainsawriot commented Nov 16, 2023

schochastics commented Nov 16, 2023

schochastics commented Nov 16, 2023

chainsawriot commented Nov 16, 2023

schochastics commented Nov 16, 2023

chainsawriot commented Nov 16, 2023 • edited Loading

schochastics commented Nov 17, 2023

chainsawriot commented Nov 21, 2023

schochastics commented Nov 16, 2023 •

edited

Loading

chainsawriot commented Nov 16, 2023 •

edited

Loading