Optimization #20

Takes almost 4s for only 59 documents.
To my surprise, the slowest part is quanteda.proximity/R/get_dist.R, line 17 at commit 34377dc.
(Disclaimer: I am not familiar with quanteda.) It also depends on the keyword used?

```r
bench::mark(
  war = tokens_proximity(toks, c("war")),
  senate = tokens_proximity(toks, c("Senate")),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 13
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>
#> 1 war           2.63s    2.63s     0.380    6.96MB     12.9     1    34      2.63s <NULL> <Rprofmem>
#> 2 senate     594.03ms 594.03ms     1.68     4.22MB     11.8     1     7   594.03ms <NULL> <Rprofmem>
#> # ℹ 2 more variables: time <list>, gc <list>
```
@chainsawriot Suggestion for a speedup:

```r
# current version
f <- function() {
  res <- lapply(target_idx, .cal_dist, poss = poss)
  purrr::map_dbl(poss, .get_min, x = res) + count_from
}

# new version
g <- function() {
  res <- sapply(target_idx, .cal_dist, poss = poss)
  apply(res, 1, min) + count_from
}

bench::mark(
  f(),
  g(),
  iterations = 100
)
#> # A tibble: 2 × 13
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>
#> 1 f()         62.71ms  64.06ms      15.6    48.2KB     9.57    62    38      3.97s <dbl>  <Rprofmem>
#> 2 g()          1.07ms   1.15ms    852.     132.5KB    17.4     98     2   114.99ms <dbl>  <Rprofmem>
#> # ℹ 2 more variables: time <list>, gc <list>
```

Happy to make a PR if this would work for you.
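For reference, a minimal toy illustration of why `g()` is so much faster (assuming `.cal_dist()` returns absolute position differences, which is not shown here): `sapply()` simplifies the per-keyword results to a matrix, so the row-wise minimum can be taken in a single `apply()` pass instead of mapping over every position at the R level.

```r
# Toy illustration (assumed stand-in for .cal_dist): distances from each
# position to each keyword occurrence.
poss <- 1:6
target_idx <- c(2, 5)
str(lapply(target_idx, function(t) abs(poss - t)))    # f(): a list of 2 vectors
res <- sapply(target_idx, function(t) abs(poss - t))  # g(): a 6 x 2 matrix
apply(res, 1, min)                                    # distance to the nearest keyword
```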
@schochastics Yes, because the calculation of proximity is skipped when the keyword is not found in the text. Try a word that occurs in all articles, e.g. "a". It takes 12s on my machine.
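A purely illustrative sketch of that skip (hypothetical code, not the package's actual get_dist.R; the fallback value for absent keywords is an assumption):

```r
# Hypothetical sketch of the early exit described above; the real
# get_dist.R may use a different fallback value for absent keywords.
get_dist_sketch <- function(tokens_vec, keyword, count_from = 1) {
  target_idx <- which(tokens_vec == keyword)
  poss <- seq_along(tokens_vec)
  if (length(target_idx) == 0) {
    # keyword absent: skip the expensive distance computation entirely
    return(rep(length(poss) + count_from, length(poss)))
  }
  res <- sapply(target_idx, function(t) abs(poss - t))
  apply(matrix(res, nrow = length(poss)), 1, min) + count_from
}
```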
With #22 and #23, we have probably reached what is possible.

```r
require(quanteda)
require(quanteda.proximity)
toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("a")))
#> # A tibble: 1 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:t> <bch:> <dbl> <bch:byt> <dbl>
#> 1 "tokens_proximity(toks, c(\"a\"))" 94ms 106ms 7.28 150MB 47.3
Rprof()
tmp <- tokens_proximity(toks, c("a"))
Rprof(NULL)
summaryRprof()
#> $by.self
#> self.time self.pct total.time total.pct
#> "<Anonymous>" 0.02 33.33 0.06 100.00
#> "as.vector" 0.02 33.33 0.02 33.33
#> "FUN" 0.02 33.33 0.02 33.33
#>
#> $by.total
#> total.time total.pct self.time self.pct
#> "<Anonymous>" 0.06 100.00 0.02 33.33
#> ".f" 0.06 100.00 0.00 0.00
#> "block_exec" 0.06 100.00 0.00 0.00
#> "call_block" 0.06 100.00 0.00 0.00
#> "call_with_cleanup" 0.06 100.00 0.00 0.00
#> "do.call" 0.06 100.00 0.00 0.00
#> "doTryCatch" 0.06 100.00 0.00 0.00
#> "eng_r" 0.06 100.00 0.00 0.00
#> "eval_with_user_handlers" 0.06 100.00 0.00 0.00
#> "eval" 0.06 100.00 0.00 0.00
#> "evaluate_call" 0.06 100.00 0.00 0.00
#> "evaluate::evaluate" 0.06 100.00 0.00 0.00
#> "evaluate" 0.06 100.00 0.00 0.00
#> "get_proximity" 0.06 100.00 0.00 0.00
#> "handle_error" 0.06 100.00 0.00 0.00
#> "handle" 0.06 100.00 0.00 0.00
#> "in_dir" 0.06 100.00 0.00 0.00
#> "in_input_dir" 0.06 100.00 0.00 0.00
#> "knitr::knit" 0.06 100.00 0.00 0.00
#> "map_" 0.06 100.00 0.00 0.00
#> "process_file" 0.06 100.00 0.00 0.00
#> "process_group.block" 0.06 100.00 0.00 0.00
#> "process_group" 0.06 100.00 0.00 0.00
#> "purrr::map" 0.06 100.00 0.00 0.00
#> "rmarkdown::render" 0.06 100.00 0.00 0.00
#> "saveRDS" 0.06 100.00 0.00 0.00
#> "timing_fn" 0.06 100.00 0.00 0.00
#> "tokens_proximity" 0.06 100.00 0.00 0.00
#> "try" 0.06 100.00 0.00 0.00
#> "tryCatch" 0.06 100.00 0.00 0.00
#> "tryCatchList" 0.06 100.00 0.00 0.00
#> "tryCatchOne" 0.06 100.00 0.00 0.00
#> "with_indexed_errors" 0.06 100.00 0.00 0.00
#> "withCallingHandlers" 0.06 100.00 0.00 0.00
#> "withVisible" 0.06 100.00 0.00 0.00
#> "as.vector" 0.02 33.33 0.02 33.33
#> "FUN" 0.02 33.33 0.02 33.33
#> "as.data.frame.matrix" 0.02 33.33 0.00 0.00
#> "as.data.frame" 0.02 33.33 0.00 0.00
#> "lapply" 0.02 33.33 0.00 0.00
#> "sapply" 0.02 33.33 0.00 0.00
#>
#> $sample.interval
#> [1] 0.02
#>
#> $sampling.time
#> [1] 0.06
```

Created on 2023-11-16 with reprex v2.0.2
Ok, this is a fun exercise. The most optimized C++ I could come up with:

```cpp
#include <Rcpp.h>
#include <algorithm>

// Row-wise minimum of a numeric matrix: one pass per row,
// carrying a running minimum across the columns.
// [[Rcpp::export]]
Rcpp::NumericVector rowMins_optim(const Rcpp::NumericMatrix& mat) {
  int nRows = mat.nrow();
  int nCols = mat.ncol();
  Rcpp::NumericVector mins(nRows);
  for (int i = 0; i < nRows; ++i) {
    double minValue = mat(i, 0);
    for (int j = 1; j < nCols; ++j) {
      minValue = std::min(minValue, mat(i, j));
    }
    mins[i] = minValue;
  }
  return mins;
}
```

```r
m <- matrix(sample(1:20, 50000, replace = TRUE), ncol = 5)
res <- bench::mark(
  apply = apply(m, 1, min),
  docall = do.call(pmin, as.data.frame(m)),
  Rcpp_optim = rowMins_optim(m)
)
res
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 apply 4.75ms 4.88ms 203. 391KB 55.7
#> 2 docall 184.79µs 190.39µs 5024. 511KB 41.4
#> 3 Rcpp_optim   36.49µs  40.31µs    24519.     471KB     221.
```

Would be quite a speedup, but I am not sure if this justifies making the pkg a compiled one?
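For anyone reproducing the benchmark above, a sketch of the compile step, assuming the C++ code is saved to a file named rowMins_optim.cpp (the file name is hypothetical):

```r
# Compile and load the C++ function above (requires Rcpp and a C++ toolchain);
# the file name is an assumption for this example.
Rcpp::sourceCpp("rowMins_optim.cpp")
rowMins_optim(matrix(c(3, 1, 4, 1, 5, 9), nrow = 2))  # returns c(3, 1)
```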
@schochastics Let's make the pkg a compiled one. It's like inevitable these days. (I am cracking
Ok, I can prepare a PR making the necessary changes.
Another bottleneck: the memory footprint is also kind of unacceptable.

```r
require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
require(quanteda.proximity)
#> Loading required package: quanteda.proximity
data_corpus_inaugural %>% tokens %>% tokens_tolower %>% tokens_compound(data_dictionary_LSD2015, concatenator = " ") -> toks
toks %>% tokens_proximity(c("ameri*")) -> tokp
bench::mark(dfm(tokp), dfm(toks), check = FALSE)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 dfm(tokp) 557.4ms 557.4ms 1.79 1.11GB 46.6
#> 2 dfm(toks)    13.3ms   13.9ms     49.9     11.1MB     16.0
```

Created on 2023-11-16 with reprex v2.0.2

Arguably
With #26:

```r
require(quanteda)
require(quanteda.proximity)
toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("a")))
#> # A tibble: 1 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:t> <bch:> <dbl> <bch:byt> <dbl>
#> 1 "tokens_proximity(toks, c(\"a\"))" 32.8ms 35.4ms 23.5 91.6MB 80.2 Created on 2023-11-17 with reprex v2.0.2 also, for future reference see my blog and a discussion on mastodon |
With #44:

```r
require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
require(quanteda.proximity)
#> Loading required package: quanteda.proximity
toks <- data_corpus_inaugural %>% tokens()
bench::mark(tokens_proximity(toks, c("a")))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:t> <bch:> <dbl> <bch:byt> <dbl>
#> 1 "tokens_proximity(toks, c(\"a\"))" 56.9ms 58.8ms 12.9 98.9MB 47.9 Created on 2023-11-21 with reprex v2.0.2 The running time is 2x. But I think there is room for optz. quanteda.proximity/R/get_dist.R Lines 25 to 38 in 6272667
These lines are pure R.
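A hedged sketch of the kind of change that could help, assuming those pure-R lines build a position-by-keyword distance matrix and reduce it row-wise (the function and argument names below are assumptions, not the package's actual code); the reduction could then reuse the compiled rowMins_optim() from above:

```r
# Hypothetical refactor, not the package's code: keep the distance-matrix
# construction in R but push the row-wise minimum into compiled code.
get_dist_sketch2 <- function(poss, target_idx, count_from = 1) {
  res <- sapply(target_idx, function(t) abs(poss - t))
  rowMins_optim(matrix(res, nrow = length(poss))) + count_from
}
```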