Provide a few optimizations to the matchindex C++ function: #695

Open
wants to merge 1 commit into
base: master

stefvanbuuren
Member

@stefvanbuuren stefvanbuuren commented Feb 27, 2025

Changes

This PR makes the following changes to the matchindex C++ function:

  • Use d[x] instead of d(x) for NumericVector
  • Use std::clamp() instead of manual if conditions for k
  • Remove redundant std::vector<double> ysort(n1)
  • Use std::distance() instead of unnecessary iterator assignments
  • Use seq_len(k) directly instead of creating an extra IntegerVector kv(k)
  • Avoid redundant copies of sampled indices
  • Replace unnecessary lambda [yshuf] capture with direct reference
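
Three of the idioms listed above can be illustrated in isolation. The following is a minimal sketch, not the code from this PR: it uses plain std::vector instead of Rcpp's NumericVector so that it compiles standalone, and the helper names clamp_k, sorted_order and lower_rank are hypothetical. Note that std::clamp requires C++17, so the package would presumably need to be compiled with that standard or later.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <vector>

// Hypothetical helper: restrict k to the range [1, n] in a single call,
// replacing a pair of manual if conditions. Requires C++17 and n >= 1
// (std::clamp has undefined behavior when lo > hi).
inline int clamp_k(int k, int n) {
    assert(n >= 1);
    return std::clamp(k, 1, n);
}

// Hypothetical helper: indices that sort y in ascending order.
// The lambda captures y by reference ([&y]); capturing by value ([y])
// would copy the whole vector into the closure.
inline std::vector<std::size_t> sorted_order(const std::vector<double>& y) {
    std::vector<std::size_t> idx(y.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    std::sort(idx.begin(), idx.end(),
              [&y](std::size_t a, std::size_t b) { return y[a] < y[b]; });
    return idx;
}

// Hypothetical helper: number of elements of the ascending-sorted vector ys
// that are strictly smaller than v. std::distance converts the iterator
// returned by std::lower_bound into a position without a named temporary.
inline std::ptrdiff_t lower_rank(const std::vector<double>& ys, double v) {
    return std::distance(ys.begin(),
                         std::lower_bound(ys.begin(), ys.end(), v));
}
```

With the donor values from the example below, d = (-5, 5, 0, 10, 12), sorted_order returns the zero-based indices 0, 2, 1, 3, 4, and lower_rank on the sorted values gives 2 for a target of 2, since only -5 and 0 are smaller.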

Correctness

With the same seed, the optimized version returns exactly the same result as the original:

> # Inputs need not be sorted
> d <- c(-5, 5, 0, 10, 12)
> t <- c(-6, -4, 0, 2, 4, -2, 6)
> 
> # Index (in vector d) of closest match
> set.seed(1)
> idx <- matchindex(d, t)
> idx
[1] 5 2 2 1 4 3 1
> 
> # Compare with optimized version
> set.seed(1)
> idx <- matchindex_optimized(d, t)
> idx
[1] 5 2 2 1 4 3 1

Speed-up

library(Rcpp)
library(microbenchmark)

# Load original and optimized versions of matchindex()
sourceCpp("original_matchindex.cpp")
sourceCpp("optimized_matchindex.cpp")

# Generate test data
set.seed(42)
n_d <- 100000  # Number of donor cases
n_t <- 10000   # Number of target cases

d <- runif(n_d, -10, 10)  # Random donor values
t <- runif(n_t, -10, 10)  # Random target values
k <- 5  # Number of nearest neighbors to sample

benchmark_results <- microbenchmark(
  original = matchindex(d, t, k),
  optimized = matchindex_optimized(d, t, k),
  times = 10
)

summary(benchmark_results)

Result:

       expr  min   lq mean median   uq  max neval cld
1  original 14.6 14.7 15.7   14.8 15.6 19.2    10  a 
2 optimized 13.9 14.1 14.5   14.6 15.0 15.0    10   b

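In typical use cases, the changes result in a speed-up of about 10%. In the run above, the mean time drops from 15.7 to 14.5 (about 8%) and the median from 14.8 to 14.6 (about 1%).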
