Commit

documentation
BERENZ committed Dec 4, 2023
1 parent eacfbba commit dd2827f
Showing 11 changed files with 95 additions and 95 deletions.
2 changes: 1 addition & 1 deletion .Rproj.user/E3DB6272/pcs/files-pane.pper
@@ -9,5 +9,5 @@
 "ascending": false
 }
 ],
-"path": "~/git/nauka/ncn-foreigners/software/blocking"
+"path": "~/git/nauka/ncn-foreigners/software/blocking/R"
 }
2 changes: 1 addition & 1 deletion .Rproj.user/E3DB6272/pcs/source-pane.pper
@@ -1,4 +1,4 @@
 {
-"activeTab": 2,
+"activeTab": 3,
 "activeTabSourceWindow0": 0
 }
2 changes: 1 addition & 1 deletion .Rproj.user/E3DB6272/pcs/workbench-pane.pper
@@ -1,5 +1,5 @@
 {
 "TabSet1": 3,
-"TabSet2": 4,
+"TabSet2": 0,
 "TabZoom": {}
 }
43 changes: 22 additions & 21 deletions R/blocking.R
@@ -17,32 +17,33 @@
 #' @author Maciej Beręsewicz
 #'
 #' @description
-#' Function that creates shingles (strings with 2 characters), applies approximate nearest neighbour search using
-#' [rnndescent], RcppHNSW, [RcppAnnoy] and [mlpack] and creates blocks using [igraph].
+#' Function creates shingles (strings with 2 characters, default), applies approximate nearest neighbour (ANN) algorithms via the [rnndescent], [RcppHNSW], [RcppAnnoy] and [mlpack] packages,
+#' and creates blocks using graphs via [igraph].
 #'
-#' @param x reference data (character vector or a matrix),
-#' @param y query data (types the same), if not provided NULL by default,
+#' @param x reference data (a character vector or a matrix),
+#' @param y query data (a character vector or a matrix), if not provided NULL by default and thus deduplication is performed,
 #' @param deduplication whether deduplication should be applied (default TRUE as y is set to NULL),
-#' @param on variables for ann search (currently not supported),
-#' @param on_blocking variables for blocking (currently not supported),
-#' @param ann algorithm to be used for searching for ann (possible, \code{c("hnsw", "lsh", "annoy", "kd")}, default \code{"hnsw"}),
-#' @param distance distance metric (default \code{cosine}),
-#' @param ann_write writing an index to file. Two files will be created: 1) an index, 2) and txt file with column names (currently not supported),
-#' @param ann_colnames testing
-#' @param true_blocks matrix with true blocks to calculate evaluation metrics (all metrics from [igraph::compare()] are returned).
-#' @param verbose whether log should be provided (0 = none, 1 = main, 2 = ann algorithms),
-#' @param graph whether a graph should be returned,
-#' @param seed seed for the algorithms,
-#' @param n_threads number of threads used for the ann,
-#' @param control_txt list of controls for text data,
-#' @param control_ann list of controls for ann algorithms.
+#' @param on variables for ANN search (currently not supported),
+#' @param on_blocking variables for blocking records before ANN search (currently not supported),
+#' @param ann algorithm to be used for searching for ann (possible, \code{c("nnd", "hnsw", "annoy", "lsh", "kd")}, default \code{"nnd"} which corresponds to nearest neighbour descent method),
+#' @param distance distance metric (default \code{cosine}, more options are possible see details),
+#' @param ann_write writing an index to file. Two files will be created: 1) an index, 2) and text file with column names,
+#' @param ann_colnames file with column names if \code{x} or \code{y} are indices saved on the disk (currently not supported),
+#' @param true_blocks matrix with true blocks to calculate evaluation metrics (standard metrics based on confusion matrix as well as all metrics from [igraph::compare()] are returned).
+#' @param verbose whether log should be provided (0 = none, 1 = main, 2 = ANN algorithm verbose used),
+#' @param graph whether a graph should be returned (default FALSE),
+#' @param seed seed for the algorithms (for reproducibility),
+#' @param n_threads number of threads used for the ANN algorithms and adding data for index and query,
+#' @param control_txt list of controls for text data (passed only to [text2vec::itoken_parallel] or [text2vec::itoken]),
+#' @param control_ann list of controls for the ANN algorithms.
 #'
 #' @returns Returns a list with containing:\cr
 #' \itemize{
-#' \item{\code{result} -- \code{data.frame} with indices (rows) of x, y and block}
-#' \item{\code{ann} -- name of the ann algorithm used,}
-#' \item{\code{metrics} -- metrics, if \code{true_blocks} is provided,}
-#' \item{\code{colnames} -- variable names (colnames) used for search.}
+#' \item{\code{result} -- \code{data.table} with indices (rows) of x, y, block and distance between points}
+#' \item{\code{method} -- name of the ANN algorithm used,}
+#' \item{\code{metrics} -- metrics for quality assessment, if \code{true_blocks} is provided,}
+#' \item{\code{colnames} -- variable names (colnames) used for search,}
+#' \item{\code{graph} -- \code{igraph} class object.}
 #' }
 #'
 #' @examples
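The pipeline described in the `blocking()` documentation (2-character shingles, nearest-neighbour search, graph-based blocks) can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the helper `shingles()` and the sample `records` are hypothetical, an exact brute-force cosine search stands in for the ANN backends, and a simple label-propagation loop stands in for the connected components that the package obtains via [igraph].

```r
# Hypothetical helper: split one string into overlapping n-character shingles.
shingles <- function(s, n = 2L) {
  s <- tolower(s)
  if (nchar(s) < n) return(s)
  starts <- seq_len(nchar(s) - n + 1L)
  substring(s, starts, starts + n - 1L)
}

# Made-up records: two near-duplicate pairs.
records <- c("montypython", "montypyton", "kowalski", "kowalsky")

# Shingle-count matrix: one row per record, one column per distinct shingle.
sh <- lapply(records, shingles)
vocab <- sort(unique(unlist(sh)))
dtm <- t(vapply(sh,
                function(v) as.numeric(table(factor(v, levels = vocab))),
                numeric(length(vocab))))

# Exact cosine distance (a brute-force stand-in for the ANN search).
norms <- sqrt(rowSums(dtm^2))
cos_dist <- 1 - (dtm %*% t(dtm)) / (norms %o% norms)
diag(cos_dist) <- Inf                      # a record is not its own neighbour
nearest <- apply(cos_dist, 1, which.min)   # index of the closest other record

# Blocks = connected components of the graph with edges i -- nearest[i],
# found here by propagating the smallest label along edges until stable.
edges <- cbind(seq_along(nearest), nearest)
block <- seq_along(records)
repeat {
  new_block <- block
  for (e in seq_len(nrow(edges))) {
    new_block[edges[e, ]] <- min(new_block[edges[e, ]])
  }
  if (identical(new_block, block)) break
  block <- new_block
}
block <- match(block, unique(block))       # relabel blocks as 1, 2, ...

data.frame(record = records, nearest = nearest, block = block)
```

In this toy input the two spelling variants of each name share most of their shingles and nothing with the other name, so each pair ends up in its own block.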
14 changes: 7 additions & 7 deletions R/controls.R
@@ -1,15 +1,15 @@
-#' @title Controls for approximate nearest neighbours algoritms
+#' @title Controls for approximate nearest neighbours algorithms
 #'
 #' @author Maciej Beręsewicz
 #'
 #' @description
 #' Controls for ANN algorithms used in the package
 #'
-#' @param sparse whether sparse data should be used as an input for algorithms.
-#' @param nnd parameters for [rnndescent::rnnd_build()] and [rnndescent::rnnd_query()].
-#' @param hnsw parameters for [RcppHNSW::hnsw_build()] and [RcppHNSW::hnsw_search()].
-#' @param lsh parameters for [mlpack::lsh()].
-#' @param annoy parameters for [RcppAnnoy] package.
+#' @param sparse whether sparse data should be used as an input for algorithms,
+#' @param nnd parameters for [rnndescent::rnnd_build()] and [rnndescent::rnnd_query()],
+#' @param hnsw parameters for [RcppHNSW::hnsw_build()] and [RcppHNSW::hnsw_search()],
+#' @param lsh parameters for [mlpack::lsh()],
+#' @param annoy parameters for [RcppAnnoy] package,
 #' @param kd parameters for [mlpack::knn()] function.
 #'
 #' @returns Returns a list with parameters
@@ -62,7 +62,7 @@ controls_ann <- function(
 kd = kd)
 }

-#' @title Controls for text data
+#' @title Controls for processing text data
 #'
 #' @author Maciej Beręsewicz
 #'
18 changes: 9 additions & 9 deletions R/method_annoy.R
@@ -6,17 +6,17 @@
 #' @importFrom methods new
 #' @importFrom data.table data.table
 #'
-#' @title An internal function to use Annoy algorithm via RcppAnnoy.
+#' @title An internal function to use Annoy algorithm via the [RcppAnnoy] package.
 #' @author Maciej Beręsewicz
 #'
-#' @param x Deduplication or reference data.
-#' @param y Query data.
-#' @param k Number of neighbors to return.
-#' @param distance distance metric
-#' @param verbose If TRUE, log messages to the console.
-#' @param seed seed for the pseudo-random numbers algorithm.
-#' @param path path to write the index.
-#' @param control controls for \code{lsh} or \code{kd}.
+#' @param x deduplication or reference data,
+#' @param y query data,
+#' @param k number of neighbours to return,
+#' @param distance distance metric,
+#' @param verbose if TRUE, log messages to the console,
+#' @param seed seed for the pseudo-random numbers algorithm,
+#' @param path path to write the index,
+#' @param control controls for \code{new} or \code{build} methods for [RcppAnnoy].
 #'
 #' @description
 #' See details of the [RcppAnnoy] package.
20 changes: 10 additions & 10 deletions R/method_hnsw.R
@@ -9,20 +9,20 @@
 #' @importFrom utils setTxtProgressBar
 #' @importFrom utils txtProgressBar
 #'
-#' @title An internal function to use hnsw algorithm via RcppHNSW.
+#' @title An internal function to use HNSW algorithm via the [RcppHNSW] package.
 #' @author Maciej Beręsewicz
 #'
-#' @param x Deduplication or reference data.
-#' @param y Query data.
-#' @param k Number of neighbors to return.
-#' @param distance Type of distance to calculate.
-#' @param verbose If TRUE, log messages to the console.
-#' @param n_threads Maximum number of threads to use.
-#' @param path path to write the index.
-#' @param control Controls for the HNSW algorithm
+#' @param x deduplication or reference data,
+#' @param y query data,
+#' @param k number of neighbours to return,
+#' @param distance type of distance to calculate,
+#' @param verbose if TRUE, log messages to the console,
+#' @param n_threads Maximum number of threads to use,
+#' @param path path to write the index,
+#' @param control controls for the HNSW algorithm.
 #'
 #' @description
-#' See details of [RcppHNSW::hnsw_build] and [RcppHNSW::hnsw_search]
+#' See details of [RcppHNSW::hnsw_build] and [RcppHNSW::hnsw_search].
 #'
 #'
 method_hnsw <- function(x,
18 changes: 9 additions & 9 deletions R/method_mlpack.R
@@ -3,17 +3,17 @@
 #' @importFrom mlpack knn
 #' @importFrom data.table data.table
 #'
-#' @title An internal function to use algorthms from the mlpack package.
+#' @title An internal function to use the LSH and KD-tree algorithm via the [mlpack] package.
 #' @author Maciej Beręsewicz
 #'
-#' @param x Deduplication or reference data.
-#' @param y Query data.
-#' @param algo Which algorithm should be used. Possible: \code{lsh} or \code{kd}.
-#' @param k Number of neighbors to return.
-#' @param verbose If TRUE, log messages to the console.
-#' @param seed seed for the pseudo-random numbers algorithm.
-#' @param path path to write the index.
-#' @param control controls for \code{lsh} or \code{kd}.
+#' @param x deduplication or reference data,
+#' @param y query data,
+#' @param algo which algorithm should be used: \code{lsh} or \code{kd},
+#' @param k number of neighbours to return,
+#' @param verbose if TRUE, log messages to the console,
+#' @param seed seed for the pseudo-random numbers algorithm,
+#' @param path path to write the index,
+#' @param control controls for the \code{lsh} or \code{kd} algorithms.
 #'
 #' @description
 #' See details of [mlpack::lsh] and [mlpack::knn]
19 changes: 9 additions & 10 deletions R/method_nnd.R
@@ -3,19 +3,19 @@
 #' @importFrom rnndescent rnnd_query
 #' @importFrom data.table data.table
 #'
-#' @title An internal function to use the nnd algorithm via rnndescent package
+#' @title An internal function to use the NN descent algorithm via the [rnndescent] package.
 #' @author Maciej Beręsewicz
 #'
-#' @param x Deduplication or reference data.
-#' @param y Query data.
-#' @param k Number of neighbours to return.
-#' @param distance Type of distance to calculate.
-#' @param verbose If TRUE, log messages to the console.
-#' @param n_threads Maximum number of threads to use.
-#' @param control Controls for the NND algorithm
+#' @param x deduplication or reference data,
+#' @param y query data,
+#' @param k number of neighbours to return,
+#' @param distance type of distance to calculate,
+#' @param verbose if TRUE, log messages to the console,
+#' @param n_threads maximum number of threads to use,
+#' @param control controls for the NN descent algorithm.
 #'
 #' @description
-#' See details of [rnndescent::rnnd_build] and [rnndescent::rnnd_query]
+#' See details of [rnndescent::rnnd_build] and [rnndescent::rnnd_query].
 #'
 #'
 method_nnd <- function(x,
@@ -78,6 +78,5 @@ method_nnd <- function(x,
 dist = l_1nn$dist[,k])


-
 l_df
 }
18 changes: 9 additions & 9 deletions R/reclin2_pair_ann.R
@@ -6,19 +6,19 @@
 #' @author Maciej Beręsewicz
 #'
 #' @description
-#' Function for the integration with the reclin2 package. The function is based on [reclin2::pair_minsim()] and reuses some of its source code.
+#' Function for the integration with the [reclin2] package. The function is based on [reclin2::pair_minsim()] and reuses some of its source code.
 #'
-#' @param x x
-#' @param y y
-#' @param on variables for ann search
-#' @param on_blocking blocking variables
-#' @param deduplication deduplication
-#' @param keep_block whether to keep block in the set
-#' @param add_xy whether to add x and y
+#' @param x reference data (a data.frame or data.table),
+#' @param y query data (a data.frame or data.table, default NULL),
+#' @param on a character vector with column names for the ANN search,
+#' @param on_blocking blocking variables (currently not supported),
+#' @param deduplication whether deduplication should be performed (default TRUE),
+#' @param keep_block whether to keep the block variable in the set,
+#' @param add_xy whether to add x and y,
 #' @param ... arguments passed to [blocking::blocking()] function.
 #'
 #'
-#' @returns Returns a [data.table] with two columns \code{.x} and \code{.y}. Columns \code{.x} and \code{.y} are row numbers from data.frames x and y respectively. This data.table is also of a class \code{pairs} which allows for integration witn the [reclin2::compare_pairs()] package.
+#' @returns Returns a [data.table] with two columns \code{.x} and \code{.y}. Columns \code{.x} and \code{.y} are row numbers from data.frames x and y respectively. Returning data.table is also of a class \code{pairs} which allows for integration with the [reclin2::compare_pairs()] package.
 #'
 #' @examples
 #'
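The documented return value, a table of row-number pairs `.x` and `.y`, can be illustrated with a small base R sketch for the deduplication case. Everything here is an assumption made for illustration: the `block` vector is made up, a plain `data.frame` is used instead of a `data.table`, and the `pairs` class and the reclin2 integration are left out. It shows only how block labels could be turned into candidate pairs, not how the package does it.

```r
# Made-up block labels for five records (e.g. as produced by a blocking step).
block <- c(1L, 1L, 2L, 2L, 2L)

# For each block, enumerate all unordered pairs of row numbers within it.
pairs_list <- lapply(split(seq_along(block), block), function(idx) {
  if (length(idx) < 2L) return(NULL)   # a singleton block yields no pairs
  p <- t(combn(idx, 2L))               # one row per pair
  data.frame(.x = p[, 1], .y = p[, 2])
})

pairs <- do.call(rbind, pairs_list)
rownames(pairs) <- NULL
pairs
```

Only records that share a block are compared, so five records yield four candidate pairs here instead of the ten produced by comparing every record with every other.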
34 changes: 17 additions & 17 deletions man/blocking.Rd

Some generated files are not rendered by default.