Commit

update of package
BERENZ committed Dec 4, 2023
1 parent dd2827f commit 2f09289
Showing 20 changed files with 140 additions and 110 deletions.
2 changes: 1 addition & 1 deletion .Rproj.user/E3DB6272/pcs/files-pane.pper
@@ -9,5 +9,5 @@
"ascending": false
}
],
"path": "~/git/nauka/ncn-foreigners/software/blocking/R"
"path": "~/git/nauka/ncn-foreigners/software/blocking/inst/tinytest"
}
2 changes: 1 addition & 1 deletion .Rproj.user/E3DB6272/pcs/source-pane.pper
@@ -1,4 +1,4 @@
{
"activeTab": 3,
"activeTab": 2,
"activeTabSourceWindow0": 0
}
2 changes: 1 addition & 1 deletion .Rproj.user/E3DB6272/pcs/workbench-pane.pper
@@ -1,5 +1,5 @@
{
"TabSet1": 3,
"TabSet2": 0,
"TabSet2": 3,
"TabZoom": {}
}
2 changes: 1 addition & 1 deletion .Rproj.user/E3DB6272/sources/prop/B8117F7C
@@ -1,6 +1,6 @@
{
"source_window_id": "",
"Source": "Source",
"cursorPosition": "19,11",
"cursorPosition": "17,18",
"scrollLine": "0"
}
4 changes: 2 additions & 2 deletions DESCRIPTION
@@ -1,14 +1,14 @@
Package: blocking
Type: Package
Title: Blocking records for record linkage / data deduplication
Title: Blocking records for record linkage / entity resolution
Version: 0.1.0
Authors@R:
c(person(given = "Maciej",
family = "Beręsewicz",
role = c("aut", "cre"),
email = "maciej.beresewicz@ue.poznan.pl",
comment = c(ORCID = "0000-0002-8281-4301")))
Description: A small R package that uses various approximate nearest neighbours algorithms to block records for data deduplication / record linkage / entity resolution.
Description: An R package that uses various approximate nearest neighbours algorithms and graphs to block records for data deduplication / record linkage / entity resolution.
License: GPL-3
Encoding: UTF-8
LazyData: true
16 changes: 5 additions & 11 deletions R/blocking.R
@@ -17,7 +17,7 @@
#' @author Maciej Beręsewicz
#'
#' @description
#' Function creates shingles (strings with 2 characters, default), applies approximate nearest neighbour (ANN) algorithms via the [rnndescent], [RcppHNSW], [RcppAnnoy] and [mlpack] packages,
#' Function creates shingles (strings with 2 characters, default), applies approximate nearest neighbour (ANN) algorithms via the [rnndescent], RcppHNSW, [RcppAnnoy] and [mlpack] packages,
#' and creates blocks using graphs via [igraph].
#'
#' @param x reference data (a character vector or a matrix),
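For readers unfamiliar with the term, the 2-character shingles mentioned in the description above can be illustrated with a few lines of base R. This is only a sketch: `make_shingles` is a hypothetical helper, and the package itself appears to build its shingles with the `text2vec` package (per the README), not with this code.

```r
# Hypothetical helper (not part of the package): split a single string
# into overlapping character n-grams ("shingles"), n = 2 by default.
make_shingles <- function(txt, n = 2L) {
  chars <- strsplit(txt, "")[[1]]
  # Strings shorter than n produce no shingles.
  if (length(chars) < n) {
    return(character(0))
  }
  starts <- seq_len(length(chars) - n + 1L)
  vapply(
    starts,
    function(i) paste(chars[i:(i + n - 1L)], collapse = ""),
    character(1)
  )
}

make_shingles("monty")
# "mo" "on" "nt" "ty"
```

Each distinct shingle would then become one column of the document-term matrix that is passed to the ANN search.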
@@ -58,14 +58,6 @@
#'
#' result
#'
#' ## an example using RcppAnnoy
#'
#' result_annoy <- blocking(x = df_example$txt,
#' ann = "annoy",
#' distance = "angular")
#'
#' result_annoy
#'
#' ## an example using mlpack::lsh
#'
#' result_lsh <- blocking(x = df_example$txt,
@@ -100,7 +92,9 @@ blocking <- function(x,
"lsh" = NULL,
"kd" = NULL)

stopifnot("Only character or matrix x is supported" = is.character(x) | is.matrix(x))
stopifnot("Only character, dense or sparse (dgCMatrix) matrix x is supported" =
is.character(x) | is.matrix(x) | inherits(x, "Matrix"))

if (!is.null(ann_write)) {
stopifnot("Path provided in the `ann_write` is incorrect" = file.exists(ann_write) )
}
@@ -144,7 +138,7 @@ blocking <- function(x,
}

## add verification if x and y is a sparse matrix
if (is.matrix(x)) {
if (is.matrix(x) | inherits(x, "Matrix")) {
l_dtm <- x
l_dtm_y <- y
} else {
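The widened input check in the hunks above can be tried in isolation. The sketch below wraps the same condition in a hypothetical helper, `check_input` (not part of the package), and assumes R >= 4.0.0, where `stopifnot()` accepts named conditions as error messages. It uses `||` instead of `|` because each operand is a single logical value.

```r
# Hypothetical helper mirroring the condition added in this commit.
check_input <- function(x) {
  stopifnot(
    "Only character, dense or sparse (dgCMatrix) matrix x is supported" =
      is.character(x) || is.matrix(x) || inherits(x, "Matrix")
  )
  invisible(TRUE)
}

check_input(c("montypython", "kowalskijan"))  # character vector passes
check_input(matrix(0, nrow = 2, ncol = 2))    # base dense matrix passes

# Sparse input requires the Matrix package; the guard keeps the sketch
# runnable when it is not installed.
if (requireNamespace("Matrix", quietly = TRUE)) {
  check_input(Matrix::Matrix(c(0, 1, 0, 0), nrow = 2, sparse = TRUE))
}

# Any other type fails, and the name of the condition becomes the message.
res <- tryCatch(check_input(list(1, 2)),
                error = function(e) conditionMessage(e))
res
```

One thing worth noting for review: `inherits(x, "Matrix")` is true for every class from the Matrix package, dense ones included, so the check is somewhat broader than the `dgCMatrix` named in the message.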
2 changes: 1 addition & 1 deletion R/method_hnsw.R
@@ -9,7 +9,7 @@
#' @importFrom utils setTxtProgressBar
#' @importFrom utils txtProgressBar
#'
#' @title An internal function to use HNSW algorithm via the [RcppHNSW] package.
#' @title An internal function to use HNSW algorithm via the RcppHNSW package.
#' @author Maciej Beręsewicz
#'
#' @param x deduplication or reference data,
6 changes: 3 additions & 3 deletions R/reclin2_pair_ann.R
@@ -6,10 +6,10 @@
#' @author Maciej Beręsewicz
#'
#' @description
#' Function for the integration with the [reclin2] package. The function is based on [reclin2::pair_minsim()] and reuses some of its source code.
#' Function for the integration with the reclin2 package. The function is based on [reclin2::pair_minsim()] and reuses some of its source code.
#'
#' @param x reference data (a data.frame or data.table),
#' @param y query data (a data.frame or data.table, default NULL),
#' @param x reference data (a data.frame or a data.table),
#' @param y query data (a data.frame or a data.table, default NULL),
#' @param on a character vector with column names for the ANN search,
#' @param on_blocking blocking variables (currently not supported),
#' @param deduplication whether deduplication should be performed (default TRUE),
29 changes: 18 additions & 11 deletions README.Rmd
@@ -20,12 +20,12 @@ knitr::opts_chunk$set(

## Description

A small package used to block records for data deduplication and record linkage (entity resolution) based on [approximate nearest neighbours algorithms (ANN)](https://en.wikipedia.org/wiki/Nearest_neighbor_search) and graphs (via `igraph`).
An R package that aims to block records for data deduplication and record linkage (a.k.a. entity resolution) based on [approximate nearest neighbours algorithms (ANN)](https://en.wikipedia.org/wiki/Nearest_neighbor_search) and graphs (via the `igraph` package).

Currently supports the following R packages that binds to specific ANN algorithms
Currently supports the following R packages that binds to specific ANN algorithms:

+ [rnndescent](https://cran.r-project.org/package=rnndescent) (default),
+ [RcppHNSW](https://cran.r-project.org/package=RcppHNSW),
+ [rnndescent](https://cran.r-project.org/package=rnndescent) (default, very powerful, supports sparse matrices),
+ [RcppHNSW](https://cran.r-project.org/package=RcppHNSW) (powerful but does not support sparse matrices),
+ [RcppAnnoy](https://cran.r-project.org/package=RcppAnnoy),
+ [mlpack](https://cran.r-project.org/package=RcppAnnoy) (see `mlpack::lsh` and `mlpack::knn`).

@@ -37,7 +37,7 @@ Work on this package is supported by the National Science Centre, OPUS 22 grant

## Installation

You can install the development version of `blocking` from GitHub with:
You can install the development version of the `blocking` package from GitHub with:

```{r, eval=FALSE}
# install.packages("remotes") # uncomment if needed
@@ -53,7 +53,7 @@ library(blocking)
library(reclin2)
```

Generate simple data with two groups.
Generate simple data with two groups (`df_example`) and reference data (`df_base`).

```{r}
df_example <- data.frame(txt = c(
@@ -73,22 +73,29 @@ df_example
df_base
```

Deduplication using blocking
Deduplication using `blocking` function. Output contains information about:
+ the method used (where `nnd` which refers to the NN descent algorithm),
+ number of blocks created (here 2 blocks),
+ number of columns used for blocking, i.e. how many shingles were created by `text2vec` package (here 28),
+ reduction ratio, i.e. how large is the reduction of comparison pairs (here 0.5714 which means blocking reduces comparison by over 57%).
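
The reduction ratio quoted in the list above can be reproduced by hand under the usual definition, one minus the share of candidate pairs that remain after blocking. The sketch below is an assumption-laden check, not the package's code: `reduction_ratio` is a hypothetical helper, and the block sizes (eight records split into two blocks of four) are assumed rather than shown in this diff.

```r
# Hypothetical helper: reduction ratio for deduplication, given block sizes.
reduction_ratio <- function(block_sizes) {
  n <- sum(block_sizes)
  pairs_all <- choose(n, 2)                     # all pairs without blocking
  pairs_blocked <- sum(choose(block_sizes, 2))  # pairs compared within blocks
  1 - pairs_blocked / pairs_all
}

# Assumed layout: 8 records in 2 blocks of 4 -> 12 of 28 pairs remain.
rr <- reduction_ratio(c(4, 4))
round(rr, 4)
# 0.5714
```

Under these assumptions, 16 of the 28 possible comparisons are avoided, which matches the 0.5714 reported in the README.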

```{r}
blocking_result <- blocking(x = df_example$txt)
## data frame with indices and block
blocking_result
```

Table with blocking
Table with blocking which contains:

+ row numbers from the original data
+ block number (integers),
+ distance (from the ANN algorithm).

```{r}
blocking_result$result
```

Deduplication followed by the `reclin2` package

Deduplication using the `pair_ann` function for integration with the `reclin2` package. Here I use the pipeline that can be used with the `reclin2` package.

```{r}
pair_ann(x = df_example, on = "txt") |>
@@ -97,7 +104,7 @@ pair_ann(x = df_example, on = "txt") |>
select_threshold("threshold", score = "score", threshold = 0.55) |>
link(selection = "threshold")
```
Record linkage
Record linkage using the same function where `df_base` is the "register" and `df_example` is the reference (query data).

```{r}
pair_ann(x = df_base, y = df_example, on = "txt", deduplication = FALSE) |>
44 changes: 31 additions & 13 deletions README.md
@@ -11,16 +11,19 @@ coverage](https://codecov.io/gh/ncn-foreigners/blocking/branch/main/graph/badge.

## Description

A small package used to block records for data deduplication and record
linkage (entity resolution) based on [approximate nearest neighbours
algorithms (ANN)](https://en.wikipedia.org/wiki/Nearest_neighbor_search)
and graphs (via `igraph`).
An R package that aims to block records for data deduplication and
record linkage (a.k.a. entity resolution) based on [approximate nearest
neighbours algorithms
(ANN)](https://en.wikipedia.org/wiki/Nearest_neighbor_search) and graphs
(via the `igraph` package).

Currently supports the following R packages that binds to specific ANN
algorithms
algorithms:

- [rnndescent](https://cran.r-project.org/package=rnndescent) (default),
- [RcppHNSW](https://cran.r-project.org/package=RcppHNSW),
- [rnndescent](https://cran.r-project.org/package=rnndescent) (default,
very powerful, supports sparse matrices),
- [RcppHNSW](https://cran.r-project.org/package=RcppHNSW) (powerful but
does not support sparse matrices),
- [RcppAnnoy](https://cran.r-project.org/package=RcppAnnoy),
- [mlpack](https://cran.r-project.org/package=RcppAnnoy) (see
`mlpack::lsh` and `mlpack::knn`).
@@ -36,7 +39,8 @@ Work on this package is supported by the National Science Centre, OPUS

## Installation

You can install the development version of `blocking` from GitHub with:
You can install the development version of the `blocking` package from
GitHub with:

``` r
# install.packages("remotes") # uncomment if needed
@@ -58,7 +62,8 @@ library(reclin2)
#> identical
```

Generate simple data with two groups.
Generate simple data with two groups (`df_example`) and reference data
(`df_base`).

``` r
df_example <- data.frame(txt = c(
@@ -90,7 +95,13 @@ df_base
#> 2 kowalskijan
```

Deduplication using blocking
Deduplication using `blocking` function. Output contains information
about: + the method used (where `nnd` which refers to the NN descent
algorithm), + number of blocks created (here 2 blocks), + number of
columns used for blocking, i.e. how many shingles were created by
`text2vec` package (here 28), + reduction ratio, i.e. how large is the
reduction of comparison pairs (here 0.5714 which means blocking reduces
comparison by over 57%).

``` r
blocking_result <- blocking(x = df_example$txt)
@@ -107,7 +118,11 @@ blocking_result
#> 2
```

Table with blocking
Table with blocking which contains:

- row numbers from the original data
- block number (integers),
- distance (from the ANN algorithm).

``` r
blocking_result$result
@@ -123,7 +138,9 @@ blocking_result$result
#> 8: 6 5 2 0.08333336
```

Deduplication followed by the `reclin2` package
Deduplication using the `pair_ann` function for integration with the
`reclin2` package. Here I use the pipeline that can be used with the
`reclin2` package.

``` r
pair_ann(x = df_example, on = "txt") |>
@@ -148,7 +165,8 @@ pair_ann(x = df_example, on = "txt") |>
#> 10: 8 6 pythonmonty monty
```

Record linkage
Record linkage using the same function where `df_base` is the “register”
and `df_example` is the reference (query data).

``` r
pair_ann(x = df_base, y = df_example, on = "txt", deduplication = FALSE) |>
5 changes: 5 additions & 0 deletions inst/tinytest/test_blocking.R
@@ -74,3 +74,8 @@ expect_silent(
)


## printing

expect_silent(
print(blocking(x = df_example$txt))
)
13 changes: 13 additions & 0 deletions inst/tinytest/test_hnsw.R
@@ -108,3 +108,16 @@ expect_stdout(
)


### checks sparse data

expect_silent(
blocking(x = df_example$txt,
ann = "hnsw",
control_ann = controls_ann(sparse=TRUE))
)

expect_silent(
blocking(x = Matrix::Matrix(mat_y),
ann = "hnsw",
control_ann = controls_ann(sparse=TRUE))
)
19 changes: 6 additions & 13 deletions man/blocking.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

12 changes: 6 additions & 6 deletions man/controls_ann.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion man/controls_txt.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
