-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
135 lines (98 loc) · 4.63 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
---
output: github_document
---
<!-- badges: start -->
[![R-CMD-check](https://github.com/ncn-foreigners/blocking/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/ncn-foreigners/blocking/actions/workflows/R-CMD-check.yaml)
[![test-coverage](https://github.com/ncn-foreigners/blocking/actions/workflows/test-coverage.yaml/badge.svg)](https://github.com/ncn-foreigners/blocking/actions/workflows/test-coverage.yaml)
[![Codecov test coverage](https://codecov.io/gh/ncn-foreigners/blocking/branch/main/graph/badge.svg)](https://app.codecov.io/gh/ncn-foreigners/blocking?branch=main)
<!-- badges: end -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# Overview
## Warning!
The package is still being developed, so the API and features may change.
## Description
This R package is designed to block records for data deduplication and record linkage (also known as entity resolution) using [approximate nearest neighbours algorithms (ANN)](https://en.wikipedia.org/wiki/Nearest_neighbor_search) and graphs (via the `igraph` package).
It supports the following R packages that bind to specific ANN algorithms:
+ [rnndescent](https://cran.r-project.org/package=rnndescent) (default, very powerful, supports sparse matrices),
+ [RcppHNSW](https://cran.r-project.org/package=RcppHNSW) (powerful but does not support sparse matrices),
+ [RcppAnnoy](https://cran.r-project.org/package=RcppAnnoy),
+ [mlpack](https://cran.r-project.org/package=RcppAnnoy) (see `mlpack::lsh` and `mlpack::knn`).
The package can be used with the [reclin2](https://cran.r-project.org/package=reclin2) package via the `blocking::pair_ann` function.
## Funding
Work on this package is supported by the National Science Centre, OPUS 22 grant no. 2020/39/B/HS4/00941.
## Installation
Install the GitHub blocking package with:
```{r, eval=FALSE}
# install.packages("remotes") # uncomment if needed
remotes::install_github("ncn-foreigners/blocking")
```
## Basic usage
Load packages for the examples
```{r}
library(blocking)
library(reclin2)
```
Generate simple data with three groups (`df_example`) and reference data (`df_base`).
```{r}
df_example <- data.frame(txt = c(
"jankowalski",
"kowalskijan",
"kowalskimjan",
"kowaljan",
"montypython",
"pythonmonty",
"cyrkmontypython",
"monty"
))
df_base <- data.frame(txt = c("montypython", "kowalskijan", "other"))
df_example
df_base
```
Deduplication using the `blocking` function. Output contains information:
+ the method used (where `nnd` which refers to the NN descent algorithm),
+ number of blocks created (here 2 blocks),
+ number of columns used for blocking, i.e. how many shingles were created by `text2vec` package (here 28),
+ reduction ratio, i.e. how large is the reduction of comparison pairs (here 0.5714 which means blocking reduces comparison by over 57%).
```{r}
blocking_result <- blocking(x = df_example$txt)
blocking_result
```
Table with blocking results contains:
+ row numbers from the original data,
+ block number (integers),
+ distance (from the ANN algorithm).
```{r}
blocking_result$result
```
Deduplication using the `pair_ann` function for integration with the `reclin2` package. Use the pipeline with the `reclin2` package.
```{r}
pair_ann(x = df_example, on = "txt") |>
compare_pairs(on = "txt", comparators = list(cmp_jarowinkler())) |>
score_simple("score", on = "txt") |>
select_threshold("threshold", score = "score", threshold = 0.55) |>
link(selection = "threshold")
```
Linking records using the same function where `df_base` is the "register" and `df_example` is the reference (data).
```{r}
pair_ann(x = df_base, y = df_example, on = "txt", deduplication = FALSE) |>
compare_pairs(on = "txt", comparators = list(cmp_jarowinkler())) |>
score_simple("score", on = "txt") |>
select_threshold("threshold", score = "score", threshold = 0.55) |>
link(selection = "threshold")
```
## See also
See section `Data Integration (Statistical Matching and Record Linkage)` in [the Official Statistics Task View](https://cran.r-project.org/web/views/OfficialStatistics.html).
Packages that allow blocking:
+ [klsh](https://CRAN.R-project.org/package=klsh) -- k-means locality sensitive hashing,
+ [reclin2](https://CRAN.R-project.org/package=reclin2) -- `pair_blocking`, `pari_minsim` functions,
+ [fastLink](https://CRAN.R-project.org/package=fastLink) -- `blockData` function.
Other:
+ [clevr](https://CRAN.R-project.org/package=clevr) -- evaluation of clustering, helper functions.
+ [exchanger](https://github.com/cleanzr/exchanger) -- bayesian Entity Resolution with Exchangeable Random Partition Priors