LinkOrgs: An R package for linking records on organizations using half-a-billion open-collaborated records from LinkedIn
Note: You can access a point-and-click implementation online here.
LinkOrgs is an R package for organizational record linkage that leverages half-a-billion open-collaborated records from LinkedIn. It provides multiple matching algorithms optimized for different use cases:
| Algorithm | Internet Required | ML-backend Required | Speed | Best For |
|---|---|---|---|---|
| `fuzzy` | No | No | Fast | Simple name matching |
| `bipartite` | Yes | No | Medium | Network-informed matching best for organizations having LinkedIn presence, ~2017 |
| `markov` | Yes | No | Medium | Network-informed matching best for organizations having LinkedIn presence, ~2017 |
| `ml` | Yes | Yes | Slower | High-accuracy semantic matching |
| `transfer` | Yes | Yes | Slower | Combined network + ML approach |
- Fuzzy matching (`algorithm="fuzzy"`): Fast parallelized string distance matching using Jaccard, Jaro-Winkler, or other string distances
- Network-based (`algorithm="bipartite"` or `"markov"`): Uses LinkedIn's organizational network structure for improved accuracy
- Machine learning (`algorithm="ml"`): Transformer-based embeddings (requires JAX backend setup via `BuildBackend()`)
- Combined (`algorithm="markov"` + `DistanceMeasure="ml"`): Network + ML hybrid approach
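To give a feel for what string-distance matching does, here is a minimal base-R sketch of Jaccard distance on character 2-grams. It is an illustration only, not the package's implementation: `char_ngrams()` and `jaccard_distance()` are hypothetical helpers written for this example, and the choice of 2-grams is an assumption.

```r
# Illustration only: Jaccard distance on character 2-grams in base R.
# char_ngrams() and jaccard_distance() are hypothetical helpers,
# not functions exported by LinkOrgs.
char_ngrams <- function(s, n = 2) {
  s <- tolower(s)
  if (nchar(s) < n) return(s)
  starts <- seq_len(nchar(s) - n + 1)
  unique(substring(s, starts, starts + n - 1))
}

jaccard_distance <- function(a, b, n = 2) {
  ga <- char_ngrams(a, n)
  gb <- char_ngrams(b, n)
  # 1 - |intersection| / |union| of the two n-gram sets
  1 - length(intersect(ga, gb)) / length(union(ga, gb))
}

x_names <- c("apple", "oracle", "enron inc.", "mcdonalds corporation")
y_names <- c("apple corp", "oracle inc", "enron", "mcdonalds")

# Pairwise distance matrix: rows are x names, columns are y names
dist_mat <- outer(x_names, y_names, Vectorize(jaccard_distance))
dimnames(dist_mat) <- list(x_names, y_names)

# Closest y name for each x name
best_match <- y_names[apply(dist_mat, 1, which.min)]
print(data.frame(x = x_names, best_y = best_match))
```

On these toy names, each entry's nearest neighbor under this distance is its intended counterpart. Real organizational names are much noisier, which is where the network-based and ML options come in.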
The most recent version of LinkOrgs can be installed directly from the repository using the devtools package:
# install package
devtools::install_github("cjerzak/LinkOrgs-software/LinkOrgs")
The machine-learning-based algorithm accessible via the algorithm="ml" option relies on JAX. The network-based linkage approaches (algorithm="bipartite" and algorithm="markov") do not require this backend. To set up the machine learning backend, you can call:
# install ML backend
LinkOrgs::BuildBackend(conda = "auto")
Note that most package options require Internet access in order to download the saved machine learning model parameters and LinkedIn-based network information.
library(LinkOrgs)
# Sample data
x <- data.frame(org = c("Apple Inc", "Microsoft Corp"))
y <- data.frame(org = c("Apple", "Microsoft Corporation"))
# Link organizations using fuzzy matching
result <- LinkOrgs(x = x, y = y, by.x = "org", by.y = "org",
algorithm = "fuzzy", AveMatchNumberPerAlias = 2)
print(result)
After installing the package, let's get some experience with it in an example.
# load in package
library(LinkOrgs)
# set up synthetic data for the merge
x_orgnames <- c("apple","oracle","enron inc.","mcdonalds corporation")
y_orgnames <- c("apple corp","oracle inc","enron","mcdonalds")
x <- data.frame("orgnames_x"=x_orgnames)
y <- data.frame("orgnames_y"=y_orgnames)
After creating these synthetic datasets, we're now ready to merge them. We can do this in a number of ways. See the paper listed in the references for information about which may be most useful for your merge task.
First, we'll try a merge using parallelized fast fuzzy matching via LinkOrgs::LinkOrgs. A key hyperparameter is AveMatchNumberPerAlias, which controls the average number of matches per alias. In practice, we calibrate this with an initial random sampling step, so the exact matched dataset size won't be a perfect multiple of AveMatchNumberPerAlias. Here, we set AveMatchNumberPerAlias = 4 so that all observations in this small dataset are potentially matched against all others for illustration purposes.
# perform merge using (parallelized) fast fuzzy matching
# LinkOrgs::LinkOrgs can be readily used for non-organizational name matches
# when doing pure parallelized fuzzy matching
z_linked_fuzzy <- LinkOrgs::LinkOrgs(x = x,
y = y,
by.x = "orgnames_x",
by.y = "orgnames_y",
algorithm = "fuzzy",
DistanceMeasure = "jaccard",
AveMatchNumberPerAlias = 4)
Next, we'll try some of the LinkedIn-calibrated approaches, again via LinkOrgs::LinkOrgs:
# perform merge using bipartite network approach
z_linked_bipartite <- LinkOrgs(x = x,
y = y,
by.x = "orgnames_x",
by.y = "orgnames_y",
AveMatchNumberPerAlias = 10,
algorithm = "bipartite",
DistanceMeasure = "jaccard")
# perform merge using markov network approach
z_linked_markov <- LinkOrgs(x = x,
y = y,
by.x = "orgnames_x",
by.y = "orgnames_y",
AveMatchNumberPerAlias = 10,
algorithm = "markov",
DistanceMeasure = "jaccard")
# Build backend for ML model (run once before using algorithm="ml")
# LinkOrgs::BuildBackend(conda_env = "LinkOrgs_env", conda = "auto")
# If conda = "auto" fails, specify the path explicitly:
# LinkOrgs::BuildBackend(conda_env = "LinkOrgs_env",
# conda = "/path/to/miniforge3/bin/python")
# perform merge using a machine learning approach
z_linked_ml <- LinkOrgs(x = x,
y = y,
by.x = "orgnames_x",
by.y = "orgnames_y",
AveMatchNumberPerAlias = 10,
algorithm = "ml", ml_version = "v1")
# note: use conda_env parameter to specify a different environment if needed
# note: ML versions v0-v4 are available with varying parameter counts (9M-31M). Default is v1.
# perform merge using combined network + machine learning approach
z_linked_combined <- LinkOrgs(x = x,
y = y,
by.x = "orgnames_x",
by.y = "orgnames_y",
AveMatchNumberPerAlias = 10,
AveMatchNumberPerAlias_network = 1,
algorithm = "markov",
DistanceMeasure = "ml", ml_version = "v1")
# Perform a merge using the ML approach, exporting name representations only
# Returns list(embedx = ..., embedy = ...) for manual linkage.
rep_joint <- LinkOrgs(
x = x, y = y,
by.x = "orgnames_x",
by.y = "orgnames_y",
algorithm = "ml",
ExportEmbeddingsOnly = TRUE
)
# returns list(embedx = ...)
rep_x <- LinkOrgs(
x = x, y = NULL,
by.x = "orgnames_x",
algorithm = "ml",
ExportEmbeddingsOnly = TRUE
)
# returns list(embedy = ...)
rep_y <- LinkOrgs(
x = NULL, y = y,
by.y = "orgnames_y",
algorithm = "ml",
ExportEmbeddingsOnly = TRUE)
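Once embeddings are exported, manual linkage amounts to a nearest-neighbor search between the two sets of vectors. The sketch below shows one way to do this with cosine similarity in base R. It uses synthetic matrices as stand-ins and assumes the exported objects can be coerced to numeric matrices with one row per name; that layout is an assumption, so please check the structure of `embedx` and `embedy` before adapting it. `cosine_similarity()` is a hypothetical helper, not a LinkOrgs function.

```r
# Sketch: nearest-neighbor linkage from two embedding matrices (base R only).
# ASSUMPTION: embeddings are numeric matrices with one row per name.
# Synthetic stand-ins are used here instead of real LinkOrgs output.
set.seed(1)
embed_x <- matrix(rnorm(4 * 8), nrow = 4)

# y embeddings: the x embeddings in shuffled order, plus small noise
perm <- c(3, 1, 4, 2)
embed_y <- embed_x[perm, ] + matrix(rnorm(4 * 8, sd = 0.01), nrow = 4)

# Hypothetical helper: cosine similarity between all rows of A and all rows of B
cosine_similarity <- function(A, B) {
  A_norm <- A / sqrt(rowSums(A^2))
  B_norm <- B / sqrt(rowSums(B^2))
  A_norm %*% t(B_norm)
}

sim <- cosine_similarity(embed_x, embed_y)

# For each row of x, the index of the most similar row of y
nearest_y <- unname(apply(sim, 1, which.max))
print(data.frame(x_row = seq_len(nrow(embed_x)), nearest_y_row = nearest_y))
```

A similarity threshold, or keeping the top few candidates per row rather than only the best one, would be natural refinements for real data.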
Using the package, we can also assess performance against a ground-truth merged dataset (if available):
# (After running the above code)
z_true <- data.frame("orgnames_x"=x_orgnames, "orgnames_y"=y_orgnames)
# Get performance matrix
PerformanceMatrix <- AssessMatchPerformance(x = x,
y = y,
by.x = "orgnames_x",
by.y = "orgnames_y",
z = z_linked_fuzzy,
z_true = z_true)
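For intuition about what a comparison with ground truth involves, the following base-R sketch computes precision and recall by hand for a toy linked dataset. It is not the internals of AssessMatchPerformance, whose reported quantities may differ. The data frames `z_hat_demo` and `z_true_demo` and the helper `pair_key()` are hypothetical, and the sketch assumes both data frames carry the `orgnames_x` and `orgnames_y` columns.

```r
# Sketch: precision and recall of a linked dataset against ground truth.
# z_hat_demo, z_true_demo, and pair_key() are hypothetical illustrations.
z_hat_demo <- data.frame(
  orgnames_x = c("apple", "oracle", "enron inc.", "apple"),
  orgnames_y = c("apple corp", "oracle inc", "enron", "oracle inc"),
  stringsAsFactors = FALSE
)
z_true_demo <- data.frame(
  orgnames_x = c("apple", "oracle", "enron inc.", "mcdonalds corporation"),
  orgnames_y = c("apple corp", "oracle inc", "enron", "mcdonalds"),
  stringsAsFactors = FALSE
)

# Represent each (x, y) pair as a single string key
pair_key <- function(d) paste(d$orgnames_x, d$orgnames_y, sep = " || ")

true_positives  <- sum(pair_key(z_hat_demo) %in% pair_key(z_true_demo))
false_positives <- nrow(z_hat_demo) - true_positives
false_negatives <- sum(!(pair_key(z_true_demo) %in% pair_key(z_hat_demo)))

precision <- true_positives / (true_positives + false_positives)
recall    <- true_positives / (true_positives + false_negatives)
print(c(precision = precision, recall = recall))
```

Here three of the four proposed pairs are correct and one true pair is missed, so both precision and recall come out to 0.75.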
We're always looking to improve the software in terms of ease-of-use and its capabilities. If you have any suggestions/feedback, or need further assistance in getting the package working for your analysis, please email connor.jerzak@gmail.com.
In future releases, we will be expanding the merge capabilities (currently, we only allow inner joins [equivalent to setting all = F in the merge function from base R]; future releases will allow more complex inner, left, right, and outer joins).
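Until then, one possible workaround is to rebuild a left join yourself with base R's `merge`. The sketch below uses a hypothetical linked result, `z_demo`, and assumes the linked output retains the `by.x` column under its original name; that is an assumption about the output format, so check your own result first.

```r
# Sketch: recovering a left join from an inner-join result with base R.
# x_demo and z_demo are hypothetical stand-ins for the input data and
# the linked (inner-join) output.
x_demo <- data.frame(
  orgnames_x = c("apple", "oracle", "enron inc.", "mcdonalds corporation"),
  stringsAsFactors = FALSE
)
z_demo <- data.frame(
  orgnames_x = c("apple", "oracle"),
  orgnames_y = c("apple corp", "oracle inc"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every row of x_demo; unmatched rows get NA in orgnames_y
z_left <- merge(x_demo, z_demo, by = "orgnames_x", all.x = TRUE)

# Names in x that found no match in the linked output
unmatched_x <- z_left$orgnames_x[is.na(z_left$orgnames_y)]
print(z_left)
print(unmatched_x)
```

The same pattern with `all.y = TRUE` or `all = TRUE` gives right and full outer joins, respectively.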
We thank Beniamino Green, Kosuke Imai, Gary King, Xiang Zhou, and members of the Imai Research Workshop for valuable feedback. We would also like to thank Gil Tamir and Xiaolong Yang for excellent research assistance.
Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0). This package is for academic and non-commercial use only.
Brian Libgober, Connor T. Jerzak. "Linking Datasets on Organizations Using Half a Billion Open-Collaborated Records." Political Science Research and Methods, 2024. [PDF] [Dataverse] [Hugging Face]
@article{libgober2024linking,
title={Linking Datasets on Organizations Using Half a Billion Open-Collaborated Records},
author={Libgober, Brian and Jerzak, Connor T.},
journal={Political Science Research and Methods},
year={2024},
pages={},
publisher={Cambridge University Press}
}
Green, Beniamino. "Zoomerjoin: Superlatively-Fast Fuzzy Joins." Journal of Open Source Software 8:89 5693-5698, 2023. [PDF]