`LinkOrgs`: An R package for linking records on organizations using half-a-billion open-collaborated records from LinkedIn

Note: You can access a point-and-click implementation online here.

What is LinkOrgs?

LinkOrgs is an R package for organizational record linkage that leverages half-a-billion open-collaborated records from LinkedIn. It provides multiple matching algorithms optimized for different use cases:

| Algorithm | Internet Required | ML-backend Required | Speed | Best For |
|---|---|---|---|---|
| `fuzzy` | No | No | Fast | Simple name matching |
| `bipartite` | Yes | No | Medium | Network-informed matching; best for organizations with a LinkedIn presence (circa 2017) |
| `markov` | Yes | No | Medium | Network-informed matching; best for organizations with a LinkedIn presence (circa 2017) |
| `ml` | Yes | Yes | Slower | High-accuracy semantic matching |
| `transfer` | Yes | Yes | Slower | Combined network + ML approach |

Fuzzy matching (algorithm="fuzzy"): Fast parallelized string distance matching using Jaccard, Jaro-Winkler, or other string distances
Network-based (algorithm="bipartite" or "markov"): Uses LinkedIn's organizational network structure for improved accuracy
Machine learning (algorithm="ml"): Transformer-based embeddings (requires JAX backend setup via BuildBackend())
Combined (algorithm="markov" + DistanceMeasure="ml"): Network + ML hybrid approach
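
To make the fuzzy option more concrete, here is a minimal base-R sketch of Jaccard distance computed on character bigrams. It is an illustration of the general technique only, not the package's internal implementation; the helper names char_ngrams and jaccard_distance are hypothetical and do not come from LinkOrgs.

```r
# Illustrative only: Jaccard distance on character bigrams.
# char_ngrams() and jaccard_distance() are hypothetical helpers, not LinkOrgs functions.
char_ngrams <- function(s, n = 2) {
  s <- tolower(s)
  if (nchar(s) < n) return(s)
  starts <- seq_len(nchar(s) - n + 1)
  unique(substring(s, starts, starts + n - 1))
}

jaccard_distance <- function(a, b, n = 2) {
  ga <- char_ngrams(a, n)
  gb <- char_ngrams(b, n)
  # 1 minus (shared n-grams / all distinct n-grams)
  1 - length(intersect(ga, gb)) / length(union(ga, gb))
}

x_names <- c("apple", "oracle", "enron inc.", "mcdonalds corporation")
y_names <- c("apple corp", "oracle inc", "enron", "mcdonalds")

# Distance matrix: rows index x_names, columns index y_names
dist_mat <- outer(x_names, y_names, Vectorize(jaccard_distance))
dimnames(dist_mat) <- list(x_names, y_names)

# Closest y name for each x name
best_match <- y_names[apply(dist_mat, 1, which.min)]
data.frame(x = x_names, best_y = best_match, distance = apply(dist_mat, 1, min))
```

Comparing every pair this way scales with the product of the two dataset sizes, which is one reason a parallelized implementation is useful for larger merges.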

Installation

The most recent version of LinkOrgs can be installed directly from the repository using the devtools package:

# install package 
devtools::install_github("cjerzak/LinkOrgs-software/LinkOrgs")

The machine-learning-based algorithm, accessible via the algorithm="ml" option, relies on JAX. The network-based linkage approaches (algorithm="bipartite" and algorithm="markov") do not require it. To set up the machine learning backend, you can call

# install ML backend  
LinkOrgs::BuildBackend(conda = "auto")

Note that most package options require Internet access in order to download the saved machine learning model parameters and LinkedIn-based network information.

Quick Start

library(LinkOrgs)

# Sample data
x <- data.frame(org = c("Apple Inc", "Microsoft Corp"))
y <- data.frame(org = c("Apple", "Microsoft Corporation"))

# Link organizations using fuzzy matching
result <- LinkOrgs(x = x, y = y, by.x = "org", by.y = "org",
                   algorithm = "fuzzy", AveMatchNumberPerAlias = 2)
print(result)

Tutorial

After installing the package, let's get some experience with it in an example.

# load in package 
library(LinkOrgs)

# set up synthetic data for the merge 
x_orgnames <- c("apple","oracle","enron inc.","mcdonalds corporation")
y_orgnames <- c("apple corp","oracle inc","enron","mcdonalds")
x <- data.frame("orgnames_x"=x_orgnames)
y <- data.frame("orgnames_y"=y_orgnames)

After creating these synthetic datasets, we're ready to merge them. We can do this in a number of ways. See the paper listed under References for guidance on which approach may be most useful for your merge task.

First, we'll try a merge using parallelized fast fuzzy matching via LinkOrgs::LinkOrgs. A key hyperparameter is AveMatchNumberPerAlias, which controls the average number of matches per alias. In practice, this is calibrated with an initial random sampling step, so the size of the matched dataset won't be an exact multiple of AveMatchNumberPerAlias. Here, we set AveMatchNumberPerAlias = 4 so that, for illustration purposes, every observation in this small dataset can potentially be matched against all others.

# perform merge using (parallelized) fast fuzzy matching
# LinkOrgs::LinkOrgs can be readily used for non-organizational name matches 
# when doing pure parallelized fuzzy matching 
z_linked_fuzzy <- LinkOrgs::LinkOrgs(x  = x,
                        y =  y,
                        by.x = "orgnames_x",
                        by.y = "orgnames_y",
                        algorithm = "fuzzy", 
                        DistanceMeasure = "jaccard", 
                        AveMatchNumberPerAlias = 4)

Next, we'll try using some of the LinkedIn-calibrated approaches using LinkOrgs::LinkOrgs:

# perform merge using bipartite network approach
z_linked_bipartite <- LinkOrgs(x  = x, 
                     y =  y, 
                     by.x = "orgnames_x", 
                     by.y = "orgnames_y",
                     AveMatchNumberPerAlias = 10,
                     algorithm = "bipartite", 
                     DistanceMeasure = "jaccard")
                     
# perform merge using markov network approach
z_linked_markov <- LinkOrgs(x  = x, 
                     y =  y, 
                     by.x = "orgnames_x", 
                     by.y = "orgnames_y",
                     AveMatchNumberPerAlias = 10,
                     algorithm = "markov", 
                     DistanceMeasure = "jaccard")


# Build backend for ML model (run once before using algorithm="ml")
# LinkOrgs::BuildBackend(conda_env = "LinkOrgs_env", conda = "auto")
# If conda = "auto" fails, specify the path explicitly:
# LinkOrgs::BuildBackend(conda_env = "LinkOrgs_env",
#                        conda = "/path/to/miniforge3/bin/python")
                     
# perform merge using a machine learning approach
z_linked_ml <- LinkOrgs(x  = x, 
                     y =  y, 
                     by.x = "orgnames_x", 
                     by.y = "orgnames_y",
                     AveMatchNumberPerAlias = 10, 
                     algorithm = "ml", ml_version = "v1")
# note: use conda_env parameter to specify a different environment if needed
# note: ML versions v0-v4 are available with varying parameter counts (9M-31M). Default is v1.

# perform merge using combined network + machine learning approach
z_linked_combined <- LinkOrgs(x  = x, 
                     y =  y, 
                     by.x = "orgnames_x", 
                     by.y = "orgnames_y",
                     AveMatchNumberPerAlias = 10, 
                     AveMatchNumberPerAlias_network = 1, 
                     algorithm = "markov",
                     DistanceMeasure = "ml", ml_version = "v1")

# Perform a merge using the ML approach, exporting name representations only
# Returns list(embedx = ..., embedy = ...) for manual linkage.
rep_joint <- LinkOrgs(
  x = x, y = y,
  by.x = "orgnames_x",
  by.y = "orgnames_y",
  algorithm = "ml",
  ExportEmbeddingsOnly = TRUE
)

# returns list(embedx = ...)
rep_x <- LinkOrgs(
  x = x, y = NULL,
  by.x = "orgnames_x",
  algorithm = "ml",
  ExportEmbeddingsOnly = TRUE
)

# returns list(embedy = ...)
rep_y <- LinkOrgs(
  x = NULL, y = y,
  by.y = "orgnames_y",
  algorithm = "ml",
  ExportEmbeddingsOnly = TRUE
)
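
Once embeddings are exported, manual linkage could proceed roughly as in the sketch below, which pairs each row of one embedding matrix with its most similar row in the other by cosine similarity. This assumes that embedx and embedy are numeric matrices with one row per name and the same number of columns, which the text above does not confirm; the matrices here are random stand-ins, not real model output.

```r
# Hypothetical stand-ins for rep_joint$embedx and rep_joint$embedy.
# Assumption: each is a numeric matrix with one row per organization name.
set.seed(123)
embedx <- matrix(rnorm(4 * 8), nrow = 4)
# Build embedy as a shuffled, slightly perturbed copy so the true pairing is known
embedy <- embedx[c(2, 1, 4, 3), ] + matrix(rnorm(4 * 8, sd = 0.05), nrow = 4)

# Scale each row to unit length so the cross-product gives cosine similarity
normalize_rows <- function(m) m / sqrt(rowSums(m^2))
cos_sim <- normalize_rows(embedx) %*% t(normalize_rows(embedy))

# For each row of embedx, index of the most similar row of embedy
best_y <- apply(cos_sim, 1, which.max)
data.frame(x_row = seq_len(nrow(embedx)),
           y_row = best_y,
           similarity = cos_sim[cbind(seq_len(nrow(embedx)), best_y)])
```

With real embeddings, a similarity cutoff would usually be applied as well, so that names with no good counterpart are left unmatched rather than forced onto their nearest neighbor.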

Comparison of Results with Ground Truth

Using the package, we can also assess performance against a ground-truth merged dataset (if available):

# (After running the above code)
z_true <- data.frame("orgnames_x"=x_orgnames, "orgnames_y"=y_orgnames)

# Get performance matrix 
PerformanceMatrix <- AssessMatchPerformance(x  = x, 
                                            y =  y, 
                                            by.x = "orgnames_x", 
                                            by.y = "orgnames_y", 
                                            z = z_linked_fuzzy, 
                                            z_true = z_true)
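
For intuition about what such an assessment involves, the following base-R sketch computes precision, recall, and F1 by comparing linked pairs with ground-truth pairs. It is not the implementation of AssessMatchPerformance, whose output format is not described here; z_hat is a made-up linkage result and pair_key is a hypothetical helper.

```r
# Ground-truth pairs, using the same column names as the tutorial
z_true <- data.frame(
  orgnames_x = c("apple", "oracle", "enron inc.", "mcdonalds corporation"),
  orgnames_y = c("apple corp", "oracle inc", "enron", "mcdonalds")
)
# Made-up linkage output: three correct pairs and one incorrect pair
z_hat <- data.frame(
  orgnames_x = c("apple", "oracle", "enron inc.", "apple"),
  orgnames_y = c("apple corp", "oracle inc", "enron", "oracle inc")
)

# Represent each pair as a single key so pairs can be compared as sets
pair_key <- function(d) unique(paste(d$orgnames_x, d$orgnames_y, sep = " || "))
true_keys <- pair_key(z_true)
hat_keys <- pair_key(z_hat)

true_pos <- length(intersect(hat_keys, true_keys))   # linked and correct
false_pos <- length(setdiff(hat_keys, true_keys))    # linked but incorrect
false_neg <- length(setdiff(true_keys, hat_keys))    # correct but missed

precision <- true_pos / (true_pos + false_pos)
recall <- true_pos / (true_pos + false_neg)
f1 <- 2 * precision * recall / (precision + recall)
c(precision = precision, recall = recall, f1 = f1)
```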

Improvements & Future Development Plan

We're always looking to improve the software in terms of ease-of-use and its capabilities. If you have any suggestions/feedback, or need further assistance in getting the package working for your analysis, please email connor.jerzak@gmail.com.

In future releases, we will expand the merge capabilities. Currently, only inner joins are supported (equivalent to setting all = FALSE in the merge function from base R); future releases will also support left, right, and outer joins.
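
As a reminder of what these join types mean, here is a small base-R illustration using merge on exact keys. The data frames and the id column are invented for this example and are unrelated to the LinkOrgs interface.

```r
# Toy data with a shared exact key (hypothetical; not LinkOrgs output)
x <- data.frame(id = c(1, 2, 3), name_x = c("apple", "oracle", "enron inc."))
y <- data.frame(id = c(2, 3, 4), name_y = c("oracle inc", "enron", "mcdonalds"))

# Inner join: keeps only ids present in both x and y
inner <- merge(x, y, by = "id", all = FALSE)

# Left join: keeps every row of x, filling NA where y has no match
left <- merge(x, y, by = "id", all.x = TRUE)

# Full outer join: keeps every id from either data frame
full <- merge(x, y, by = "id", all = TRUE)

nrow(inner)
nrow(left)
nrow(full)
```

Under an inner join, any record in x without a counterpart in y is dropped from the result, which is worth keeping in mind when counting unmatched organizations.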

Acknowledgments

We thank Beniamino Green, Kosuke Imai, Gary King, Xiang Zhou, and members of the Imai Research Workshop for valuable feedback. We would also like to thank Gil Tamir and Xiaolong Yang for excellent research assistance.

License

Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0). This package is for academic and non-commercial use only.

References

Brian Libgober, Connor T. Jerzak. "Linking Datasets on Organizations Using Half a Billion Open-Collaborated Records." Political Science Research and Methods, 2024. [PDF] [Dataverse] [Hugging Face]

@article{libgober2024linking,
  title={Linking Datasets on Organizations Using Half a Billion Open-Collaborated Records},
  author={Libgober, Brian and Jerzak, Connor T.},
  journal={Political Science Research and Methods},
  year={2024},
  pages={},
  publisher={Cambridge University Press}
}

Related work

Green, Beniamino. "Zoomerjoin: Superlatively-Fast Fuzzy Joins." Journal of Open Source Software 8:89 5693-5698, 2023. [PDF]
