
LinkOrgs: An R package for linking records on organizations using half-a-billion open-collaborated records from LinkedIn

Lifecycle: experimental

What is LinkOrgs? | Installation | Tutorial | Comparison with Ground Truth | References | Documentation


Note: You can access a point-and-click implementation online here.

What is LinkOrgs?

LinkOrgs is an R package for organizational record linkage that leverages half-a-billion open-collaborated records from LinkedIn. It provides multiple matching algorithms optimized for different use cases:

Algorithm | Internet Required | ML Backend Required | Speed | Best For
----------|-------------------|---------------------|-------|---------
fuzzy | No | No | Fast | Simple name matching
bipartite | Yes | No | Medium | Network-informed matching; best for organizations with a LinkedIn presence (~2017)
markov | Yes | No | Medium | Network-informed matching; best for organizations with a LinkedIn presence (~2017)
ml | Yes | Yes | Slower | High-accuracy semantic matching
transfer | Yes | Yes | Slower | Combined network + ML approach
  • Fuzzy matching (algorithm="fuzzy"): Fast, parallelized string-distance matching using Jaccard, Jaro-Winkler, or other string-distance measures
  • Network-based (algorithm="bipartite" or "markov"): Uses LinkedIn's organizational network structure for improved accuracy
  • Machine learning (algorithm="ml"): Transformer-based embeddings (requires JAX backend setup via BuildBackend())
  • Combined (algorithm="markov" + DistanceMeasure="ml"): Network + ML hybrid approach

Installation

The most recent version of LinkOrgs can be installed directly from the repository using the devtools package:

# install package 
devtools::install_github("cjerzak/LinkOrgs-software/LinkOrgs")

The machine-learning-based algorithm, accessible via the algorithm = "ml" option, relies on a JAX backend. The network-based linkage approaches (algorithm = "bipartite" and algorithm = "markov") do not require it. To set up the machine-learning backend, you can call:

# install ML backend  
LinkOrgs::BuildBackend(conda = "auto")

Note that most package options require Internet access in order to download the saved machine learning model parameters and LinkedIn-based network information.
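To confirm that the machine-learning backend environment was created, you can list the available conda environments with reticulate. This is a minimal sketch, assuming you supplied conda_env = "LinkOrgs_env" to BuildBackend as in the tutorial below:

# optional sanity check via reticulate
# (assumes conda_env = "LinkOrgs_env" was passed to BuildBackend, as in the tutorial below)
envs <- reticulate::conda_list()
"LinkOrgs_env" %in% envs$name   # TRUE if the environment was created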

Quick Start

library(LinkOrgs)

# Sample data
x <- data.frame(org = c("Apple Inc", "Microsoft Corp"))
y <- data.frame(org = c("Apple", "Microsoft Corporation"))

# Link organizations using fuzzy matching
result <- LinkOrgs(x = x, y = y, by.x = "org", by.y = "org",
                   algorithm = "fuzzy", AveMatchNumberPerAlias = 2)
print(result)
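The returned object can be inspected like an ordinary merged data set; the exact columns depend on the algorithm and the inputs:

# inspect the linked records (exact columns depend on the algorithm and inputs)
head(result)
colnames(result)
nrow(result)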

Tutorial

After installing the package, let's walk through a short example.

# load in package 
library(LinkOrgs)

# set up synthetic data for the merge 
x_orgnames <- c("apple","oracle","enron inc.","mcdonalds corporation")
y_orgnames <- c("apple corp","oracle inc","enron","mcdonalds")
x <- data.frame("orgnames_x"=x_orgnames)
y <- data.frame("orgnames_y"=y_orgnames)

After creating these synthetic datasets, we're ready to merge them. We can do this in a number of ways; see the paper listed in the References section for guidance on which approach may be most useful for your merge task.

First, we'll try a merge using parallelized fast fuzzy matching via LinkOrgs::LinkOrgs. A key hyperparameter is AveMatchNumberPerAlias, which controls the expected number of matches per alias (because this target is calibrated with an initial random sampling step, the exact size of the matched dataset won't be an exact multiple of AveMatchNumberPerAlias). Here, we set AveMatchNumberPerAlias large relative to the dataset size so that, for illustration purposes, every observation in this small example can potentially be matched against every other.

# perform merge using (parallelized) fast fuzzy matching
# LinkOrgs::LinkOrgs can be readily used for non-organizational name matches 
# when doing pure parallelized fuzzy matching 
z_linked_fuzzy <- LinkOrgs::LinkOrgs(x  = x,
                        y =  y,
                        by.x = "orgnames_x",
                        by.y = "orgnames_y",
                        algorithm = "fuzzy", 
                        DistanceMeasure = "jaccard", 
                        AveMatchNumberPerAlias = 4)
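Other string-distance measures can be swapped in through DistanceMeasure. For example, a Jaro-Winkler merge might look like the sketch below; the "jw" value is an assumption based on common string-distance naming conventions, so check the package documentation for the accepted values:

# same merge using a Jaro-Winkler distance instead of Jaccard
# (the "jw" measure name is an assumption; see ?LinkOrgs for accepted values)
z_linked_jw <- LinkOrgs::LinkOrgs(x = x,
                                  y = y,
                                  by.x = "orgnames_x",
                                  by.y = "orgnames_y",
                                  algorithm = "fuzzy",
                                  DistanceMeasure = "jw",
                                  AveMatchNumberPerAlias = 4)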

Next, we'll try using some of the LinkedIn-calibrated approaches using LinkOrgs::LinkOrgs:

# perform merge using bipartite network approach
z_linked_bipartite <- LinkOrgs(x  = x, 
                     y =  y, 
                     by.x = "orgnames_x", 
                     by.y = "orgnames_y",
                     AveMatchNumberPerAlias = 10,
                     algorithm = "bipartite", 
                     DistanceMeasure = "jaccard")
                     
# perform merge using markov network approach
z_linked_markov <- LinkOrgs(x  = x, 
                     y =  y, 
                     by.x = "orgnames_x", 
                     by.y = "orgnames_y",
                     AveMatchNumberPerAlias = 10,
                     algorithm = "markov", 
                     DistanceMeasure = "jaccard")


# Build backend for ML model (run once before using algorithm="ml")
# LinkOrgs::BuildBackend(conda_env = "LinkOrgs_env", conda = "auto")
# If conda = "auto" fails, specify the path explicitly:
# LinkOrgs::BuildBackend(conda_env = "LinkOrgs_env",
#                        conda = "/path/to/miniforge3/bin/python")
                     
# perform merge using a machine learning approach
z_linked_ml <- LinkOrgs(x  = x, 
                     y =  y, 
                     by.x = "orgnames_x", 
                     by.y = "orgnames_y",
                     AveMatchNumberPerAlias = 10, 
                     algorithm = "ml", ml_version = "v1")
# note: use conda_env parameter to specify a different environment if needed
# note: ML versions v0-v4 are available with varying parameter counts (9M-31M). Default is v1.

# perform merge using combined network + machine learning approach
z_linked_combined <- LinkOrgs(x  = x, 
                     y =  y, 
                     by.x = "orgnames_x", 
                     by.y = "orgnames_y",
                     AveMatchNumberPerAlias = 10, 
                     AveMatchNumberPerAlias_network = 1, 
                     algorithm = "markov",
                     DistanceMeasure = "ml", ml_version = "v1")

# perform a merge using the ML approach, exporting name representations only
# (returns list(embedx = ..., embedy = ...) for manual linkage)
rep_joint <- LinkOrgs(x = x, y = y,
                      by.x = "orgnames_x",
                      by.y = "orgnames_y",
                      algorithm = "ml",
                      ExportEmbeddingsOnly = TRUE)

# returns list(embedx = ...)
rep_x <- LinkOrgs(x = x, y = NULL,
                  by.x = "orgnames_x",
                  algorithm = "ml",
                  ExportEmbeddingsOnly = TRUE)

# returns list(embedy = ...)
rep_y <- LinkOrgs(x = NULL, y = y,
                  by.y = "orgnames_y",
                  algorithm = "ml",
                  ExportEmbeddingsOnly = TRUE)
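With ExportEmbeddingsOnly = TRUE, the linkage itself is left to you. A minimal sketch of a manual nearest-neighbor match, assuming embedx and embedy are numeric matrices with one row per input name (rows ordered as in x and y):

# manual nearest-neighbor linkage on the exported embeddings
# (assumes embedx/embedy are numeric matrices with one row per name)
embedx <- rep_joint$embedx
embedy <- rep_joint$embedy

# cosine similarity between every x name and every y name
normx <- embedx / sqrt(rowSums(embedx^2))
normy <- embedy / sqrt(rowSums(embedy^2))
sim_xy <- normx %*% t(normy)

# best y candidate for each x name
best_j <- apply(sim_xy, 1, which.max)
data.frame(orgnames_x = x$orgnames_x,
           orgnames_y = y$orgnames_y[best_j],
           similarity = sim_xy[cbind(seq_len(nrow(sim_xy)), best_j)])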

Comparison of Results with Ground Truth

Using the package, we can also assess performance against a ground-truth merged dataset (if available):

# (After running the above code)
z_true <- data.frame("orgnames_x"=x_orgnames, "orgnames_y"=y_orgnames)

# Get performance matrix 
PerformanceMatrix <- AssessMatchPerformance(x  = x, 
                                            y =  y, 
                                            by.x = "orgnames_x", 
                                            by.y = "orgnames_y", 
                                            z = z_linked_fuzzy, 
                                            z_true = z_true)
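Printing PerformanceMatrix shows the resulting accuracy summary. If you also want a quick hand-rolled check of pairwise precision and recall against the ground truth, here is one sketch, assuming both data frames contain the orgnames_x and orgnames_y columns:

print(PerformanceMatrix)

# hand-rolled precision/recall on matched name pairs
# (assumes z_linked_fuzzy and z_true both contain orgnames_x and orgnames_y)
pairs_est  <- paste(z_linked_fuzzy$orgnames_x, z_linked_fuzzy$orgnames_y, sep = " || ")
pairs_true <- paste(z_true$orgnames_x, z_true$orgnames_y, sep = " || ")
c(precision = mean(pairs_est %in% pairs_true),
  recall    = mean(pairs_true %in% pairs_est))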

Improvements & Future Development Plan

We're always looking to improve the software in terms of ease-of-use and its capabilities. If you have any suggestions/feedback, or need further assistance in getting the package working for your analysis, please email connor.jerzak@gmail.com.

In future releases, we will be expanding the merge capabilities. Currently, we only allow inner joins (equivalent to setting all = FALSE in base R's merge function); future releases will also allow left, right, and full outer joins.
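For reference, the inner-join behavior mentioned above mirrors base R's merge with all = FALSE, which keeps only keys present in both inputs. A toy illustration (the data frames a and b here are made up for illustration):

# base-R inner join for comparison: only keys present in both inputs are kept
a <- data.frame(id = c(1, 2, 3), name_a = c("apple", "oracle", "enron"))
b <- data.frame(id = c(2, 3, 4), name_b = c("oracle inc", "enron", "mcdonalds"))
merge(a, b, by = "id", all = FALSE)  # returns rows for ids 2 and 3 only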

Acknowledgments

We thank Beniamino Green, Kosuke Imai, Gary King, Xiang Zhou, and members of the Imai Research Workshop for valuable feedback. We would also like to thank Gil Tamir and Xiaolong Yang for excellent research assistance.

License

Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0). This package is for academic and non-commercial use only.

References

Brian Libgober and Connor T. Jerzak. "Linking Datasets on Organizations Using Half a Billion Open-Collaborated Records." Political Science Research and Methods, 2024. [PDF] [Dataverse] [Hugging Face]

@article{libgober2024linking,
  title={Linking Datasets on Organizations Using Half a Billion Open-Collaborated Records},
  author={Libgober, Brian and Jerzak, Connor T.},
  journal={Political Science Research and Methods},
  year={2024},
  publisher={Cambridge University Press}
}

Related work

Green, Beniamino. "Zoomerjoin: Superlatively-Fast Fuzzy Joins." Journal of Open Source Software, 8(89): 5693–5698, 2023. [PDF]
