A free Python library for accurate and scalable deduplication and entity resolution.
Based on Mikhail Yuryevich Bilenko’s Ph.D. dissertation: Learnable Similarity Functions and their Application to Record Linkage and Clustering.
Current solutions break easily, don’t scale, and require significant developer time. Our solution is robust, can handle a large volume of data, and can be trained by anyone.
- numpy (numpy.scipy.org/)
- hierarchical clustering depends upon
  - fastcluster (math.stanford.edu/~muellner/fastcluster.html)
  - hcluster (code.google.com/p/scipy-cluster/)
- Forest Gregg fgregg@gmail.com
- Derek Eder derek.eder@gmail.com
> python setup.py install
> python examples/active_canonical_example.py
(Use the 'y', 'n' and 'u' keys to flag duplicates for active learning.)
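The key-driven labeling step can be pictured with a minimal sketch. This is not the library's code: `label_pairs` and its `ask` parameter are hypothetical names, and reading 'y' as "duplicate", 'n' as "not a duplicate" and 'u' as "unsure" is an assumption based on the key names.

```python
def label_pairs(pairs, ask=input):
    """Collect y/n/u answers for candidate record pairs.

    Hypothetical helper for illustration only; it is not part of the library.
    `ask` is any callable that takes a prompt and returns the typed answer.
    """
    # Assumed meaning of the keys: y = duplicate, n = distinct, u = unsure.
    key_to_label = {"y": "duplicate", "n": "distinct", "u": "unsure"}
    labels = {"duplicate": [], "distinct": [], "unsure": []}

    for record_a, record_b in pairs:
        answer = ""
        # Keep asking until one of the three recognised keys is entered.
        while answer not in key_to_label:
            prompt = f"{record_a!r} vs {record_b!r} - duplicate? (y/n/u) "
            answer = ask(prompt).strip().lower()
        labels[key_to_label[answer]].append((record_a, record_b))

    return labels
```

Passing the prompt function as a parameter keeps the loop testable: a script can supply canned answers instead of reading from the keyboard.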
As we continue to refine this library, we have added several datasets to test against. The example scripts for them can all be run from the examples/ directory.
The following use human input to flag duplicates:
- active_canonical_example.py - 864 rows. Canonical restaurant dataset from Bilenko’s research.
- early_childhood.py - 3,720 rows. Compilation of 9 datasets containing locations for early childhood education in Chicago.
- tech_locator.py - 852 rows. Compilation of 2 lists of locations of technology resources in the City of Chicago.
The following do not use human input:
- canonical_example.py - loads the canonical restaurant test data and trains on the provided known duplicates. Outputs precision and recall values.
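For readers unfamiliar with the two metrics, here is a minimal sketch of how precision and recall can be computed from a set of predicted duplicate pairs and a set of known duplicate pairs. The function name `precision_recall` and the pair-based representation are assumptions made for illustration; the example script may compute its values differently.

```python
def precision_recall(predicted_pairs, true_pairs):
    """Return (precision, recall) for predicted vs. known duplicate pairs.

    Hypothetical helper for illustration only; it is not part of the library.
    """
    # Treat pairs as unordered so (a, b) and (b, a) count as the same pair.
    predicted = {frozenset(pair) for pair in predicted_pairs}
    actual = {frozenset(pair) for pair in true_pairs}

    true_positives = len(predicted & actual)

    # Precision: share of predicted duplicates that are real duplicates.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of real duplicates that were found.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

With three predicted pairs of which two are correct, and four known duplicate pairs in total, precision is 2/3 and recall is 1/2.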
If something is not behaving intuitively, it is a bug, and should be reported. Report it here: github.com/open-city/dedupe/issues
- Fork the project.
- Make your feature addition or bug fix.
- Send us a pull request. Bonus points for topic branches.
Copyright © 2012 Forest Gregg and Derek Eder of Open City. Released under the MIT License.
See LICENSE for details: github.com/open-city/dedupe/wiki/License