A free Python library for accurate and scalable deduplication and entity resolution.
Based on Mikhail Yuryevich Bilenko’s Ph.D. dissertation, Learnable Similarity Functions and their Application to Record Linkage and Clustering.
Current solutions break easily, don’t scale, and require significant developer time. Our solution is robust, can handle a large volume of data, and can be trained by anyone.
- numpy (numpy.scipy.org/)
- Hierarchical clustering depends upon:
  - fastcluster (math.stanford.edu/~muellner/fastcluster.html)
  - hcluster (code.google.com/p/scipy-cluster/)
> python setup.py install
> python examples/active_canonical_example.py
(use 'y', 'n' and 'u' keys to flag duplicates for active learning)
During setup, several C files need to be compiled. For some reason, the clang compiler fails to build them correctly on OS X, so LLVM-GCC needs to be used instead. For more info, see issue #46.
As we continue to refine this library, we have added several datasets to test against. The example scripts that use them can all be run from the examples/ directory.
The following use human input to flag duplicates:
- active_canonical_example.py - 864 rows. Canonical restaurant dataset from Bilenko’s research.
- early_childhood.py - 3,720 rows. Compilation of 9 datasets containing locations for early childhood education in Chicago.
- tech_locator.py - 852 rows. Compilation of 2 lists of locations of technology resources in the City of Chicago.
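To make the human-input step concrete, here is a minimal sketch of a 'y' / 'n' / 'u' labeling loop. It is not the code used by the example scripts: the function name `label_pairs`, the `ask` parameter, and the reading of 'u' as "unsure" are all assumptions made for illustration.

```python
def label_pairs(candidate_pairs, ask=input):
    """Collect y/n/u answers for candidate record pairs (hypothetical helper)."""
    labels = {"match": [], "distinct": []}
    for record_a, record_b in candidate_pairs:
        answer = ""
        # Keep asking until one of the three accepted keys is given.
        while answer not in ("y", "n", "u"):
            print(record_a)
            print(record_b)
            answer = ask("Duplicates? (y)es / (n)o / (u)nsure: ").strip().lower()
        if answer == "y":
            labels["match"].append((record_a, record_b))
        elif answer == "n":
            labels["distinct"].append((record_a, record_b))
        # Assumption: "u" means unsure, so the pair is left unlabeled.
    return labels


# Scripted answers stand in for keyboard input so the sketch runs unattended.
pairs = [
    ({"name": "Cafe Luna"}, {"name": "Café Luna"}),
    ({"name": "Cafe Luna"}, {"name": "Blue Door Diner"}),
    ({"name": "Blue Door"}, {"name": "Blue Door Diner"}),
]
scripted = iter(["y", "x", "n", "u"])  # "x" is rejected and asked again
labels = label_pairs(pairs, ask=lambda prompt: next(scripted))
```

Passing the prompt function as a parameter keeps the loop testable; calling `label_pairs(pairs)` with the default `ask=input` would read answers from the keyboard.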
The following do not use human input:
- canonical_example.py - loads the canonical restaurant test data and trains on the provided known duplicates. Outputs precision and recall values.
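For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how precision and recall can be computed over duplicate pairs. It is an illustration only, not the code in canonical_example.py; the function name `pair_precision_recall` and the toy record IDs are assumptions.

```python
def pair_precision_recall(found_pairs, true_pairs):
    """Return (precision, recall) for predicted duplicate pairs."""
    # frozenset makes (a, b) and (b, a) count as the same pair.
    found = {frozenset(pair) for pair in found_pairs}
    true = {frozenset(pair) for pair in true_pairs}
    correct = found & true
    # Precision: share of predicted pairs that are real duplicates.
    precision = len(correct) / len(found) if found else 0.0
    # Recall: share of real duplicates that were predicted.
    recall = len(correct) / len(true) if true else 0.0
    return precision, recall


# Toy record IDs, invented for this example.
found = [(1, 2), (3, 4), (5, 6)]
known = [(2, 1), (3, 4), (7, 8), (9, 10)]
precision, recall = pair_precision_recall(found, known)
print(precision, recall)
```

Here two of the three predicted pairs are known duplicates, and two of the four known duplicates were found.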
- Forest Gregg, fgregg@gmail.com
- Derek Eder, derek.eder@gmail.com
If something is not behaving intuitively, it is a bug, and should be reported. Report it here: github.com/open-city/dedupe/issues
- Fork the project.
- Make your feature addition or bug fix.
- Send us a pull request. Bonus points for topic branches.
Copyright © 2012 Forest Gregg and Derek Eder of Open City. Released under the MIT License.
See LICENSE for details: github.com/open-city/dedupe/wiki/License