A free Python library for accurate and scalable deduplication and entity resolution.
Based on Mikhail Yuryevich Bilenko’s Ph.D. dissertation, Learnable Similarity Functions and their Application to Record Linkage and Clustering.
Current solutions break easily, don’t scale, and require significant developer time. Our solution is robust, can handle a large volume of data, and can be trained by anyone.
- numpy (numpy.scipy.org/)
- Forest Gregg
- Derek Eder derek.eder@opencityapps.org
    > python setup.py build_ext --inplace
    > python dedupe.py
      (use 'y', 'n' and 'u' keys to flag duplicates for active learning)
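The 'y', 'n' and 'u' prompts can be pictured as a small labeling loop. The sketch below is only an illustration of that idea, not dedupe's actual code: `label_pairs`, its `ask` parameter, and the restaurant names are all hypothetical, and the meaning assigned to each key (duplicate, distinct, unsure) is an assumption.

```python
def label_pairs(pairs, ask=input):
    """Collect y/n/u answers for candidate record pairs.

    Hypothetical helper for illustration; not part of dedupe's API.
    `ask` is any function that takes a prompt and returns the answer,
    so the loop can be driven by the keyboard or by a script.
    """
    # Assumed meaning of the keys: y = duplicate, n = distinct, u = unsure.
    keys = {"y": "duplicate", "n": "distinct", "u": "unsure"}
    labels = {"duplicate": [], "distinct": [], "unsure": []}
    for pair in pairs:
        answer = ""
        # Keep asking until one of the three accepted keys is given.
        while answer not in keys:
            prompt = f"{pair[0]!r} vs {pair[1]!r}: duplicate? (y/n/u) "
            answer = ask(prompt).strip().lower()
        labels[keys[answer]].append(pair)
    return labels


# Scripted answers stand in for the keyboard; "x" is rejected and re-asked.
answers = iter(["y", "x", "n", "u"])
demo = label_pairs(
    [
        ("Art's Deli", "Arts Delicatessen"),
        ("Spago", "Campanile"),
        ("Fenix", "Fenix at the Argyle"),
    ],
    ask=lambda prompt: next(answers),
)
```

Passing `ask=input` (the default) would turn this into an interactive session. The labeled pairs are the kind of training data an active learner would consume.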
- blocking.py - loads test data and finds optimum blocking predicates
- canonical_example.py - loads canonical restaurant test data, trains on the provided known duplicates, and outputs precision and recall values
- predicates.py - tests the functionality of defined predicates
- training_sample.py - tests active learning with user input
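For readers new to the terms used in the list above, here is a minimal sketch of what a blocking predicate does and how precision and recall can be computed over record pairs. It does not reproduce the contents of blocking.py or canonical_example.py: the function names, the `first_token` predicate, and the toy records are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def first_token(value):
    """Hypothetical blocking predicate: key a value by its first lowercase word."""
    tokens = value.lower().split()
    return tokens[0] if tokens else ""


def candidate_pairs(records, field, predicate):
    """Group records by predicate key and pair up records that share a key."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[predicate(record[field])].append(record_id)
    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are ever compared.
        pairs.update(frozenset(pair) for pair in combinations(ids, 2))
    return pairs


def precision_recall(found, known):
    """Precision and recall of found pairs against known duplicate pairs."""
    true_positives = len(found & known)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(known) if known else 0.0
    return precision, recall


# Invented toy data, not the canonical restaurant data set.
records = {
    1: {"name": "Spago"},
    2: {"name": "Spago Hollywood"},
    3: {"name": "Campanile"},
    4: {"name": "campanile restaurant"},
    5: {"name": "Spago Cafe"},
    6: {"name": "The Palm"},
    7: {"name": "Palm Restaurant"},
}
known = {frozenset({1, 2}), frozenset({3, 4}), frozenset({6, 7})}

found = candidate_pairs(records, "name", first_token)
precision, recall = precision_recall(found, known)
print(f"pairs={len(found)} precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the predicate misses the "The Palm" / "Palm Restaurant" pair and also groups an unrelated "Spago Cafe" with the other Spago records, which is the kind of trade-off a search for good blocking predicates has to weigh.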
If something is not behaving intuitively, it is a bug, and should be reported. Report it here: github.com/open-city/dedupe/issues
- Fork the project.
- Make your feature addition or bug fix.
- Send us a pull request. Bonus points for topic branches.
Copyright © 2012 Forest Gregg and Derek Eder of Open City. Released under the MIT License.
See LICENSE for details: github.com/open-city/dedupe/wiki/License