🆔 A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.

dedupeio/dedupe

Deduplication Library

A free Python library for accurate and scalable deduplication and entity resolution.

Based on Mikhail Yuryevich Bilenko’s Ph.D. dissertation, Learnable Similarity Functions and their Application to Record Linkage and Clustering.

Current solutions break easily, don’t scale, and require significant developer time. Our solution is robust, can handle a large volume of data, and can be trained by anyone.

Python Dependencies

Team

Usage

> python setup.py build_ext --inplace
> python dedupe.py
(use 'y', 'n' and 'u' keys to flag duplicates for active learning)
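The labeling step can be pictured with a minimal sketch. This is not the code in dedupe.py: the function name `label_pairs`, its `ask` parameter, the reading of 'u' as "unsure", and the sample restaurant names are all assumptions made for illustration. It covers only the collection of 'y', 'n' and 'u' answers, not how active learning chooses which pairs to show.

```python
def label_pairs(pairs, ask=input):
    """Collect a y/n/u answer for each candidate pair (hypothetical helper).

    `ask` is any callable that takes a prompt and returns a string; it
    defaults to the built-in input() so answers can be typed at a terminal.
    """
    labels = {"y": [], "n": [], "u": []}
    for left, right in pairs:
        answer = ""
        # Keep asking until the answer is one of the three accepted keys.
        while answer not in labels:
            answer = ask(f"{left!r} vs {right!r} - duplicate? (y/n/u): ")
            answer = answer.strip().lower()
        labels[answer].append((left, right))
    return labels


# Made-up example pairs, answered from a scripted list instead of the keyboard.
pairs = [
    ("Art's Deli", "Art's Delicatessen"),
    ("Art's Deli", "Arnie Morton's"),
    ("Cafe Bizou", "Bizou Cafe"),
]
scripted = iter(["y", "maybe", "n", "u"])
labels = label_pairs(pairs, ask=lambda prompt: next(scripted))
print(labels)
```

The scripted answer "maybe" is rejected and the same pair is asked again, so the second pair ends up under 'n'. Calling `label_pairs(pairs)` without `ask` would prompt interactively.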

Other Executable Modules

  • blocking.py - loads in test data and finds optimum blocking predicates

  • canonical_example.py - loads the canonical restaurant test data, trains on the provided known duplicates, and outputs precision and recall values

  • predicates.py - tests the functionality of defined predicates

  • training_sample.py - tests active learning with user input
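To make the terms in this list concrete, here is a small sketch of what a blocking predicate does and how pairwise precision and recall can be computed against known duplicates. It is an illustration under stated assumptions, not the code in blocking.py, predicates.py or canonical_example.py: the predicate `first_token_predicate`, the helper names, and the sample records are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def first_token_predicate(value):
    """Hypothetical blocking predicate: key a field by its first word."""
    tokens = value.lower().split()
    return tokens[0] if tokens else ""


def candidate_pairs(records, field, predicate):
    """Return pairs of record ids whose field values share a predicate key."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        key = predicate(record[field])
        if key:  # records with an empty key are not blocked together
            blocks[key].append(record_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def precision_recall(found_pairs, true_pairs):
    """Pairwise precision and recall of found pairs against known duplicates."""
    true_positives = len(found_pairs & true_pairs)
    precision = true_positives / len(found_pairs) if found_pairs else 0.0
    recall = true_positives / len(true_pairs) if true_pairs else 0.0
    return precision, recall


# Made-up records; the known duplicates are (1, 2) and (3, 4).
records = {
    1: {"name": "Arnie Morton's of Chicago"},
    2: {"name": "Arnie Morton's"},
    3: {"name": "Art's Delicatessen"},
    4: {"name": "Art's Deli"},
    5: {"name": "Art's Cafe"},
}
true_pairs = {(1, 2), (3, 4)}

found = candidate_pairs(records, "name", first_token_predicate)
print(sorted(found))
print(precision_recall(found, true_pairs))
```

Blocking only has to keep true duplicates in the same block while cutting the number of comparisons, so a predicate with high recall and modest precision, as in this toy data, can still be useful: the later pairwise scoring step is what separates real duplicates from records that merely share a block.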

Errors / Bugs

If something is not behaving intuitively, it is a bug, and should be reported. Report it here: github.com/open-city/dedupe/issues

Note on Patches/Pull Requests

  • Fork the project.

  • Make your feature addition or bug fix.

  • Send us a pull request. Bonus points for topic branches.

Copyright © 2012 Forest Gregg and Derek Eder of Open City. Released under the MIT License.

See LICENSE for details: github.com/open-city/dedupe/wiki/License