🆔 A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.

Dedupe Python Library

A free Python library for accurate and scalable deduplication and entity resolution.

Based on Mikhail Yuryevich Bilenko’s Ph.D. dissertation, Learnable Similarity Functions and their Application to Record Linkage and Clustering.

Current solutions break easily, don’t scale, and require significant developer time. Our solution is robust, can handle a large volume of data, and can be trained by anyone.

Python Dependencies

Usage

> python setup.py install
> python examples/active_canonical_example.py
(use 'y', 'n' and 'u' keys to flag duplicates for active learning)
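
To make the labeling step concrete, here is a minimal sketch of how 'y', 'n' and 'u' answers could be sorted into training labels. It is not the library's code: the function name `collect_labels`, the label names, and the sample restaurant pairs are all hypothetical and only illustrate the idea.

```python
def collect_labels(pairs, answers):
    """Sort candidate record pairs by a reviewer's 'y'/'n'/'u' answers.

    Hypothetical helper for illustration; not part of the dedupe API.
    """
    key_to_label = {"y": "duplicate", "n": "distinct", "u": "unsure"}
    labeled = {"duplicate": [], "distinct": [], "unsure": []}
    for pair, answer in zip(pairs, answers):
        key = answer.strip().lower()
        if key not in key_to_label:
            raise ValueError(f"expected 'y', 'n' or 'u', got {answer!r}")
        labeled[key_to_label[key]].append(pair)
    return labeled


# Made-up candidate pairs and answers, standing in for an interactive session.
pairs = [
    ("Art's Deli", "Arts Delicatessen"),
    ("Cafe Bizou", "Spago"),
    ("Campanile", "Campanile Restaurant"),
]
answers = ["y", "n", "u"]
labeled = collect_labels(pairs, answers)
```

In a sketch like this, only the pairs answered 'y' or 'n' would be used as training examples; pairs answered 'u' are set aside.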

Compiling

During setup, several C files need to be compiled. For some reason, the clang compiler fails to build them correctly on OS X, so LLVM-GCC needs to be used instead. For more info, see issue #46.

Example datasets

As we continue to refine this library, we have added several datasets to test against. These can all be executed from the examples/ directory.

The following use human input to flag duplicates:

  • active_canonical_example.py - 864 rows. Canonical restaurant dataset from Bilenko’s research.

  • early_childhood.py - 3,720 rows. Compilation of 9 datasets containing locations for early childhood education in Chicago.

  • tech_locator.py - 852 rows. Compilation of 2 lists of locations of technology resources in the City of Chicago.

The following do not use human input:

  • canonical_example.py - Loads the canonical restaurant test data, trains on the provided known duplicates, and outputs precision and recall values.
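
For readers unfamiliar with the two metrics, the sketch below shows one common way to compute precision and recall over duplicate pairs. It is an illustration under stated assumptions, not the code in canonical_example.py: the function name `precision_recall` and the record IDs are made up, and pairs are treated as unordered.

```python
def precision_recall(found_pairs, true_pairs):
    """Compare the duplicate pairs a run found against the known duplicates.

    Hypothetical helper for illustration; pairs are unordered, so (1, 2)
    and (2, 1) count as the same pair.
    """
    found = {frozenset(pair) for pair in found_pairs}
    true = {frozenset(pair) for pair in true_pairs}
    true_positives = len(found & true)
    # Precision: share of reported pairs that are real duplicates.
    precision = true_positives / len(found) if found else 0.0
    # Recall: share of real duplicates that were reported.
    recall = true_positives / len(true) if true else 0.0
    return precision, recall


# Made-up record IDs: four known duplicate pairs, three pairs reported.
true_pairs = [(1, 2), (3, 4), (5, 6), (7, 8)]
found_pairs = [(2, 1), (3, 4), (5, 9)]
precision, recall = precision_recall(found_pairs, true_pairs)
```

Here two of the three reported pairs are real duplicates, and two of the four known duplicates were found, so precision is 2/3 and recall is 1/2.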

Team

Errors / Bugs

If something is not behaving intuitively, it is a bug, and should be reported. Report it here: github.com/open-city/dedupe/issues

Note on Patches/Pull Requests

  • Fork the project.

  • Make your feature addition or bug fix.

  • Send us a pull request. Bonus points for topic branches.

Copyright © 2012 Forest Gregg and Derek Eder of Open City. Released under the MIT License.

See LICENSE for details: github.com/open-city/dedupe/wiki/License