Dedupe Python Library

A free Python library for accurate and scalable deduplication and entity resolution.

Based on Mikhail Yuryevich Bilenko’s Ph.D. dissertation, Learnable Similarity Functions and their Application to Record Linkage and Clustering.

Current solutions break easily, don’t scale, and require significant developer time. Our solution is robust, can handle a large volume of data, and can be trained by anyone.

See our presentation at ChiPy: pyvideo.org/video/973/big-data-de-duping

Python Dependencies

Usage

> python setup.py install
> python examples/active_canonical_example.py
(use 'y', 'n' and 'u' keys to flag duplicates for active learning)
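As a rough illustration of what the keyboard prompts collect, here is a minimal sketch of a labeling loop. It is not the library's own code: the function name `label_pairs`, the label keys `match` and `distinct`, and the reading of 'u' as "unsure, skip this pair" are all assumptions made for this example.

```python
def label_pairs(pairs, ask=input):
    """Ask for a y/n/u answer on each candidate pair and collect the labels.

    Hypothetical helper for illustration only; not part of dedupe.
    """
    labels = {"match": [], "distinct": []}
    for left, right in pairs:
        answer = ""
        # Keep asking until one of the three accepted keys is given.
        while answer not in ("y", "n", "u"):
            answer = ask(f"{left!r} vs {right!r}: duplicate? (y/n/u) ").strip().lower()
        if answer == "y":
            labels["match"].append((left, right))
        elif answer == "n":
            labels["distinct"].append((left, right))
        # 'u' is assumed to mean "unsure": the pair is left unlabeled.
    return labels


# Scripted answers stand in for the keyboard so the sketch runs unattended.
scripted = iter(["y", "x", "n", "u"])
demo_labels = label_pairs(
    [("a", "b"), ("c", "d"), ("e", "f")],
    ask=lambda prompt: next(scripted),
)
print(demo_labels)
```

Passing `ask` as a parameter keeps the loop testable; in an interactive session the default `input` would be used instead.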

Example datasets

As we continue to refine this library, we have added several datasets to test against. These can all be executed from the examples/ directory.

The following use human input to flag duplicates:

  • active_canonical_example.py - 864 rows. Canonical restaurant dataset from Bilenko’s research.

  • early_childhood.py - 3,720 rows. Compilation of 9 datasets containing locations for early childhood education in Chicago.

  • tech_locator.py - 852 rows. Compilation of 2 lists of locations of technology resources in the City of Chicago.

The following do not use human input:

  • canonical_example.py - loads the canonical restaurant test data, trains on the provided known duplicates, and outputs precision and recall values.
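For readers unfamiliar with the two metrics, the following is a minimal sketch of how precision and recall can be computed over duplicate pairs. It does not reproduce what canonical_example.py does internally; the helper names `pairs_from_clusters` and `precision_recall` and the toy record ids are assumptions made for this example.

```python
from itertools import combinations


def pairs_from_clusters(clusters):
    """Expand clusters of record ids into a set of unordered duplicate pairs."""
    pairs = set()
    for cluster in clusters:
        # Sorting gives each pair a single canonical (smaller, larger) form.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def precision_recall(found_pairs, true_pairs):
    """Precision: share of found pairs that are real duplicates.
    Recall: share of real duplicate pairs that were found."""
    true_positives = len(found_pairs & true_pairs)
    precision = true_positives / len(found_pairs) if found_pairs else 0.0
    recall = true_positives / len(true_pairs) if true_pairs else 0.0
    return precision, recall


# Toy data: known duplicates versus what a deduper might return.
known = pairs_from_clusters([{1, 2}, {3, 4, 5}])
found = pairs_from_clusters([{1, 2}, {3, 4}, {6, 7}])
precision, recall = precision_recall(found, known)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Scoring pairs rather than whole clusters means a partially correct cluster still earns credit for the pairs it gets right.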

Team

Errors / Bugs

If something is not behaving intuitively, it is a bug, and should be reported. Report it here: github.com/open-city/dedupe/issues

Note on Patches/Pull Requests

  • Fork the project.

  • Make your feature addition or bug fix.

  • Send us a pull request. Bonus points for topic branches.

Copyright © 2012 Forest Gregg and Derek Eder of Open City. Released under the MIT License.

See LICENSE for details: github.com/open-city/dedupe/wiki/License