Dedupe Python Library

A free Python library for accurate and scalable deduplication and entity resolution.

Based on Mikhail Yuryevich Bilenko’s Ph.D. dissertation: Learnable Similarity Functions and their Application to Record Linkage and Clustering

Current solutions break easily, don’t scale, and require significant developer time. Our solution is robust, can handle a large volume of data, and can be trained by anyone.

Python Dependencies

numpy (numpy.scipy.org/)
Hierarchical clustering depends on:
- fastcluster (math.stanford.edu/~muellner/fastcluster.html)
- hcluster (code.google.com/p/scipy-cluster/)

Team

Forest Gregg fgregg@gmail.com
Derek Eder derek.eder@gmail.com

Usage

> python setup.py install
> python examples/active_canonical_example.py
(use 'y', 'n' and 'u' keys to flag duplicates for active learning)
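The 'y', 'n' and 'u' prompt can be pictured as a small labeling loop. The following is a minimal sketch only, not the library's code: the function name label_pairs, the ask parameter, and the reading of 'u' as "unsure" are all assumptions made for illustration.

```python
def label_pairs(candidate_pairs, ask=input):
    """Collect a y/n/u answer for each candidate pair of records.

    Hypothetical helper for illustration; not part of the dedupe API.
    'u' is assumed here to mean "unsure".
    """
    keys = {"y": "duplicate", "n": "distinct", "u": "unsure"}
    labels = {"duplicate": [], "distinct": [], "unsure": []}
    for pair in candidate_pairs:
        answer = ""
        # Keep asking until one of the three accepted keys is entered.
        while answer not in keys:
            answer = ask("Same entity? (y/n/u) %r " % (pair,)).strip().lower()
        labels[keys[answer]].append(pair)
    return labels


# Scripted answers stand in for keyboard input so the sketch runs unattended.
scripted = iter(["y", "x", "n", "u"])
demo = label_pairs(
    [("a", "b"), ("c", "d"), ("e", "f")],
    ask=lambda prompt: next(scripted),
)
```

Passing the input function as a parameter keeps the loop testable; with the default ask=input it reads from the keyboard instead.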

Example datasets

As we continue to refine this library, we have added several datasets to test against. These can all be executed from the examples/ directory.

The following use human input to flag duplicates:

active_canonical_example.py - 864 rows. Canonical restaurant dataset from Bilenko’s research.
early_childhood.py - 3,720 rows. Compilation of 9 datasets containing locations for early childhood education in Chicago.
tech_locator.py - 852 rows. Compilation of 2 lists of locations of technology resources in the City of Chicago.

The following do not use human input:

canonical_example.py - loads the canonical restaurant test data and trains on the provided known duplicates. Outputs precision and recall values.
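For readers unfamiliar with the two measures, precision is the share of predicted duplicate pairs that are truly duplicates, and recall is the share of known duplicate pairs that were found. The sketch below shows the arithmetic on made-up record ids; the function names and the sample data are illustrative assumptions, and this is not how canonical_example.py itself is necessarily written.

```python
def normalize(pairs):
    """Treat (a, b) and (b, a) as the same unordered pair."""
    return {frozenset(pair) for pair in pairs}


def precision_recall(found_pairs, true_pairs):
    """Return (precision, recall) for predicted vs. known duplicate pairs."""
    found = normalize(found_pairs)
    true = normalize(true_pairs)
    true_positives = len(found & true)
    # Guard against empty inputs to avoid dividing by zero.
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(true) if true else 0.0
    return precision, recall


# Made-up record ids, purely for illustration.
known = [(1, 2), (3, 4), (5, 6), (7, 8)]
predicted = [(2, 1), (3, 4), (5, 9)]
precision, recall = precision_recall(predicted, known)
print("precision=%.2f recall=%.2f" % (precision, recall))
```

Here two of the three predicted pairs are correct, and two of the four known pairs were found.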

Errors / Bugs

If something is not behaving intuitively, it is a bug, and should be reported. Report it here: github.com/open-city/dedupe/issues

Note on Patches/Pull Requests

Fork the project.
Make your feature addition or bug fix.
Send us a pull request. Bonus points for topic branches.

Copyright

See LICENSE for details: github.com/open-city/dedupe/wiki/License
