Scalable master data management, identity resolution, entity resolution, and deduplication using ML
Updated Jul 3, 2026 - Java
Entity resolution (also known as data matching, data linkage, record linkage, and many other terms) is the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, and databases). Entity resolution is necessary when joining different data sets based on entities that may or may not share a common identifier (e.g., database key, URI, national identification number), which may be due to differences in record shape, storage location, or curator style or preference.
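As a rough illustration of matching records that share no common identifier, the sketch below compares two small sources by normalized name similarity plus an exact birth-date match. Everything in it is an assumption made for illustration: the `Person` shape, the field names, the ASCII-only normalization, the Levenshtein-based score, and the 0.85 threshold. It is not taken from any repository listed here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class RecordLinkageSketch {

    /** Hypothetical record shape: no shared key across sources, only descriptive fields. */
    static final class Person {
        final String source;
        final String name;
        final String birthDate;

        Person(String source, String name, String birthDate) {
            this.source = source;
            this.name = name;
            this.birthDate = birthDate;
        }
    }

    /** Lower-cases, replaces non-alphanumerics with spaces, trims, collapses whitespace (ASCII only). */
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9 ]", " ")
                .trim()
                .replaceAll("\\s+", " ");
    }

    /** Levenshtein edit distance computed with two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]: one minus the edit distance divided by the longer normalized length. */
    static double similarity(String a, String b) {
        String x = normalize(a);
        String y = normalize(b);
        int max = Math.max(x.length(), y.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(x, y) / max;
    }

    /** Assumed match rule: identical birth date and name similarity at or above the threshold. */
    static boolean sameEntity(Person p, Person q, double threshold) {
        return p.birthDate.equals(q.birthDate) && similarity(p.name, q.name) >= threshold;
    }

    /** Naive all-pairs comparison between two sources; returns the matched pairs. */
    static List<Person[]> link(List<Person> left, List<Person> right, double threshold) {
        List<Person[]> matches = new ArrayList<>();
        for (Person p : left) {
            for (Person q : right) {
                if (sameEntity(p, q, threshold)) {
                    matches.add(new Person[] {p, q});
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Person> crm = List.of(new Person("crm", "Jon Smith", "1980-01-02"));
        List<Person> billing = List.of(
                new Person("billing", "John  SMITH.", "1980-01-02"),
                new Person("billing", "Jane Smith", "1980-01-02"));
        for (Person[] pair : link(crm, billing, 0.85)) {
            System.out.println(pair[0].name + " <-> " + pair[1].name);
        }
    }
}
```

The all-pairs loop grows with the product of the two source sizes, so practical tools usually add a blocking or indexing step to limit which pairs are compared; that step is left out here to keep the sketch short.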
Idempotent, deduplicating message consumer for RocketMQ; supports MySQL or Redis as the idempotency table and works out of the box
Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.
Java DSL for (online) deduplication
WInte.r is a Java framework for end-to-end data integration. The WInte.r framework implements well-known methods for data pre-processing, schema matching, identity resolution, data fusion, and result evaluation.
A general purpose deduplication framework
🖇️ Record linkage tool used by https://cidacs.bahia.fiocruz.br/
PRIMAT - Private Matching Toolbox
Mirror of https://bitbucket.org/resteorts/smered
A Java-based, database-driven backup tool with multi-storage support and other nice things
Deduplication of EndNote and Zotero RIS files
Data bus based on Apache Kafka and consisting of separate components [copied from own private repos]
Spring Boot-based quiz leaderboard system that processes distributed quiz events, eliminates duplicate submissions, aggregates participant scores, and generates accurate leaderboards using idempotent event processing.
Distributed image deduplication service with block-level storage and hash-based compression
Client-side lossless image deduplication engine using block-level content-addressable storage
Plug-and-play Spring Boot idempotency library using Redis. Deduplication, TTL, metrics, logging.
Resilient Spring Boot data pipeline for distributed API polling, O(1) event deduplication, and leaderboard aggregation.
Project to help a brother find duplicates in his photos directory.
A UI application for File Deduplication using Hashing
Created by Halbert L. Dunn
Released 1946