Data de-duplication and record linkage in financial systems

Record matching is the task of identifying records, within one or multiple databases, that refer to the same entity. It is an important step in data integration, reconciliation, and data cleaning by de-duplication. Duplicate records often do not share a common key and contain erroneous data, which makes record matching a demanding task.

The objective of this project is to develop a record matching and data de-duplication technique for financial record systems using a cocktail approach.

Today, large collections of financial records are stored in databases, which may contain multiple records that refer to the same subject. A complete view of an entity can be built only by combining all the records that refer to it. Simple string matching is not a feasible way to detect duplicate records because of inconsistencies such as data entry errors, typographical errors, data in different formats, and missing data.

Record linkage algorithms fall into two broad categories: rule-based (heuristic) approaches and probabilistic approaches. In this project we use a cocktail algorithm, that is, we combine rule-based and probabilistic algorithms to get the best F-score and recall. Rule-based approaches depend critically on domain knowledge, and manually written rules often lead to errors, so the EM (Expectation Maximization) algorithm is used to generate the rules from the data itself. This combined model gives the best results among the approaches described above.
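The combination of a rule with an EM-trained probabilistic model can be illustrated with a minimal sketch. It is not the project's actual code: it assumes binary field-agreement vectors, a Fellegi-Sunter style model whose m and u probabilities are estimated by EM, and a single hand-written rule. The function names, the `account_no` field, and the 0.85 similarity threshold are all hypothetical choices made for illustration.

```python
import math
from difflib import SequenceMatcher


def similar(a, b, threshold=0.85):
    """Fuzzy agreement that tolerates typos, case and surrounding spaces."""
    ratio = SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()
    return ratio >= threshold


def compare(r1, r2, fields):
    """Binary agreement vector; a missing value counts as disagreement."""
    return tuple(
        int(bool(r1.get(f)) and bool(r2.get(f)) and similar(r1[f], r2[f]))
        for f in fields
    )


def _clamp(x, eps=1e-6):
    """Keep probabilities away from 0 and 1 so logs and ratios stay finite."""
    return min(max(x, eps), 1.0 - eps)


def em_fellegi_sunter(vectors, n_iter=50, p=0.1):
    """Estimate m (agreement prob. among matches), u (among non-matches)
    and p (share of matching pairs) from unlabeled comparison vectors."""
    k, n = len(vectors[0]), len(vectors)
    m, u = [0.9] * k, [0.1] * k
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        g = []
        for v in vectors:
            pm, pu = p, 1.0 - p
            for i, agree in enumerate(v):
                pm *= m[i] if agree else 1.0 - m[i]
                pu *= u[i] if agree else 1.0 - u[i]
            g.append(pm / (pm + pu))
        # M-step: re-estimate parameters from the soft assignments.
        total = sum(g)
        p = _clamp(total / n)
        for i in range(k):
            agree_m = sum(gj * v[i] for gj, v in zip(g, vectors))
            agree_u = sum((1.0 - gj) * v[i] for gj, v in zip(g, vectors))
            m[i] = _clamp(agree_m / max(total, 1e-6))
            u[i] = _clamp(agree_u / max(n - total, 1e-6))
    return m, u, p


def match_weight(v, m, u):
    """Sum of log2 likelihood ratios; higher means more likely a match."""
    return sum(
        math.log2(m[i] / u[i]) if agree
        else math.log2((1.0 - m[i]) / (1.0 - u[i]))
        for i, agree in enumerate(v)
    )


def is_duplicate(r1, r2, fields, m, u, threshold=0.0):
    """Cocktail decision: apply a hard rule first, then the learned weights."""
    # Hypothetical rule: identical non-empty account numbers always match.
    if r1.get("account_no") and r1.get("account_no") == r2.get("account_no"):
        return True
    return match_weight(compare(r1, r2, fields), m, u) > threshold
```

In use, `compare` would be run over candidate record pairs, the resulting vectors passed to `em_fellegi_sunter`, and the learned `m` and `u` then supplied to `is_duplicate`. The rule handles cases where domain knowledge is certain, while the EM-derived weights cover pairs the rule cannot decide.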
CodeHuman96/Data_Cleaning__financial_systems