
CodeHuman96/Data_Cleaning__financial_systems

Data_Cleaning__financial_systems

Data de-duplication and record linkage in financial systems

Record matching is an important process for data integration, reconciliation, and data cleaning by de-duplication. It is the task of identifying records within one or more databases that refer to the same entity. Duplicate records often do not share a common key and contain erroneous data, which makes record matching a demanding task.

The objective of this project is:

• Develop a record matching and data de-duplication technique for financial record systems using a cocktail approach.

Today, large collections of financial records are stored in databases, which may contain multiple records that refer to the same subject. Full information about an entity can be built by combining all the records that refer to it. Simple string matching is not a feasible option for detecting duplicate records because of inconsistencies such as data entry errors, typographical errors, data in different formats, and missing data.

Record linkage algorithms fall into two broad categories: rule-based (heuristic) approaches and probabilistic approaches. This project uses a cocktail algorithm, that is, it combines rule-based and probabilistic algorithms to get the best F-score and recall. In a rule-based approach, domain knowledge is critical, and manually written rules often lead to issues, so the EM (Expectation Maximization) algorithm is used to generate rules from the data itself. This combined model gives the best results.
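The idea of combining a rule with an EM-trained probabilistic matcher can be illustrated with a minimal Python sketch. This is not the project's actual code: the toy records, the field layout, the similarity threshold of 0.85, the "same account number" rule, and all function names are assumptions made for illustration. The probabilistic part follows the standard Fellegi-Sunter formulation, where EM estimates, for each field, the probability of agreement among true matches (m) and among non-matches (u).

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records: (name, account number, city).
records = [
    ("John Smith", "ACC-1001", "New York"),
    ("Jon Smith", "ACC-1001", "New York"),
    ("Mary Jones", "ACC-2002", "Boston"),
    ("Mary Jonse", "ACC-2002", "Boston"),
    ("Peter Brown", "ACC-3003", "Chicago"),
    ("Alice Green", "ACC-4004", "Denver"),
]

EPS = 1e-6  # keeps probabilities away from exactly 0 and 1


def clamp(x, lo=EPS, hi=1 - EPS):
    return min(max(x, lo), hi)


def agreement(a, b, threshold=0.85):
    """Binary agreement vector: 1 where two field values are similar enough."""
    return tuple(
        int(SequenceMatcher(None, x.lower(), y.lower()).ratio() >= threshold)
        for x, y in zip(a, b)
    )


def likelihoods(g, m, u, p):
    """Joint likelihood of vector g under the match and non-match classes."""
    pm, pu = p, 1 - p
    for gi, mi, ui in zip(g, m, u):
        pm *= mi if gi else 1 - mi
        pu *= ui if gi else 1 - ui
    return pm, pu


def em_fellegi_sunter(vectors, n_iter=50, p=0.1):
    """Estimate m, u and the match proportion p from unlabeled pairs."""
    k, n = len(vectors[0]), len(vectors)
    m, u = [0.9] * k, [0.1] * k  # assumed starting values
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        w = []
        for g in vectors:
            pm, pu = likelihoods(g, m, u, p)
            w.append(pm / (pm + pu))
        # M-step: re-estimate parameters from the soft assignments.
        total = clamp(sum(w), EPS, n - EPS)
        p = clamp(total / n)
        for i in range(k):
            m[i] = clamp(sum(wj * g[i] for wj, g in zip(w, vectors)) / total)
            u[i] = clamp(
                sum((1 - wj) * g[i] for wj, g in zip(w, vectors)) / (n - total)
            )
    return m, u, p


def rule_match(a, b):
    """Hypothetical hand-written rule: identical normalized account numbers."""
    return a[1].strip().upper() == b[1].strip().upper()


def find_duplicates(recs, cutoff=0.9):
    pairs = list(combinations(range(len(recs)), 2))
    vectors = [agreement(recs[i], recs[j]) for i, j in pairs]
    m, u, p = em_fellegi_sunter(vectors)
    found = []
    for (i, j), g in zip(pairs, vectors):
        pm, pu = likelihoods(g, m, u, p)
        posterior = pm / (pm + pu)
        # Cocktail: accept a pair if the rule fires or the model is confident.
        if rule_match(recs[i], recs[j]) or posterior >= cutoff:
            found.append((i, j))
    return found


duplicates = find_duplicates(records)
print(duplicates)
```

On real data, comparing every pair of records is quadratic in the number of records, so a blocking step (comparing only records that share, for example, a postcode or name prefix) would normally come first. The rule and the cutoff shown here are placeholders that would need tuning against labeled pairs to optimize F-score and recall.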
