This is the class project for CS784 (Data Models and Languages). In this project, we crawled two web sources to retrieve HTML data and performed information extraction to convert the HTML into two relational tables. We then used Magellan, a data matching system developed here at Wisconsin, to match the two tables. The project has three stages: data crawling and extraction, blocking, and matching.
Project webpage: https://blackruan.github.io/data-matching/.
We chose Yelp and Yellow Pages as our two web sources and crawled 9,947 restaurant HTML pages from Yelp and 28,787 from Yellow Pages. We then used the BeautifulSoup package to extract information from the HTML pages and convert it into two relational tables (CSV format).
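To illustrate the extraction step, here is a minimal sketch using BeautifulSoup. The directory name, output file, and selectors (`h1`, `address`, the `phone` class) are assumptions for illustration; the actual tags and class names on Yelp and Yellow Pages pages differ.

```python
# Minimal extraction sketch: parse crawled HTML files and write a CSV table.
# Field selectors below are hypothetical placeholders, not the real page markup.
import csv
import glob
from bs4 import BeautifulSoup

rows = []
for path in glob.glob("yelp_html/*.html"):      # hypothetical directory of crawled pages
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    name = soup.find("h1")                      # restaurant name
    address = soup.find("address")              # street address
    phone = soup.find(class_="phone")           # hypothetical class name for the phone number
    rows.append({
        "name": name.get_text(strip=True) if name else "",
        "address": address.get_text(strip=True) if address else "",
        "phone": phone.get_text(strip=True) if phone else "",
    })

with open("tableA.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "address", "phone"])
    writer.writeheader()
    writer.writerows(rows)
```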
After the previous stage, we had 9,947 tuples in table A (Yelp) and 28,787 tuples in table B (Yellow Pages). Matching these tables directly would mean examining approximately 280,000,000 tuple pairs, far too many for the matching system. We therefore applied a few simple rules to perform blocking before moving on to the next stage; see the blocking explanation for details on the rules. After blocking, the original 280,000,000 tuple pairs were reduced to roughly 21,000.
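To give a flavor of what blocking looks like in Magellan, the sketch below applies an overlap blocker on the restaurant name. This is an illustrative example only: the column names (`id`, `name`, `zipcode`) are assumptions, and the project's actual rules are the ones described in the blocking explanation.

```python
# Sketch of a blocking step with Magellan's (py_entitymatching) overlap blocker.
# Column names are assumptions; the project's real rules are documented separately.
import py_entitymatching as em

A = em.read_csv_metadata("tableA.csv", key="id")   # Yelp table
B = em.read_csv_metadata("tableB.csv", key="id")   # Yellow Pages table

# Keep only pairs whose restaurant names share at least one word.
ob = em.OverlapBlocker()
C = ob.block_tables(A, B, "name", "name",
                    word_level=True, overlap_size=1,
                    l_output_attrs=["name", "zipcode"],
                    r_output_attrs=["name", "zipcode"])
print(len(C))   # number of candidate tuple pairs surviving the blocker
```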
We performed cross-validation on each of the following methods to select the best matcher.
- Decision Tree (DT)
- Random Forest (RF)
- Support Vector Machine (SVM)
- Naive Bayes (NB)
- Logistic Regression (LG)
- Linear Regression (LN)
Finally, we selected Random Forest as our best matcher. Check the matching explanation for more details.
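The sketch below shows how this matcher selection can be run with Magellan's cross-validation utilities, building on the tables `A`, `B` and candidate set `C` from the blocking sketch above. The labeled sample, the `gold` label column, and the sample size are assumptions, and exact parameter names may vary across py_entitymatching versions.

```python
# Sketch of matcher selection via cross-validation in Magellan.
# Assumes A, B, and the candidate set C from the blocking sketch above.
import py_entitymatching as em

# Hand-label a small sample of candidate pairs (label_table opens a labeling GUI).
S = em.sample_table(C, 450)          # sample size is an assumption
G = em.label_table(S, "gold")        # 'gold' is the (assumed) label column name

# Generate features over the two tables and compute feature vectors for G.
F = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)
H = em.extract_feature_vecs(G, feature_table=F, attrs_after="gold")

# The six learning-based matchers compared in this project.
matchers = [em.DTMatcher(), em.RFMatcher(), em.SVMMatcher(),
            em.NBMatcher(), em.LogRegMatcher(), em.LinRegMatcher()]

# 5-fold cross-validation, selecting the matcher with the best F1 score.
result = em.select_matcher(matchers, table=H,
                           exclude_attrs=["_id", "ltable_id", "rtable_id", "gold"],
                           target_attr="gold", k=5,
                           metric_to_select_matcher="f1")
print(result["cv_stats"])
```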
