Entity Resolution

Matching algorithm for movies in Amazon and Rotten Tomatoes datasets

Task

Given two datasets, one from Amazon and one from Rotten Tomatoes, that describe the same movie entities, determine whether each pair of IDs in the test set (one ID from each dataset) refers to the same movie. Write a script that loads these datasets, runs a binary classification matching algorithm to decide whether the two movies in a pair are the same, and then measures the precision, recall, and F1-score of the algorithm on the training set. Final predictions are submitted to a leaderboard page on Instabase, where students compete in groups of 1-3.
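For reference, the three evaluation metrics named above can be computed directly from the true and predicted match labels. The following is a minimal sketch, not the code from this repository; the function name `precision_recall_f1` and the 0/1 label encoding (1 = same movie) are assumptions made for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = match, 0 = no match)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly predicted matches
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted match, actually different
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed matches

    # Guard against division by zero when there are no positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical labels for five candidate pairs.
    truth = [1, 1, 0, 0, 1]
    preds = [1, 0, 0, 1, 1]
    print(precision_recall_f1(truth, preds))
```

Because F1 is the harmonic mean of precision and recall, it equals both of them whenever the two values coincide.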

This assignment was done for the Spring 2017 course in Computing Systems for Data Science (COMS W4121).

Results

The final matching algorithm standardized the movie length features; cleaned the time and movie star variables; engineered new features, namely string similarity scores for the movie star and movie director fields of each pair and the difference in film length between the two movies in a pair; and ran a random forest classifier with optimized parameters on these new features. On the Instabase leaderboard, this model achieved a precision, recall, and F-measure of 91.25%.
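The feature engineering described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the implementation in `entity_resolution.py`: the record keys (`star`, `director`, `time`), the time formats handled by `parse_minutes`, the `-1.0` sentinel for missing lengths, and the use of `difflib.SequenceMatcher` as the similarity measure are all hypothetical choices.

```python
import re
from difflib import SequenceMatcher


def parse_minutes(text):
    """Convert strings such as '1 hr. 35 min.' or '95 min' to minutes (None if unparseable)."""
    if not text:
        return None
    lowered = text.lower()
    hours = re.search(r"(\d+)\s*h", lowered)
    minutes = re.search(r"(\d+)\s*m", lowered)
    if not hours and not minutes:
        return None
    total = int(hours.group(1)) * 60 if hours else 0
    total += int(minutes.group(1)) if minutes else 0
    return total


def similarity(a, b):
    """Similarity in [0, 1] between two strings after lowercasing and whitespace cleanup."""
    a = " ".join((a or "").lower().split())
    b = " ".join((b or "").lower().split())
    if not a or not b:
        return 0.0  # assumption: a missing value carries no evidence of a match
    return SequenceMatcher(None, a, b).ratio()


def pair_features(amazon, rotten):
    """Build the feature dict for one candidate pair of records (dicts with assumed keys)."""
    len_a = parse_minutes(amazon.get("time"))
    len_r = parse_minutes(rotten.get("time"))
    # Assumption: -1.0 marks pairs where either length is missing.
    length_diff = float(abs(len_a - len_r)) if len_a is not None and len_r is not None else -1.0
    return {
        "star_similarity": similarity(amazon.get("star"), rotten.get("star")),
        "director_similarity": similarity(amazon.get("director"), rotten.get("director")),
        "length_diff": length_diff,
    }


if __name__ == "__main__":
    # Hypothetical records, invented for illustration only.
    a = {"star": "Tom Hanks", "director": "Robert Zemeckis", "time": "2 hr. 22 min."}
    r = {"star": "tom  hanks", "director": "R. Zemeckis", "time": "142 min"}
    print(pair_features(a, r))
```

Feature vectors of this shape could then be passed to a random forest implementation such as scikit-learn's `RandomForestClassifier`; which library and parameter search the original script used is not stated here.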

cmeb45/entity_resolution
