movie-oracle

A movie recommendation system, with implementations of:

User-User Collaborative Filtering
Item-Item Collaborative Filtering
Matrix Factorization with Stochastic Gradient Descent
Online Learning for Rapid Recommendations

A Flask-based web application is available on Heroku at https://movie-oracle.herokuapp.com/movies. For usage, refer to the Usage section below. EDIT: The link is temporarily down.

Approach

MongoDB database

I started off by building a web scraper for IMDb. I used links.csv and ratings.csv from the GroupLens dataset to obtain a list of IMDb URLs. I then wrote a script that processed the 200 most-rated movies (ranked by the number of people who rated them) and separated them into bins, based on some genres, for 50 users. The sampling was done to approximate real-world movies more closely.

Action	Romance	Comedy	Drama	Horror	Sci-fi	Animation	Fantasy
30	17	29	37	10	29	20	29

I used mlab to host my MongoDB database and beautifulsoup4 to parse the HTML files.
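The sampling step described above (pick the most-rated movies, then bin them by genre) can be sketched roughly as follows. This is only an illustration: the function names, the tuple layout of the rating rows, and the genres_of lookup are assumptions, not the code in imdb2mongodb_scraper.py.

```python
from collections import Counter, defaultdict

def most_rated(rating_rows, n):
    """Return the n movie ids that appear in the most rating rows.

    rating_rows is assumed to be an iterable of (user_id, movie_id, rating).
    """
    counts = Counter(movie_id for _user_id, movie_id, _rating in rating_rows)
    return [movie_id for movie_id, _count in counts.most_common(n)]

def bin_by_genre(movie_ids, genres_of):
    """Group movie ids by genre; a movie with several genres lands in several bins.

    genres_of is a hypothetical dict mapping movie id -> list of genre names.
    """
    bins = defaultdict(list)
    for movie_id in movie_ids:
        for genre in genres_of.get(movie_id, []):
            bins[genre].append(movie_id)
    return dict(bins)
```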

Algorithms

User-User Collaborative Filtering

I optimized the calculations (especially the similarity matrix) by using numpy operations. The algorithm can also restrict the weighting to the top N most similar users. The similarity metric is cosine similarity. The algorithm returns the rating for an item, weighted by the similarity of other users to the target user.
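As a minimal sketch of this idea (not the code in collab_filter.py), the similarity matrix and a top-N weighted prediction might look like the following. It assumes unrated entries are stored as 0 in a float matrix; the function names and the top_n default are made up for the example.

```python
import numpy as np

def cosine_similarity(ratings):
    """Row-wise cosine similarity; unrated entries are assumed to be 0."""
    norms = np.linalg.norm(ratings, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for rows with no ratings
    unit = ratings / norms
    return unit @ unit.T

def predict_user_user(ratings, user, item, top_n=2):
    """Predict ratings[user, item] from the top_n most similar users."""
    sim = cosine_similarity(ratings)[user].copy()
    sim[user] = 0.0                   # a user is not their own neighbour
    sim[ratings[:, item] == 0] = 0.0  # skip users who have not rated the item
    neighbours = np.argsort(sim)[::-1][:top_n]
    weights = sim[neighbours]
    if weights.sum() == 0:
        return 0.0                    # no usable neighbour
    return float(weights @ ratings[neighbours, item] / weights.sum())
```

The whole similarity matrix comes from one matrix product, which is the kind of numpy vectorization referred to above.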

Item-Item Collaborative Filtering

This follows the same structure as user-based filtering, applied to the transpose of the initial user-item ratings matrix, and is likewise optimized with numpy operations. The algorithm returns the rating for an item, weighted by the similarity of that item to other items and by the user's ratings for those items.
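The transpose trick can be illustrated with a short sketch. As before, this assumes a float matrix with 0 for unrated entries, and predict_item_item is a hypothetical name rather than a function from the repository.

```python
import numpy as np

def predict_item_item(ratings, user, item):
    """Predict ratings[user, item] from the user's ratings of similar items."""
    item_vectors = ratings.T  # rows are now items, columns are users
    norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0   # avoid dividing by zero for unrated items
    unit = item_vectors / norms
    sim = unit @ unit.T       # item-item cosine similarity

    weights = sim[item].copy()
    weights[item] = 0.0                # exclude the item itself
    weights[ratings[user] == 0] = 0.0  # use only items this user has rated
    total = weights.sum()
    if total == 0:
        return 0.0                     # nothing to base a prediction on
    return float(weights @ ratings[user] / total)
```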

Matrix Factorization

I represented the ratings matrix as the product of two matrices, whose rows are user and item vectors of latent factors. I included user and item biases for ratings and L2 regularization in the loss function. I used stochastic gradient descent to optimize the latent factors, after deriving the update rules myself. I have managed to optimize it somewhat with numpy, and it converges in 400 to 600 iterations.
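A compact sketch of biased matrix factorization trained with SGD is shown below. It uses the standard textbook update rules, which may differ in detail from the ones derived for this project; the hyperparameters (n_factors, lr, reg, n_epochs) are illustrative guesses, and unrated entries are assumed to be NaN.

```python
import numpy as np

def factorize(ratings, n_factors=2, lr=0.01, reg=0.02, n_epochs=500, seed=0):
    """Fit r_ui ~ mu + b_u + b_i + p_u . q_i on the observed (non-NaN) entries."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    P = rng.normal(scale=0.1, size=(n_users, n_factors))  # user latent factors
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))  # item latent factors
    b_u = np.zeros(n_users)
    b_i = np.zeros(n_items)
    mu = np.nanmean(ratings)                               # global mean rating
    observed = np.argwhere(~np.isnan(ratings))

    for _ in range(n_epochs):
        rng.shuffle(observed)                              # visit ratings in random order
        for u, i in observed:
            err = ratings[u, i] - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            p_old = P[u].copy()                            # use the old p_u when updating q_i
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return mu, b_u, b_i, P, Q

def predict_all(mu, b_u, b_i, P, Q):
    """Dense matrix of predicted ratings for every user-item pair."""
    return mu + b_u[:, None] + b_i[None, :] + P @ Q.T
```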

Online Learning Rapid Recommendation

Convergence of the validation loss for the previous method usually takes more than 30 seconds, after which Heroku times out the request and the application crashes. To work around this, I performed gradient descent on a single vector representing the user's input ratings and trained its latent factors using the optimized item_latent_factor matrix and the optimized item bias calculated in the previous method. This allows rapid online training to produce recommendations without having to factorize the original ratings matrix again.
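The fold-in idea can be sketched as follows: the item factors and item biases stay fixed, and only the new user's vector and bias are trained. This is a simplified batch gradient descent under assumed hyperparameters (lr, reg, n_steps); the function name and the global_mean argument are illustrative, not taken from app.py.

```python
import numpy as np

def fold_in_user(user_ratings, item_factors, item_bias, global_mean,
                 lr=0.05, reg=0.02, n_steps=200):
    """Train one user's latent vector against fixed item factors.

    user_ratings holds NaN for unrated items. Returns predicted ratings
    for every item.
    """
    rated = ~np.isnan(user_ratings)
    Q = item_factors[rated]                                  # factors of rated items only
    target = user_ratings[rated] - global_mean - item_bias[rated]
    p = np.zeros(item_factors.shape[1])                      # new user's latent factors
    b = 0.0                                                  # new user's bias

    for _ in range(n_steps):
        err = target - (Q @ p + b)
        p += lr * (Q.T @ err - reg * p)                      # gradient step on the factors
        b += lr * (err.sum() - reg * b)                      # gradient step on the bias
    return global_mean + b + item_bias + item_factors @ p
```

Because only a handful of parameters are trained on a handful of ratings, each request needs far less work than refitting the full factorization.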

Deciding K

K is the number of ratings the user must input in order to receive K recommendations. I calculated the RMSE of the predicted ratings for the top K movies returned by the matrix factorization algorithm against the actual ratings given by the user. I did this by keeping only K random ratings in the user's ratings vector and marking the rest as unrated (NaN). Then, using the matrix factorization algorithm, I got predictions for the other items and calculated the RMSE for the top K. I then averaged the RMSE values for each value of K over 10 randomly selected users. Keeping user convenience in mind, I picked K from 3, 5, and 7.

The optimal value is K = 5, at least for my dataset.

(The orange line is the average RMSE for K. The blue line is a typical example.)
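The masking and scoring steps of this experiment might be sketched as below. Both helper names are hypothetical, and the choice to score only held-out items that have a true rating is an assumption about how the evaluation in determine_k.py works.

```python
import numpy as np

def keep_k_ratings(user_ratings, k, rng):
    """Return a copy of the vector with only k random ratings kept, the rest NaN."""
    rated = np.flatnonzero(~np.isnan(user_ratings))
    kept = rng.choice(rated, size=k, replace=False)
    masked = np.full_like(user_ratings, np.nan)
    masked[kept] = user_ratings[kept]
    return masked

def rmse_top_k(predicted, actual, masked, k):
    """RMSE over the k highest-predicted items among the held-out rated items."""
    held_out = np.flatnonzero(~np.isnan(actual) & np.isnan(masked))
    order = np.argsort(predicted[held_out])[::-1][:k]  # best predictions first
    top = held_out[order]
    return float(np.sqrt(np.mean((predicted[top] - actual[top]) ** 2)))
```

Averaging rmse_top_k over several users for each candidate K would give a curve like the one described in this section.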

Usage

Open the Heroku link. Rate movies individually, i.e. insert a number from 1 to 5 in the text box and click "Rate!". Wait for the page to refresh, then continue with the next rating. Please note that the app can't take multiple values at once, so each movie must be rated one at a time. After five inputs, the application takes you to the Recommendation page.

Run the application locally using python3 app.py.

The web application was made using flask. It communicates with the MongoDB database through 'POST' and 'GET' requests using pymongo.

Bugs

There is a bug on the front end: due to a logic error, Null objects are inserted into the database when a movie is rejected. If such a movie is recommended by the application, it appears as () on the page.

Directory Structure and Files

There are no directories apart from templates, which stores the HTML files, and mf_plots, which stores the RMSE plots over values of K (the number of items rated by a user). Main components:

collab_filter.py: All Matrix Factorization and Collaborative Filtering Algorithms
app.py: The main web application, prints debug notes in the console. Run the application locally using python3 app.py
determine_k.py: Used for computing the optimal K value.
imdb2mongodb_scraper.py: Used for building MongoDB database on mlab.
get_opt_matrices.py: Used to compute optimized_item_matrix and optimized_item_bias.

Dependencies

Python 3.7: Heroku uses the latest version of python.
pymongo: To communicate with the MongoDB database hosted at mlab and handle all requests.
beautifulsoup4: To use for web scraping.
flask: Front-end and back-end related interfaces.
numpy: To process matrix and vector operations.
