A movie recommendation system, with implementations of:
- User-User Collaborative Filtering
- Item-Item Collaborative Filtering
- Matrix Factorization with Stochastic Gradient Descent
- Online Learning for Rapid Recommendations
`Flask`-based web application available on Heroku at this link: https://movie-oracle.herokuapp.com/movies
For usage, refer to the Usage section below. EDIT: The link is temporarily down.
I started off by building a web scraper for IMDb. I used `links.csv` and `ratings.csv` from the GroupLens dataset to obtain a list of IMDb URLs. I then wrote a script that processed the 200 most-rated movies (ranked by the number of people who rated each one) and separated them into genre-based bins, for 50 users. The sampling was done to approximate the real-world mix of movies more closely.
Action | Romance | Comedy | Drama | Horror | Sci-fi | Animation | Fantasy |
---|---|---|---|---|---|---|---|
30 | 17 | 29 | 37 | 10 | 29 | 20 | 29 |
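
The selection step described above could be sketched roughly as follows. This is not the original script: the function name `most_rated_imdb_urls` is hypothetical, the column names assume the standard MovieLens layout of `ratings.csv` (`userId,movieId,rating,timestamp`) and `links.csv` (`movieId,imdbId,tmdbId`), and the URL pattern assumes IMDb title IDs are zero-padded to at least seven digits. Genre binning is left out.

```python
import csv
import io
from collections import Counter

def most_rated_imdb_urls(ratings_file, links_file, top_n=200):
    """Return IMDb URLs for the top_n movies with the most ratings."""
    # Count how many ratings each movie received.
    counts = Counter(row["movieId"] for row in csv.DictReader(ratings_file))
    # Map MovieLens movie IDs to IMDb IDs.
    imdb_ids = {row["movieId"]: row["imdbId"] for row in csv.DictReader(links_file)}
    return [
        f"https://www.imdb.com/title/tt{imdb_ids[movie_id].zfill(7)}/"
        for movie_id, _ in counts.most_common(top_n)
        if movie_id in imdb_ids
    ]

# Tiny in-memory stand-ins for the two CSV files.
ratings_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n1,1,4.0,0\n2,1,5.0,0\n1,2,3.0,0\n"
)
links_csv = io.StringIO("movieId,imdbId,tmdbId\n1,0114709,862\n2,0113497,8844\n")
urls = most_rated_imdb_urls(ratings_csv, links_csv, top_n=1)
```

With real data, the two arguments would be open file handles for `ratings.csv` and `links.csv`.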
I used `mlab` to host my MongoDB database and `beautifulsoup4` to parse the HTML files.
I was able to optimize the calculations (especially the `Similarity Matrix`) by using numpy operations. I also enabled it to weight only the top N most similar users. The similarity metric is cosine similarity. The algorithm returns ratings for an item weighted by the similarity of other users to the target user.
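
A minimal sketch of this idea, not the code in `collab_filter.py`: the function names are hypothetical, and unrated entries are assumed to be stored as 0, which is a simplification that lets the whole similarity matrix come from one matrix product.

```python
import numpy as np

def cosine_similarity(ratings, eps=1e-9):
    """Cosine similarity between every pair of rows (users)."""
    dot = ratings @ ratings.T
    norms = np.sqrt(np.diag(dot)) + eps  # eps avoids division by zero
    return dot / np.outer(norms, norms)

def predict_user_user(ratings, top_n=2):
    """Predict ratings as a similarity-weighted average over the top_n neighbours."""
    sim = cosine_similarity(ratings)
    np.fill_diagonal(sim, 0.0)  # a user should not count as their own neighbour
    pred = np.zeros_like(ratings, dtype=float)
    for u in range(ratings.shape[0]):
        neighbours = np.argsort(sim[u])[-top_n:]  # indices of the most similar users
        weights = sim[u, neighbours]
        pred[u] = weights @ ratings[neighbours] / (np.abs(weights).sum() + 1e-9)
    return pred

# Rows are users, columns are movies, 0 means "not rated" in this sketch.
ratings = np.array(
    [[5, 4, 0, 1], [4, 5, 0, 1], [1, 0, 5, 4], [0, 1, 4, 5]], dtype=float
)
predictions = predict_user_user(ratings, top_n=1)
```

With `top_n=1`, each user's predictions are simply the ratings of their single closest neighbour; larger values blend several neighbours.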
As with user-based filtering, I took the transpose of the initial user-item ratings matrix and followed a similar structure, again optimized with numpy operations. The algorithm returns ratings for an item weighted by the similarity of that item to other items and by the user's ratings for those items.
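
The item-based variant can be sketched the same way. Again this is an illustration under assumptions rather than the project's code: `predict_item_item` is a hypothetical name, unrated entries are assumed to be 0, and all items are used as neighbours instead of only the top N.

```python
import numpy as np

def predict_item_item(ratings, eps=1e-9):
    """Predict ratings from item-item cosine similarity."""
    item_vectors = ratings.T  # after the transpose, rows are items
    dot = item_vectors @ item_vectors.T
    norms = np.sqrt(np.diag(dot)) + eps
    item_sim = dot / np.outer(norms, norms)
    # Each prediction is the user's own ratings weighted by item similarity.
    return ratings @ item_sim / (np.abs(item_sim).sum(axis=0) + eps)

ratings = np.array(
    [[5, 4, 0, 1], [4, 5, 0, 1], [1, 0, 5, 4], [0, 1, 4, 5]], dtype=float
)
item_predictions = predict_item_item(ratings)
```

In this toy matrix the first user likes the first two movies, so the prediction for the first movie comes out higher than for the third.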
I approximated the ratings matrix with two matrices of user and item vectors containing latent factors. I included user and item biases for ratings and L2 regularization in the loss function. I used `stochastic gradient descent` to optimize the latent factors after deriving the update rules myself. I have optimized it somewhat with numpy, and it converges in 400 to 600 iterations.
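
For illustration, a biased matrix factorization trained with SGD might look like the sketch below. The function name `factorize`, the hyperparameter defaults, and the use of a global mean term are assumptions, not values taken from this project; unrated entries are assumed to be NaN.

```python
import numpy as np

def factorize(ratings, n_factors=2, lr=0.01, reg=0.02, n_epochs=200, seed=0):
    """Fit r_ui ~ mu + b_u + b_i + p_u . q_i on the observed (non-NaN) entries."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    P = rng.normal(scale=0.1, size=(n_users, n_factors))  # user latent factors
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))  # item latent factors
    b_u = np.zeros(n_users)
    b_i = np.zeros(n_items)
    mu = np.nanmean(ratings)
    observed = np.argwhere(~np.isnan(ratings))
    for _ in range(n_epochs):
        rng.shuffle(observed)  # visit the known ratings in a new random order
        for u, i in observed:
            err = ratings[u, i] - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])
            # Gradient steps on the squared error with L2 regularization.
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            p_old = P[u].copy()  # use the pre-update user vector for the item step
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return mu, b_u, b_i, P, Q
```

The full prediction matrix is then `mu + b_u[:, None] + b_i[None, :] + P @ Q.T`.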
Convergence of the validation loss for the previous method usually takes more than 30 seconds, after which Heroku times out the request and the application crashes. To work around this, I performed gradient descent on a single vector representing the user's input ratings and trained its latent factors using the `optimized item_latent_factor matrix` and `optimized item bias` I calculated in the previous method. This allows rapid online training to produce recommendations without having to re-solve the original ratings matrix.
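
A rough sketch of this online step, assuming the item factors and item biases are already trained and held fixed. The name `fold_in_user`, the `global_mean` argument, and the hyperparameters are hypothetical; unrated entries are assumed to be NaN.

```python
import numpy as np

def fold_in_user(user_ratings, item_factors, item_bias, global_mean,
                 lr=0.01, reg=0.02, n_steps=500, seed=0):
    """Train one new user's vector and bias against fixed item parameters."""
    rng = np.random.default_rng(seed)
    p = rng.normal(scale=0.1, size=item_factors.shape[1])  # new user's latent factors
    b = 0.0                                                # new user's bias
    rated = np.flatnonzero(~np.isnan(user_ratings))
    for _ in range(n_steps):
        for i in rng.permutation(rated):
            err = user_ratings[i] - (global_mean + b + item_bias[i] + item_factors[i] @ p)
            # Only the user's parameters move; the item side stays untouched.
            b += lr * (err - reg * b)
            p += lr * (err * item_factors[i] - reg * p)
    # Predicted ratings for every item, rated or not.
    return global_mean + b + item_bias + item_factors @ p

# Toy item factors: items 0 and 1 share one trait, items 2 and 3 another.
item_factors = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
item_bias = np.zeros(4)
new_user = np.array([5.0, np.nan, 1.0, np.nan])
new_user_predictions = fold_in_user(new_user, item_factors, item_bias, global_mean=3.0)
```

Because only one short vector is trained, the cost scales with the handful of ratings the user entered rather than with the whole ratings matrix.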
K is the number of ratings the user must input in order to receive K recommendations. I was able to calculate the RMSE for the predicted ratings of the top K movies the matrix factorization algorithm returns against the actual ratings given by the user. I did this by keeping only K random ratings in the user's ratings vector and marking the rest as unrated (NaN). Then, using the matrix factorization algorithm, I got predictions for the other items and calculated the RMSE for the top K. I then averaged the RMSE values for each value of K over 10 randomly selected users. Keeping user convenience in mind, I picked K from 3, 5, and 7.
K = 5 is optimal, at least for my dataset.
(The orange line is the average RMSE for K. The blue line is a typical example.)
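
The hold-out procedure described above could be sketched like this. It is not the code in `determine_k.py`: `rmse_top_k` is a hypothetical name, `predict` is any function that maps a masked ratings vector to predictions for all items, and the mean-based predictor in the usage lines is only a stand-in for the matrix factorization model.

```python
import numpy as np

def rmse_top_k(user_ratings, predict, k, rng):
    """Keep k random ratings, predict the rest, and score the top-k predictions."""
    rated = np.flatnonzero(~np.isnan(user_ratings))
    kept = rng.choice(rated, size=k, replace=False)
    masked = np.full_like(user_ratings, np.nan)  # everything unrated...
    masked[kept] = user_ratings[kept]            # ...except the k kept ratings
    preds = predict(masked)
    held_out = np.setdiff1d(rated, kept)         # items with a known true rating
    top = held_out[np.argsort(preds[held_out])[::-1][:k]]
    return float(np.sqrt(np.mean((preds[top] - user_ratings[top]) ** 2)))

# Stand-in predictor: every item gets the mean of the kept ratings.
def mean_predictor(masked):
    return np.full(masked.shape, np.nanmean(masked))

rng = np.random.default_rng(0)
user = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 4.0, 5.0, 3.0, 2.0, 4.0, 1.0, 5.0, 3.0, 4.0])
rmse_by_k = {k: rmse_top_k(user, mean_predictor, k, rng) for k in (3, 5, 7)}
```

Averaging `rmse_top_k` over several users for each K gives one curve per K to compare. The function assumes the user has rated more than K items; otherwise nothing is left to score.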
Open the Heroku link. Rate movies individually, i.e. insert a number from 1 to 5 in the text box and click "Rate!". Wait for the page to refresh, then continue with the next rating. Please note that the app can't take multiple values at once, so each movie must be rated one at a time. After five inputs, the application takes you to the Recommendation page.
Run the application locally using `python3 app.py`.
The web application was made using `flask`. Communication with the MongoDB database uses 'POST' and 'GET' requests through `pymongo`.
On the front end, due to a logic bug, null objects are inserted into the database when a movie is rejected. If such a movie is recommended by the application, it appears as a `()` on the page.
There are no directories apart from `templates`, used to store the HTML files, and `mf_plots`, which stores the RMSE plots over values of K, the number of items rated by a user.
Main components:
- `collab_filter.py`: All Matrix Factorization and Collaborative Filtering algorithms.
- `app.py`: The main web application; prints debug notes in the console. Run the application locally using `python3 app.py`.
- `determine_k.py`: Used for computing the optimal K value.
- `imdb2mongodb_scraper.py`: Used for building the MongoDB database on `mlab`.
- `get_opt_matrices.py`: Used to compute `optimized_item_matrix` and `optimized_item_bias`.
- `Python 3.7`: Heroku uses the latest version of python.
- `pymongo`: To communicate with the MongoDB database hosted at `mlab` and handle all requests.
- `beautifulsoup4`: To use for web scraping.
- `flask`: Front-end and back-end related interfaces.
- `numpy`: To process matrix and vector operations.
- Equations and update rules: https://stanford.edu/~rezab/classes/cme323/S16/projects_reports/baalbaki.pdf
- References for implementation: https://blog.insightdatascience.com/explicit-matrix-factorization-als-sgd-and-all-that-jazz-b00e4d9b21ea#intromf , http://www.albertauyeung.com/post/python-matrix-factorization/
- The equations for getting similarity: https://en.wikipedia.org/wiki/Collaborative_filtering
- References for implementation: https://www.ethanrosenthal.com/2015/11/02/intro-to-collaborative-filtering/
- Web-Scraping tutorial with Beautiful Soup: https://www.dataquest.io/blog/web-scraping-beautifulsoup/
- Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- Mongo Shell: https://docs.mongodb.com/manual/reference/mongo-shell/
- MongoDB Python API: https://api.mongodb.com/python/current/
- Usage: https://docs.mongodb.com/manual/