This project implements a movie recommender using a variety of recommendation systems (NMF, ALS and Cosine Similarity).
For this, the popular MovieLens dataset is used (the largest 25M version, reduced to about 1.1M ratings from 2019).
Flask is used to build a web app that gives users personalized recommendations and basic movie information.
Apache Spark, a cluster-computing framework for very large amounts of data, is used to run the ALS (Alternating Least Squares) model. NMF (Non-Negative Matrix Factorization) and Cosine Similarity are implemented with Scikit-Learn.
Dash is used to visualize rating information (average rating and total ratings over time, ratings per user).
This project started as a great group project with Julia and Alvaro during SPICED's Data Science program and is now continued solo.
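As a rough illustration of the cosine-similarity approach mentioned above, the sketch below ranks unseen movies for a user by how similar their rating vectors are to movies the user has already rated. It is written in plain Python to stay dependency-free, whereas the project itself relies on Scikit-Learn. The toy `ratings` data and the function names `movie_vectors`, `cosine` and `recommend` are hypothetical and not taken from the project code.

```python
from math import sqrt

# Hypothetical toy data: ratings[user][movie] = rating (not from MovieLens).
ratings = {
    "u1": {"A": 5.0, "B": 4.0, "C": 1.0},
    "u2": {"A": 4.0, "B": 5.0},
    "u3": {"C": 5.0, "D": 4.0},
    "u4": {"A": 1.0, "C": 4.0, "D": 5.0},
}


def movie_vectors(ratings):
    """Turn user -> {movie: rating} into movie -> {user: rating}."""
    vectors = {}
    for user, rated in ratings.items():
        for movie, rating in rated.items():
            vectors.setdefault(movie, {})[user] = rating
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(user, ratings, top_n=2):
    """Rank unseen movies by similarity-weighted ratings of seen movies."""
    vectors = movie_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate in vectors:
        if candidate in seen:
            continue
        scores[candidate] = sum(
            cosine(vectors[candidate], vectors[movie]) * rating
            for movie, rating in seen.items()
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend("u2", ratings))
```

Movies that share no raters with anything the user has seen get a score of zero and end up at the bottom of the ranking.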
Become more familiar with recommendation systems and big data management.
Companies like Netflix or Amazon prominently use recommendation systems to generate more user-oriented suggestions.
Non-Negative Matrix Factorization and Collaborative Filtering are two techniques used by recommender systems, and both are implemented in this project.
The movie recommender is based on the latest dataset released by MovieLens in December 2019, which includes 25 million ratings.
All data is stored in a Postgres database on AWS RDS.
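To make the idea behind Non-Negative Matrix Factorization more concrete, here is a minimal sketch that factorizes a small user-by-movie rating matrix R into two non-negative matrices W and H using the classic multiplicative update rules. It is a plain-Python illustration only: the project uses Scikit-Learn, whose solver may work differently. The toy matrix is made up, unrated entries are simply treated as 0 (a simplifying assumption), and the helper names are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(r, w, h):
    """Squared reconstruction error ||R - WH||^2."""
    approx = matmul(w, h)
    return sum((x - y) ** 2
               for r_row, a_row in zip(r, approx)
               for x, y in zip(r_row, a_row))


def nmf(r, k, iterations=200, seed=0, eps=1e-9):
    """Factorize r (m x n) into w (m x k) and h (k x n), all non-negative."""
    rng = random.Random(seed)
    w = [[rng.random() for _ in range(k)] for _ in r]
    h = [[rng.random() for _ in r[0]] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T R) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, r), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps)
              for j in range(len(h[0]))] for i in range(k)]
        # W <- W * (R H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(r, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps)
              for j in range(k)] for i in range(len(w))]
    return w, h


# Hypothetical ratings: rows are users, columns are movies, 0 = not rated.
R = [
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
]

W, H = nmf(R, k=2)
predicted = matmul(W, H)  # predicted[u][m] estimates user u's rating of movie m
```

The entries of `predicted` at positions where R is 0 can then be read as estimated ratings for movies a user has not seen yet.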
Data was reduced to ratings from 2019 (~1.15 million ratings).
Example SQL query (get average rating and total ratings for movies in 2019):
SELECT
    movies.movie_id,
    movies.title,
    movies.genres,
    AVG(ratings_2019.rating) AS avg_rating,
    COUNT(ratings_2019.user_id) AS total_ratings
FROM
    movies
JOIN
    (
        SELECT
            ratings.user_id,
            ratings.movie_id,
            ratings.rating
        FROM ratings
        WHERE ratings.rating_timestamp >= '2019-01-01'
    )
    AS ratings_2019
ON
    ratings_2019.movie_id = movies.movie_id
GROUP BY
    movies.movie_id, movies.title, movies.genres
ORDER BY
    total_ratings DESC, avg_rating DESC;
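To try the query above without access to the Postgres instance, the following sketch runs it against an in-memory SQLite database using Python's built-in `sqlite3` module. This is only a stand-in: the column types, the ISO-formatted text timestamps and the sample rows are assumptions made for illustration, and the real schema on AWS RDS may differ.

```python
import sqlite3

QUERY = """
SELECT movies.movie_id, movies.title, movies.genres,
       AVG(ratings_2019.rating) AS avg_rating,
       COUNT(ratings_2019.user_id) AS total_ratings
FROM movies
JOIN (
    SELECT ratings.user_id, ratings.movie_id, ratings.rating
    FROM ratings
    WHERE ratings.rating_timestamp >= '2019-01-01'
) AS ratings_2019
  ON ratings_2019.movie_id = movies.movie_id
GROUP BY movies.movie_id, movies.title, movies.genres
ORDER BY total_ratings DESC, avg_rating DESC;
"""

conn = sqlite3.connect(":memory:")

# Assumed schema; only the table and column names come from the query.
conn.executescript("""
CREATE TABLE movies (movie_id INTEGER PRIMARY KEY, title TEXT, genres TEXT);
CREATE TABLE ratings (
    user_id INTEGER, movie_id INTEGER, rating REAL, rating_timestamp TEXT
);
""")

# Made-up sample rows, not MovieLens data.
conn.executemany("INSERT INTO movies VALUES (?, ?, ?)", [
    (1, "Movie A", "Drama"),
    (2, "Movie B", "Comedy"),
])
conn.executemany("INSERT INTO ratings VALUES (?, ?, ?, ?)", [
    (10, 1, 4.0, "2019-03-01"),
    (11, 1, 5.0, "2019-07-15"),
    (10, 2, 3.0, "2019-02-10"),
    (12, 2, 1.0, "2018-12-31"),  # before 2019, so the subquery drops it
])

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Because the subquery filters before the join, the 2018 rating affects neither the average nor the count for "Movie B".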
Movies on the landing page were selected based on how popular they were in 2019, how representative they are of a certain genre, and how polarising they are among viewers.
Planet Earth II, for example, received the highest average score among all movies that were rated at least 250 times.
Twilight, on the other hand, is the most polarising movie of 2019.
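One possible way to compute such rankings is sketched below. It assumes that "polarising" means a high standard deviation of ratings, which is a guess and not necessarily the measure used in this project. The titles and ratings are invented, `min_ratings` mirrors the 250-rating threshold mentioned above, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def rating_stats(ratings_by_movie, min_ratings=250):
    """Average, spread and count per movie, skipping rarely rated movies."""
    stats = {}
    for title, ratings in ratings_by_movie.items():
        if len(ratings) >= min_ratings:
            stats[title] = {
                "avg": mean(ratings),
                "spread": pstdev(ratings),  # assumed polarisation measure
                "count": len(ratings),
            }
    return stats


def top_rated(stats):
    """Title with the highest average rating."""
    return max(stats, key=lambda title: stats[title]["avg"])


def most_polarising(stats):
    """Title whose ratings vary the most."""
    return max(stats, key=lambda title: stats[title]["spread"])


# Invented toy data; a lower threshold is used because the lists are tiny.
toy = {
    "Consensus Hit": [5.0, 4.5, 5.0, 4.5],
    "Divisive Film": [5.0, 0.5, 5.0, 0.5],
    "Obscure Gem": [5.0, 5.0],
}
toy_stats = rating_stats(toy, min_ratings=3)
print(top_rated(toy_stats), "|", most_polarising(toy_stats))
```

The minimum-count filter keeps movies with only a handful of ratings, such as "Obscure Gem" here, from topping the list on a few perfect scores.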
An exploratory data analysis is performed with pandas here.
An additional analysis is done with pyspark here.
Good overview for pyspark: https://runawayhorse001.github.io/LearningApacheSpark/pyspark.pdf
Holden Karau's Videos on Spark: https://www.youtube.com/playlist?list=PLRLebp9QyZtaoIpE2iaF3Q8itJOcdgYoX
- Add tests
- Implement a Dash dashboard for basic information on the ratings in the dataset
- Implement Spark UDFs instead of regular Python functions in the ALS recommender to potentially improve speed
- Try out the Spark script on a cloud cluster with spark-submit
- Make the web app accessible on Heroku or GitHub