My final project for an ML course: a recommender system using content-based (CB) and collaborative filtering (CF; user-user and matrix factorization) approaches

MMDPROJECT/recommender-system

title: recommender-system
emoji: 🚀
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags: streamlit
pinned: false
python_version: 3.12.3
app_file: app/main.py
short_description: Streamlit template space

Recommender System

Machine Learning - Final Project

Introduction

This project implements several techniques for recommending movies to users. They range from a simple baseline that ranks movies by their IMDB weighted rating to more sophisticated algorithms: Content-Based filtering, K-NN User-User Collaborative Filtering, and Matrix Factorization Collaborative Filtering via Surprise [1].

Installation & How to Use:

You can try the app in my live Hugging Face Space, or run it locally by following the steps below:

  1. Clone the project: git clone https://huggingface.co/spaces/MMDPROJECT/recommender-system
  2. Install the Python dependencies: pip3 install -r requirements.txt
  3. Run the app using Streamlit: streamlit run app/main.py

Data

Data Source:

We used two main data sources in this work. The first contains the metadata, keywords, and cast/crew information for the movies in our project; the second is a large dataset of user ratings for those movies.
The first part was gathered from the extensive TMDB movie dataset [2], and the second was provided by the MovieLens [3] data science group. Both can be downloaded in one place from the Kaggle page in [4].

Data Orientation:

The following table summarises the datasets that we will work with.

| Dataset Name | Summary | Columns |
|---|---|---|
| movies_metadata.csv | Contains metadata related to each of the movies from TMDB. It has 45466 entries. | belongs_to_collection, genres, id (tmdbId), imdb_id, overview, tagline, title, vote_average (in TMDB), vote_count (in TMDB) |
| credits.csv | Contains data about the cast/crew members in the movie. It has 45476 entries. | cast, crew, id (tmdbId) |
| keywords.csv | Contains keywords for each movie. It has 46419 entries. | keywords, id (tmdbId) |
| ratings.csv | Contains ratings that different users have given to movies. It has 100004 entries. | userId, movieId (id in MovieLens), rating, timestamp |
| links.csv | Contains tmdbId and MovieLensId for each movie. It has 9125 entries. | movieId (MovieLensId), imdbId, tmdbId |

Data Preprocessing

Single Dataset Preprocessing:

Movies Dataset

  • Clean IMDb IDs by removing the 'tt' prefix.
  • Drop specific invalid rows (indices: 19730, 29502, 35585).
  • Drop rows with missing IMDb or TMDB IDs.
  • Convert TMDB and IMDb IDs to integers.
  • Extract genres from JSON strings.
  • Extract collection names from JSON strings.
  • Drop duplicate rows based on TMDB ID.
  • Fill missing values:
    • vote_count, vote_average → median
    • overview, tagline, belongs_to_collection → empty string
  • Total rows removed: 52
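The ID cleaning and name extraction steps above can be sketched as follows. This is a minimal, dependency-free illustration rather than the project's actual code: the helper names `clean_imdb_id` and `extract_names` are hypothetical, and it assumes the raw columns hold Python-literal strings such as `[{'id': 16, 'name': 'Animation'}]`, which is why `ast.literal_eval` is used instead of a strict JSON parser.

```python
import ast


def clean_imdb_id(raw):
    """Strip the 'tt' prefix and return an int, or None if the ID is missing or invalid."""
    if not isinstance(raw, str):
        return None
    digits = raw[2:] if raw.startswith("tt") else raw
    return int(digits) if digits.isdigit() else None


def extract_names(raw):
    """Parse a stringified list of {'id': ..., 'name': ...} dicts into a tuple of names."""
    try:
        items = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return ()  # unparsable or missing cell: treat as "no entries"
    if not isinstance(items, list):
        return ()
    return tuple(item["name"] for item in items if isinstance(item, dict) and "name" in item)


# Hypothetical sample row, shaped like a record from movies_metadata.csv.
row = {"imdb_id": "tt0114709", "genres": "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"}
cleaned = {"imdb_id": clean_imdb_id(row["imdb_id"]), "genres": extract_names(row["genres"])}
```

Returning tuples keeps the extracted values hashable, which matches the later step that fills missing keywords and cast/crew with empty tuples.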

Keywords Dataset

  • Extract keyword names from JSON strings.
  • Drop duplicate rows based on TMDB ID.
  • Total rows removed: 987

Credits Dataset

  • Extract cast member names from JSON strings.
  • Extract crew member names from JSON strings.
  • Drop duplicate rows based on TMDB ID.
  • Total rows removed: 44

Ratings Dataset

  • Filter out ratings below a minimum threshold (default: 3.0).
  • Drop duplicate rating entries.
  • Total rows removed: 17834

Links Dataset

  • Drop rows with missing TMDB IDs.
  • Convert TMDB ID column to integers.
  • Drop duplicate rows based on TMDB ID.
  • Total rows removed: 13

Multiple Dataset Preprocessing (Data Aggregation):

Ratings Dataset

  • Merge ratings with links dataset on movieId.
    • Filter ratings to only include movies present in the movies dataset.
    • Total rows removed during filtering: 111

Movies Dataset

  • Merge movies with keywords dataset on tmdbId (left join).
    • Replace missing keywords with empty tuples.
    • Unmatched rows (no keywords): 1
  • Merge movies with credits dataset on tmdbId (left join).
    • Replace missing cast/crew with empty tuples.
    • Total rows removed during aggregation: 0
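The left joins described above behave like the following sketch. It is written in plain Python so it runs without dependencies; with pandas the same step would typically be a `DataFrame.merge(..., how="left")`. The function name `left_join` and the sample rows are hypothetical.

```python
def left_join(left_rows, right_rows, key, fill):
    """Left-join two lists of dicts on `key`.

    Every left row is kept. Rows with no match on the right receive the
    default values in `fill`. Returns the joined rows and the unmatched count.
    """
    right_by_key = {row[key]: row for row in right_rows}
    joined, unmatched = [], 0
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is None:
            unmatched += 1
            match = fill
        joined.append({**match, **row})
    return joined, unmatched


# Hypothetical sample data: movie 2 has no keywords entry.
movies = [{"tmdbId": 1, "title": "Movie A"}, {"tmdbId": 2, "title": "Movie B"}]
keywords = [{"tmdbId": 1, "keywords": ("toy", "friendship")}]

merged, unmatched = left_join(movies, keywords, key="tmdbId", fill={"keywords": ()})
```

Because the join is a left join, no movie rows are dropped; unmatched movies simply carry an empty tuple.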

Exploratory Data Analysis

The following figure shows the distribution of genre frequencies, which is skewed:

Cast Member Total Frequency Box-Plot:

The following box plot shows the distribution of how often each cast member appears across movies. The distribution is highly skewed:

Crew Member Total Frequency Box-Plot:

The following box plot shows the distribution of how often each crew member appears across movies. It's highly skewed:

Movie Per Year Line-Plot:

The following line plot shows the number of movies released per year. It's highly skewed:

Movie Vote Count Box-Plot:

The following plot shows the distribution of vote_count in a box-plot. It's highly skewed:

Movie Vote Average Histogram:

The following plot shows the distribution of vote_average in a histogram. It's approximately normal:

User Vote Count Histogram:

The following plot shows the distribution of user vote count in a histogram. It's skewed:

Movie Vote Count Box-Plot (ratings_df):

The following plot shows the distribution of vote counts per movie (from ratings_df) in a box-plot. It's skewed:

Average Rating Per Movie Histogram:

The following plot shows the distribution of average rating per movie (from ratings_df) in a histogram. Its shape is irregular:

Average Rating Per User Histogram:

The following plot shows the distribution of average rating per user (from ratings_df) in a histogram. It's approximately normal:

Note: These skews imply that the recommendations will be biased towards high-frequency genres such as Drama and Comedy. Movies released around 2000 also far outnumber those from other years, which further increases the risk of bias.

Models

Model Summaries:

BaselineWR (Weighted Rating)

  • Concept: Ranks movies using a weighted rating formula that balances a movie's average rating (R) with the global average rating (C), weighted by the movie's vote count (v) relative to a threshold m.
  • How it works:
    1. Compute the global mean rating C and the vote count threshold m (80th percentile of vote counts).
    2. Compute each movie's weighted rating: $ \text{imdbwr} = \frac{v}{v+m}R + \frac{m}{v+m}C $
    3. Sort movies by weighted rating to recommend top-N items.
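The three steps above can be sketched in a few lines. This is an illustrative version only: the function name `weighted_ratings`, the nearest-rank percentile method, and the sample movies are assumptions, not details taken from the project.

```python
import math


def weighted_ratings(movies, quantile=0.80):
    """Return (title, weighted rating) pairs sorted from best to worst.

    Each movie is a dict with 'title', 'vote_average' (R) and 'vote_count' (v).
    """
    counts = sorted(movie["vote_count"] for movie in movies)
    # m: vote-count threshold at the given quantile (nearest-rank method, an assumption).
    rank = max(1, math.ceil(quantile * len(counts)))
    m = counts[rank - 1]
    # C: global mean of the per-movie average ratings.
    c = sum(movie["vote_average"] for movie in movies) / len(movies)

    scored = []
    for movie in movies:
        v, r = movie["vote_count"], movie["vote_average"]
        # Fall back to C when a movie has no votes and the threshold is zero.
        wr = v / (v + m) * r + m / (v + m) * c if (v + m) else c
        scored.append((movie["title"], wr))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data.
MOVIES = [
    {"title": "A", "vote_average": 9.0, "vote_count": 10},
    {"title": "B", "vote_average": 8.0, "vote_count": 1000},
    {"title": "C", "vote_average": 6.0, "vote_count": 500},
    {"title": "D", "vote_average": 7.0, "vote_count": 50},
    {"title": "E", "vote_average": 5.0, "vote_count": 200},
]
top_3 = [title for title, _ in weighted_ratings(MOVIES)[:3]]
```

In this sample, movie A has the highest raw average but only 10 votes, so its score is pulled towards C and it ranks below the heavily voted movie B. This is the behaviour the formula is designed to produce.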

CBRecommender (Content-Based Filtering)

  • Concept: Recommends movies based on similarity between movie features and a user’s preferences.
  • How it works:
    1. Extract features from movies (text, genres, keywords, cast, crew, collections) and weight them differently.
    2. Reduce feature dimensionality using Truncated SVD.
    3. Build a user profile by averaging the features of the movies the user has rated or, for cold-start users, from the genres or movie IDs they select.
    4. Compute cosine similarity between the user profile and all movies to rank recommendations.
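The profile-building and ranking steps can be illustrated with a simplified sketch. It deliberately leaves out the text features and the Truncated SVD step, working directly on sparse weighted vectors of categorical features. The field weights, function names, and sample movies are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical per-field weights; the project's actual weights are not given here.
FIELD_WEIGHTS = {"genres": 2.0, "keywords": 1.0, "cast": 0.5}


def movie_vector(movie):
    """Build a sparse weighted feature vector from a movie's categorical fields."""
    vec = Counter()
    for field, weight in FIELD_WEIGHTS.items():
        for token in movie.get(field, ()):
            vec[f"{field}:{token}"] += weight
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recommend(movies, liked_ids, top_n=5):
    """Average the liked movies' vectors into a profile, then rank the other movies."""
    vectors = {movie_id: movie_vector(movie) for movie_id, movie in movies.items()}
    profile = Counter()
    for movie_id in liked_ids:
        profile.update(vectors[movie_id])  # Counter.update adds the weights
    profile = {key: value / len(liked_ids) for key, value in profile.items()}
    scores = [
        (movie_id, cosine(profile, vec))
        for movie_id, vec in vectors.items()
        if movie_id not in liked_ids
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical sample catalogue keyed by movie ID.
CATALOGUE = {
    1: {"genres": ("Animation", "Comedy"), "keywords": ("toy",)},
    2: {"genres": ("Animation", "Comedy"), "keywords": ("toy", "sequel")},
    3: {"genres": ("Crime", "Thriller"), "keywords": ("heist",)},
}
recommendations = recommend(CATALOGUE, liked_ids=[1], top_n=2)
```

A user who liked movie 1 gets movie 2 ranked first because the two share genres and a keyword, while movie 3 shares no features and scores zero.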

UserUserCF (User-User Collaborative Filtering)

  • Concept: Finds similar users based on rating patterns and recommends items those similar users liked.
  • How it works:
    1. Construct a user-item rating matrix.
    2. Compute pairwise cosine similarity between users.
    3. For a target user, find top-K most similar users and predict scores as weighted average of their ratings.
    4. Exclude items the user has already rated and recommend the top-N items.
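A compact sketch of these four steps is shown below. It stores the user-item matrix as nested dicts instead of a dense array, and the function names and sample ratings are hypothetical, so treat it as an illustration of the technique rather than the project's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two users' rating dicts (missing items count as 0)."""
    dot = sum(u[item] * v[item] for item in set(u) & set(v))
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def recommend(ratings, target, k=2, top_n=2):
    """Predict scores for unseen items as a similarity-weighted average over the top-k neighbours."""
    neighbours = sorted(
        ((cosine(ratings[target], other), user) for user, other in ratings.items() if user != target),
        reverse=True,
    )[:k]

    weighted_sum, sim_sum = {}, {}
    for sim, user in neighbours:
        if sim <= 0:
            continue  # ignore neighbours with no overlap
        for item, rating in ratings[user].items():
            if item in ratings[target]:
                continue  # exclude items the target user has already rated
            weighted_sum[item] = weighted_sum.get(item, 0.0) + sim * rating
            sim_sum[item] = sim_sum.get(item, 0.0) + sim

    scores = {item: weighted_sum[item] / sim_sum[item] for item in weighted_sum}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical sample ratings: user -> {item: rating}.
RATINGS = {
    "alice": {"A": 5, "B": 4},
    "bob": {"A": 5, "B": 4, "C": 5, "D": 3},
    "carol": {"A": 4, "B": 5, "C": 4},
    "dave": {"D": 5},
}
recommendations = recommend(RATINGS, "alice", k=2, top_n=2)
```

Here alice's two nearest neighbours are carol and bob; item C is scored from both of them, while item D is scored from bob alone.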

SurpriseSVDWrapper (Matrix Factorization with SVD)

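  • Concept: Factorizes the user-item rating matrix into latent factors for users and items using the SVD algorithm from Surprise, a matrix factorization model fitted to the observed ratings only rather than a full Singular Value Decomposition of the matrix.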
  • How it works:
    1. Train the SVD model on observed ratings.
    2. Learn latent user and item vectors capturing hidden preferences/features.
    3. Predict ratings by computing the dot product of the user and item latent vectors (the SVD algorithm in Surprise also adds the global mean and user/item bias terms).
    4. Recommend top-N items with highest predicted ratings not yet rated by the user.
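To make the training and prediction steps concrete, here is a from-scratch sketch of biased matrix factorization trained with stochastic gradient descent. It does not use the Surprise API; it is a small stand-in for the same family of model, and the hyperparameters, function names, and sample ratings are assumptions chosen for illustration.

```python
import random


def predict(model, user, item):
    """Predicted rating: global mean + user bias + item bias + dot product of latent vectors."""
    mu, bu, bi, p, q = model
    return mu + bu[user] + bi[item] + sum(pf * qf for pf, qf in zip(p[user], q[item]))


def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.02, reg=0.02, seed=0):
    """Fit biases and latent factors on (user, item, rating) triples with SGD."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    model = (mu, bu, bi, p, q)

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(model, u, i)
            # Regularized gradient steps on the biases and on each latent factor.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return model


def recommend(model, ratings, user, top_n=5):
    """Rank the items the user has not rated yet by predicted rating."""
    rated = {i for u, i, _ in ratings if u == user}
    candidates = [i for i in model[2] if i not in rated]
    scored = [(i, predict(model, user, i)) for i in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical sample ratings as (user, item, rating) triples.
RATINGS = [
    ("alice", "A", 5), ("alice", "B", 4),
    ("bob", "A", 5), ("bob", "C", 2),
    ("carol", "B", 4), ("carol", "C", 1), ("carol", "D", 5),
]
model = train_mf(RATINGS)
recommendations = recommend(model, RATINGS, "alice")
```

With a library such as Surprise, the training loop is handled internally and only the fit and predict calls are needed; the sketch simply exposes what those calls optimize.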

Evaluation:

The following figure reports the evaluation metrics at k=10 and k=20. User-User CF performs best when many users have already interacted with the app, while CB performs better for a cold-start user who has marked only a few movies as favourites.
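The text does not name the metrics used, so as an assumption the sketch below shows two common top-k metrics, precision@k and recall@k. The function name and sample data are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Compute precision@k and recall@k for one user.

    recommended: ranked list of item IDs produced by a model.
    relevant: set of held-out items the user actually liked.
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: one of the top-2 recommendations is relevant.
precision, recall = precision_recall_at_k(["A", "B", "C", "D"], {"B", "D", "E"}, k=2)
```

Averaging these values over all test users gives a single score per model and per k, which is how figures at k=10 and k=20 are usually compared.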

Bias & Fairness:

As the EDA section shows, our data is highly skewed towards popular genres, release years, and cast/crew members, so our models are prone to being biased towards these popular items.

References:

[1]: Surprise Library - for Matrix-Factorization Collaborative Filtering

[2]: TMDB - for Gathering Metadata About the Movies

[3]: MovieLens - for Gathering Ratings From a Collection of Users

[4]: The Movies Dataset - the Actual Page in Kaggle Which Collected the Data
