| title | emoji | colorFrom | colorTo | sdk | app_port | tags | pinned | python_version | app_file | short_description |
|---|---|---|---|---|---|---|---|---|---|---|
| recommender-system | 🚀 | red | red | docker | 8501 | | false | 3.12.3 | app/main.py | Streamlit template space |
In this project we implement several techniques for recommending movies to users. They range from a simple recommendation based on the IMDB weighted rating of each movie to more sophisticated algorithms: Content-Based filtering, K-NN User-User Collaborative Filtering, and Matrix Factorization Collaborative Filtering via Surprise [1].
You can try the app in my live Space on Hugging Face. To run it locally instead, follow the steps below:
- Clone the project: `git clone https://huggingface.co/spaces/MMDPROJECT/recommender-system`
- Install the Python dependencies: `pip3 install -r requirements.txt`
- Run the app using Streamlit: `streamlit run app/main.py`
We use two main data sources. The first provides the metadata, keywords, and cast/crew information for the movies in our project; the second is a large dataset of user ratings for those movies.
The first part comes from the extensive TMDB movies dataset [2], and the second is provided by the MovieLens [3] data science group. Both can be downloaded in one place from the Kaggle page [4].
The following table summarises the datasets that we will work with.
| Dataset Name | Summary | Columns |
|---|---|---|
| movies_metadata.csv | Contains metadata related to each of the movies from TMDB. It has 45466 entries. | belongs_to_collection, genres, id (tmdbId), imdb_id, overview, tagline, title, vote_average (in TMDB), vote_count (in TMDB) |
| credits.csv | Contains data about the cast/crew members in the movie. It has 45476 entries. | cast, crew, id (tmdbId) |
| keywords.csv | Contains keywords for each movie. It has 46419 entries. | keywords, id (tmdbId) |
| ratings.csv | Contains ratings that different users have given to movies. It has 100004 entries. | userId, movieId (id in MovieLens), rating, timestamp |
| links.csv | Contains tmdbId, and MovieLensId for each movie. It has 9125 entries. | movieId (MovieLensId), imdbId, tmdbId |
- Clean IMDb IDs by removing the 'tt' prefix.
- Drop specific invalid rows (indices: 19730, 29502, 35585).
- Drop rows with missing IMDb or TMDB IDs.
- Convert TMDB and IMDb IDs to integers.
- Extract genres from JSON strings.
- Extract collection names from JSON strings.
- Drop duplicate rows based on TMDB ID.
- Fill missing values:
  - `vote_count`, `vote_average` → median
  - `overview`, `tagline`, `belongs_to_collection` → empty string
- Total rows removed: 52
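The cleaning steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual code: the helper names `extract_names` and `clean_movies` are hypothetical, the genre strings are assumed to be Python-literal lists of dicts (hence `ast.literal_eval`; use `json.loads` if your copy is strict JSON), and the collection extraction and the index-based row drops are left out.

```python
import ast
import pandas as pd

def extract_names(raw):
    """Return the 'name' fields from a stringified list of dicts as a tuple."""
    if not isinstance(raw, str):
        return ()
    try:
        items = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return ()
    if not isinstance(items, list):
        return ()
    return tuple(d["name"] for d in items if isinstance(d, dict) and "name" in d)

def clean_movies(df):
    """Hypothetical cleaning pass over a movies_metadata-like frame."""
    df = df.copy()
    # Remove the 'tt' prefix from IMDb IDs.
    df["imdb_id"] = df["imdb_id"].str.replace(r"^tt", "", regex=True)
    # Coerce both ID columns to numbers; unparsable values become NaN.
    for col in ("id", "imdb_id"):
        df[col] = pd.to_numeric(df[col], errors="coerce")
    # Drop rows with missing IDs, then store the IDs as integers.
    df = df.dropna(subset=["id", "imdb_id"]).astype({"id": int, "imdb_id": int})
    # Turn the stringified genre list into a tuple of genre names.
    df["genres"] = df["genres"].apply(extract_names)
    # Keep one row per TMDB ID.
    df = df.drop_duplicates(subset="id")
    # Numeric gaps get the column median, text gaps an empty string.
    for col in ("vote_count", "vote_average"):
        df[col] = df[col].fillna(df[col].median())
    for col in ("overview", "tagline"):
        df[col] = df[col].fillna("")
    return df.reset_index(drop=True)
```

The same `extract_names` helper would also fit the keyword, cast, and crew columns, assuming they share the list-of-dicts layout.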
- Extract keyword names from JSON strings.
- Drop duplicate rows based on TMDB ID.
- Total rows removed: 987
- Extract cast member names from JSON strings.
- Extract crew member names from JSON strings.
- Drop duplicate rows based on TMDB ID.
- Total rows removed: 44
- Filter out ratings below a minimum threshold (default: 3.0).
- Drop duplicate rating entries.
- Total rows removed: 17834
- Drop rows with missing TMDB IDs.
- Convert TMDB ID column to integers.
- Drop duplicate rows based on TMDB ID.
- Total rows removed: 13
- Merge ratings with links dataset on `movieId`.
- Filter ratings to only include movies present in the movies dataset.
- Total rows removed during filtering: 111
- Merge movies with keywords dataset on `tmdbId` (left join).
- Replace missing keywords with empty tuples.
- Unmatched rows (no keywords): 1
- Merge movies with credits dataset on `tmdbId` (left join).
- Replace missing cast/crew with empty tuples.
- Total rows removed during aggregation: 0
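The merge steps above might look like the following in pandas. This is a sketch under assumptions: the function names `attach_tmdb_ids` and `attach_keywords` are hypothetical, the TMDB ID column of the movies and keywords frames is assumed to be called `id` (as in the dataset table), and the keywords are assumed to have already been converted to tuples.

```python
import pandas as pd

def attach_tmdb_ids(ratings, links, movies):
    """Map MovieLens ratings to TMDB IDs and keep only movies with metadata."""
    merged = ratings.merge(links[["movieId", "tmdbId"]], on="movieId", how="inner")
    # Discard ratings whose TMDB ID is absent from the movies frame.
    keep = merged["tmdbId"].isin(movies["id"])
    return merged[keep].reset_index(drop=True)

def attach_keywords(movies, keywords):
    """Left-join keywords onto movies; unmatched movies get an empty tuple."""
    merged = movies.merge(keywords, on="id", how="left")
    # After a left join, unmatched rows hold NaN instead of a tuple.
    merged["keywords"] = merged["keywords"].apply(
        lambda k: k if isinstance(k, tuple) else ()
    )
    return merged
```

A left join keeps every movie even when no keywords exist, which is why the fill step is needed; the credits merge would follow the same pattern for the cast and crew columns.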
The following figure shows the skewed distribution of genre frequencies:

The following plot shows the distribution of total frequency of cast members in movies in a box plot. It's highly skewed:

The following plot shows the distribution of total frequency of crew members in movies in a box plot. It's highly skewed:

The following plot shows the distribution of total number of released movies per year in a line plot. It's highly skewed:

The following plot shows the distribution of vote_count in a box-plot. It's highly skewed:

The following plot shows the distribution of vote_average in a histogram. It's approximately normal:

The following plot shows the distribution of user vote count in a histogram. It's skewed:

The following plot shows the distribution of movie vote count (from ratings_df) in a box-plot. It's skewed:

The following plot shows the distribution of average rating per movie (from ratings_df) in a histogram. Its shape is irregular:

The following plot shows the distribution of average rating per user (from ratings_df) in a histogram. It's approximately normal:

Note: The implication is that recommendations will be biased towards high-frequency genres such as Drama and Comedy. Movies released around 2000 also far outnumber those from other years, which further increases the risk of bias.
- Concept: Ranks movies using a weighted rating formula that balances a movie's average rating (`R`) with the global average rating (`C`), weighted by the movie's vote count (`v`) relative to a threshold `m`.
- How it works:
- Compute the global mean rating `C` and the vote count threshold `m` (80th percentile of votes).
- Compute each movie's weighted rating: $\text{imdbwr} = \frac{v}{v+m}R + \frac{m}{v+m}C$
- Sort movies by weighted rating to recommend top-N items.
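As an illustration, the weighted rating can be computed in a few lines of pandas. This is a sketch, not the project's code: the function name `weighted_rating` is hypothetical, and the columns `vote_average` and `vote_count` are assumed to play the roles of `R` and `v`.

```python
import pandas as pd

def weighted_rating(movies, quantile=0.80):
    """Score movies with the weighted rating formula and sort best-first."""
    C = movies["vote_average"].mean()             # global mean rating
    m = movies["vote_count"].quantile(quantile)   # vote count threshold
    v = movies["vote_count"]
    R = movies["vote_average"]
    scored = movies.copy()
    # Movies with few votes are pulled towards C, heavily voted ones towards R.
    scored["imdbwr"] = v / (v + m) * R + m / (v + m) * C
    return scored.sort_values("imdbwr", ascending=False)
```

Because of the shrinkage towards `C`, a movie rated 9.0 by two users can rank below a movie rated 8.0 by a thousand users. Note that the score is undefined (NaN) when both `v` and `m` are zero.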
- Concept: Recommends movies based on similarity between movie features and a user’s preferences.
- How it works:
- Extract features from movies (text, genres, keywords, cast, crew, collections) and weight them differently.
- Reduce feature dimensionality using Truncated SVD.
- Build a user profile by averaging features of rated movies or using cold-start genres/movie IDs.
- Compute cosine similarity between the user profile and all movies to rank recommendations.
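The content-based pipeline above can be sketched with NumPy as follows. This is a simplified stand-in rather than the project's implementation: it uses a plain truncated `numpy.linalg.svd` in place of a dedicated Truncated SVD class, skips the per-feature weighting, and the names `reduce_features` and `recommend_content` are hypothetical.

```python
import numpy as np

def reduce_features(X, k):
    """Project each row of X onto the top-k singular directions (truncated SVD)."""
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * S[:k]

def recommend_content(Z, liked_idx, top_n=2):
    """Rank movies by cosine similarity to the mean vector of the liked movies."""
    profile = Z[liked_idx].mean(axis=0)           # user profile
    norms = np.linalg.norm(Z, axis=1) * np.linalg.norm(profile)
    sims = Z @ profile / np.where(norms == 0, 1.0, norms)
    sims[liked_idx] = -np.inf                     # do not recommend liked movies again
    return np.argsort(-sims)[:top_n]
```

Here `X` is a movies-by-features matrix (for example, bag-of-words counts over genres and keywords) and `liked_idx` is a list of row indices for the movies the user liked. For a cold-start user, the profile could be built from the rows of a few chosen movies in the same way.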
- Concept: Finds similar users based on rating patterns and recommends items those similar users liked.
- How it works:
- Construct a user-item rating matrix.
- Compute pairwise cosine similarity between users.
- For a target user, find top-K most similar users and predict scores as weighted average of their ratings.
- Exclude items the user has already rated and recommend the top-N items.
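A minimal NumPy sketch of the user-user steps above is shown below. It makes a few assumptions that the text does not spell out: unrated items are stored as 0, the weighted average only counts neighbours who actually rated the item, and the function name `user_user_recommend` is hypothetical.

```python
import numpy as np

def user_user_recommend(R, user, k=2, top_n=2):
    """R is a users x items float matrix with 0 meaning 'not rated'."""
    norms = np.linalg.norm(R, axis=1)
    norms[norms == 0] = 1.0                       # avoid division by zero
    sims = (R @ R[user]) / (norms * norms[user])  # cosine similarity to every user
    sims[user] = -np.inf                          # the user is not their own neighbour
    neighbours = np.argsort(-sims)[:k]
    w = sims[neighbours][:, None]                 # similarity weights, shape (k, 1)
    rated = R[neighbours] > 0
    denom = (w * rated).sum(axis=0)
    # Similarity-weighted average of the neighbours' ratings per item.
    scores = (w * R[neighbours]).sum(axis=0) / np.where(denom == 0, 1.0, denom)
    scores[R[user] > 0] = -np.inf                 # exclude already rated items
    return np.argsort(-scores)[:top_n]
```

Items that none of the neighbours rated end up with a score of 0, so they can still appear at the tail of the list when `top_n` is larger than the number of scored items.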
- Concept: Factorizes the user-item rating matrix into latent factors for users and items using Singular Value Decomposition (SVD).
- How it works:
- Train the SVD model on observed ratings.
- Learn latent user and item vectors capturing hidden preferences/features.
- Predict ratings by computing dot product of user and item latent vectors.
- Recommend top-N items with highest predicted ratings not yet rated by the user.
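To make the idea concrete, here is a small NumPy sketch of matrix factorization trained with stochastic gradient descent. It is not the Surprise API and not the project's code: it is a hand-rolled stand-in in the same spirit (global mean, user and item biases, latent vectors), and the names `train_mf`, `predict_all`, and `recommend_mf` as well as all hyperparameter values are assumptions.

```python
import numpy as np

def train_mf(ratings, n_users, n_items, n_factors=2, lr=0.02, reg=0.02,
             epochs=300, seed=0):
    """ratings is a list of (user_index, item_index, rating) triples."""
    rng = np.random.default_rng(seed)
    P = rng.normal(0, 0.1, (n_users, n_factors))  # latent user vectors
    Q = rng.normal(0, 0.1, (n_items, n_factors))  # latent item vectors
    bu, bi = np.zeros(n_users), np.zeros(n_items) # user and item biases
    mu = float(np.mean([r for _, _, r in ratings]))
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            pu = P[u].copy()                      # keep the old value for Q's update
            P[u] += lr * (err * Q[i] - reg * pu)
            Q[i] += lr * (err * pu - reg * Q[i])
    return mu, bu, bi, P, Q

def predict_all(model, user):
    """Predicted rating of every item for one user."""
    mu, bu, bi, P, Q = model
    return mu + bu[user] + bi + Q @ P[user]

def recommend_mf(model, user, seen, top_n=2):
    """Top-N items by predicted rating, skipping items the user already rated."""
    scores = predict_all(model, user)
    scores[list(seen)] = -np.inf
    return np.argsort(-scores)[:top_n]
```

The predicted rating is the dot product of the user and item vectors plus the bias terms; `seen` is the set of item indices the user has already rated.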
According to the following figure, which reports the evaluation metrics at k=10 and k=20, User-User CF performs best when many users have already interacted with the app, while CB performs better for a cold-start user who has marked only a few movies as favourites.
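The text does not name the metrics behind these results. As one plausible example of a metric "at k", the sketch below computes precision@k and recall@k for a single user; treating these as the metrics used here is an assumption, and the function name `precision_recall_at_k` is hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """recommended: ranked list of item IDs; relevant: set of held-out liked items."""
    top_k = list(recommended)[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0              # share of the top-k that is relevant
    recall = hits / len(relevant) if relevant else 0.0  # share of relevant items found
    return precision, recall
```

Averaging these values over all test users at k=10 and k=20 would give one number per model and cutoff, which is the kind of comparison the figure reports.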
Our EDA shows that the data is highly skewed towards popular genres, release years, and cast/crew members, so our models are prone to bias towards these popular items.
[1]: Surprise Library - for Matrix-Factorization Collaborative Filtering
[2]: TMDB - for Gathering Metadata About the Movies
[3]: MovieLens - for Gathering Ratings From a Collection of Users
[4]: The Movies Dataset - the Actual Page in Kaggle Which Collected the Data
