End-to-end analysis of the Netflix catalogue — EDA, genre analysis, a collaborative-filtering recommender, and IMDb rating prediction.
netflix-analysis/
├── netflix_analysis.py # Python implementation (self-contained, demo data)
├── netflix_analysis.R # Cleaned R script (requires real datasets)
├── Outputs/ # All generated plots
└── README.md
pip install pandas numpy matplotlib seaborn scikit-learn wordcloud scipy
python netflix_analysis.py
# Datasets needed in working directory:
# netflix_titles.csv · IMDb ratings.csv · IMDb movies.csv
source("netflix_analysis.R")

| # | Section | Key Output |
|---|---|---|
| 1 | Data Loading & Merging | Netflix + IMDb joint dataset |
| 2 | Preprocessing | Deduplication, mode-filling, date parsing |
| 3 | EDA & Visualisations | 9 plots (see below) |
| 4 | Recommendation System | IBCF top-10 per user |
| 5 | Predictive Modelling | Linear Regression + Gradient Boosting |
| 6 | TF-IDF Text Analysis | Top terms per title |
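
The preprocessing steps named in the table (deduplication, mode-filling, date parsing) might look roughly like this in pandas. This is a minimal sketch, not the script's actual code: the column names `show_id`, `country`, `rating` and `date_added` are assumed to follow the public Kaggle `netflix_titles.csv` layout, `preprocess` is a hypothetical helper, and the inline rows are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration; column names are assumed to match
# the Kaggle netflix_titles.csv layout.
raw = pd.DataFrame({
    "show_id": ["s1", "s2", "s2", "s3", "s4"],
    "country": ["India", "United States", "United States", None, "United States"],
    "rating": ["TV-MA", "TV-14", "TV-14", "TV-MA", None],
    "date_added": ["September 25, 2021", "January 1, 2020", "January 1, 2020",
                   " March 3, 2019", None],
})

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate, fill categorical gaps with the mode, and parse dates."""
    out = df.drop_duplicates(subset="show_id").copy()
    for col in ("country", "rating"):
        # mode() can return several values; take the first one.
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    # Strip stray whitespace, then parse; unparseable values become NaT.
    out["date_added"] = pd.to_datetime(
        out["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
    )
    out["year_added"] = out["date_added"].dt.year
    return out.reset_index(drop=True)

clean = preprocess(raw)
print(clean)
```

Missing dates are left as `NaT` rather than filled, since a mode-filled date would distort any year-over-year counts.
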
70 % Movies vs 30 % TV Shows on the platform.
United States dominates, followed by India and the United Kingdom.
Rapid library expansion between 2015 and 2020.
TV-MA and TV-14 account for the majority of content.
Drama and Comedy are the most frequent categories.
Indian movies tend to be the longest; South Korean titles the shortest.
Drama + Comedy and Drama + Thriller are the most common genre pairings.
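
Genre pairings like the ones above can be counted by looking at which genres co-occur on the same title. The sketch below is one way to do that, assuming genres arrive as a comma-separated string per title (as in the `listed_in` column of the Kaggle file); `genre_pair_counts` is a hypothetical helper and the sample strings are made up.

```python
from collections import Counter
from itertools import combinations

# Made-up genre strings, formatted like a comma-separated genre column.
listed_in = [
    "Dramas, Comedies",
    "Dramas, Thrillers",
    "Comedies, Dramas, Romantic Movies",
    "Documentaries",
    "Dramas, Thrillers, Crime",
]

def genre_pair_counts(rows):
    """Count how often each unordered pair of genres appears on the same title."""
    counts = Counter()
    for row in rows:
        # Sort so that (A, B) and (B, A) are counted as the same pair.
        genres = sorted({g.strip() for g in row.split(",") if g.strip()})
        counts.update(combinations(genres, 2))
    return counts

pairs = genre_pair_counts(listed_in)
for (a, b), n in pairs.most_common(3):
    print(f"{a} + {b}: {n}")
```
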
Item-Based Collaborative Filtering (IBCF) using cosine similarity on a user × item rating matrix. The model finds the k most similar items to those a user has already rated and surfaces the highest-scoring unseen titles.
Sample recommendations for User 0:
1. Title_257 6. Title_135
2. Title_98 7. Title_179
3. Title_107 8. Title_258
4. Title_96 9. Title_10
5. Title_176 10. Title_285
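
To make the item-based approach concrete, here is a minimal NumPy sketch under stated assumptions: a rating of 0 means "unrated", and each unseen item is scored as the similarity-weighted average of the user's ratings on its k most similar rated items. The function names and the toy matrix are hypothetical, and the script itself may score candidates differently.

```python
import numpy as np

def item_cosine_similarity(ratings: np.ndarray) -> np.ndarray:
    """Cosine similarity between item columns of a user x item matrix."""
    norms = np.linalg.norm(ratings, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for never-rated items
    normalised = ratings / norms
    sim = normalised.T @ normalised
    np.fill_diagonal(sim, 0.0)  # an item should not vote for itself
    return sim

def recommend(ratings: np.ndarray, user: int, k: int = 2, top_n: int = 10) -> list:
    """Return indices of up to top_n unseen items for `user`, best first."""
    sim = item_cosine_similarity(ratings)
    user_ratings = ratings[user]
    rated = np.flatnonzero(user_ratings)
    scores = np.full(ratings.shape[1], -np.inf)
    for item in np.flatnonzero(user_ratings == 0):
        # The k already-rated items most similar to this candidate.
        neighbours = rated[np.argsort(sim[item, rated])[::-1][:k]]
        weights = sim[item, neighbours]
        if weights.sum() > 0:
            scores[item] = weights @ user_ratings[neighbours] / weights.sum()
    ranked = np.argsort(scores)[::-1]
    return [int(i) for i in ranked if np.isfinite(scores[i])][:top_n]

# Toy user x item matrix (rows = users, columns = items, 0 = unrated).
ratings = np.array([
    [5, 4, 0, 0, 1],
    [4, 5, 5, 0, 1],
    [1, 0, 1, 5, 5],
    [0, 1, 0, 4, 5],
], dtype=float)

recs = recommend(ratings, user=0, k=2)
print(recs)
```

User 0 rated items 0 and 1 highly and item 4 poorly. Item 2 is most similar to items 0 and 1, so it ranks ahead of item 3, whose closest neighbour is the low-rated item 4.
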
Two models are trained to predict IMDb weighted_average_vote from metadata features:
| Model | RMSE |
|---|---|
| Linear Regression | 1.23 |
| Gradient Boosting | 1.25 |
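
A comparison of this kind can be sketched with scikit-learn as follows. The features here are synthetic stand-ins (hypothetical duration, release year and log vote count) and the target is simulated, so the RMSE values it prints will not match the table; only the workflow is meant to carry over.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-ins for metadata features (hypothetical: duration in
# minutes, release year, log vote count) and a noisy rating-like target.
n = 500
X = np.column_stack([
    rng.normal(100, 20, n),
    rng.integers(1990, 2021, n),
    rng.normal(8, 2, n),
])
y = 6.0 + 0.01 * (X[:, 0] - 100) + 0.2 * (X[:, 2] - 8) + rng.normal(0, 1.0, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # RMSE = sqrt(mean squared error), taken by hand to stay version-agnostic.
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, pred)))
    print(f"{name}: RMSE = {rmse[name]:.2f}")
```

When boosting does not beat the linear baseline, as in the table above, one possible reading is that the metadata features carry mostly weak or linear signal; tuning or richer features may change that.
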
TF-IDF scores surface the most distinctive words per title, beyond simple frequency.
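
To show why TF-IDF goes beyond raw frequency, the sketch below hand-rolls the classic weighting tf × log(N / df) in plain Python. It is an illustration only: the titles and descriptions are made up, `tfidf` and `top_terms` are hypothetical helpers, and scikit-learn's `TfidfVectorizer` applies smoothed IDF and L2 normalisation by default, so its scores would differ.

```python
import math
from collections import Counter

# Made-up descriptions keyed by hypothetical title names.
descriptions = {
    "Title_A": "a detective hunts a killer in a small town",
    "Title_B": "a chef opens a small restaurant in paris",
    "Title_C": "a detective and a chef solve a mystery",
}

def tfidf(docs):
    """Return {title: {term: score}} using tf * log(N / df)."""
    tokens = {title: text.lower().split() for title, text in docs.items()}
    n_docs = len(tokens)
    # Document frequency: number of titles in which each term appears.
    df = Counter(term for words in tokens.values() for term in set(words))
    scores = {}
    for title, words in tokens.items():
        tf = Counter(words)
        scores[title] = {
            term: (count / len(words)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return scores

def top_terms(scores, n=3):
    """Highest-scoring terms per title, ties broken alphabetically."""
    return {
        title: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
        for title, s in scores.items()
    }

scores = tfidf(descriptions)
print(top_terms(scores))
```

The word "a" is the most frequent token in every description, yet it scores zero because it appears in all three documents; terms unique to one title rise to the top instead.
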
| Component | Python | R |
|---|---|---|
| Data wrangling | pandas, numpy | tidyverse, data.table |
| Visualisation | matplotlib, seaborn | ggplot2, plotly |
| Recommendation | scikit-learn (cosine sim) | recommenderlab (IBCF) |
| ML modelling | scikit-learn (LR + GBM) | caret, gbm |
| Text analysis | scikit-learn TF-IDF | tm, tidytext |

| Dataset | Source |
|---|---|
| Netflix Titles | Kaggle — Netflix Movies and TV Shows |
| IMDb Ratings | Kaggle — IMDb movies extensive dataset |
Note: The Python script ships with synthetic demo data so it runs without any downloads.