This project focuses on building a Recommendation System using real interaction data from IBM's Watson Studio platform. The goal is to recommend articles to users based on their past behavior and similarities between articles or users.
The recommendation system aims to answer the following questions:
- Which articles are most popular overall?
- What should we recommend to a new user?
- What should we recommend to a returning user based on their reading history?
- Can we find users that are most similar to a given user?
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
Make sure you have the following Python libraries installed:
pandas
numpy
sklearn
matplotlib
seaborn
scipy
1- Clone the repository:
git clone url
cd Project-Recommendation-System
2- Open the notebook file in Jupyter Notebook or Jupyter Lab
jupyter notebook Recommendations_with_IBM.ipynb
Some tests are included in the notebook using assertion checks to verify correct implementation of key functions.
- 🔍 Breakdown of Tests User-User Similarity Tests – Check if the most similar users are correctly identified.
Recommendations Tests – Ensure that recommended article IDs match expectations.
Cluster Assignments – Verify that articles are correctly mapped to clusters.
Submission Check – Export notebook and ensure outputs are correctly formatted.
The notebook includes the following sections:
- Exploratory Data Analysis – Understand the structure of the user-item interactions.
- Rank-Based Recommendations – Recommend articles based on popularity.
- User-User Based Collaborative Filtering – Recommend articles based on similar users.
- Content-Based Recommendations – Recommend articles based on clustering (e.g., KMeans).
- Matrix Factorization (SVD) – Use latent features to compute similarity.
- Extras & Conclusion – Build hybrid recommenders and polish results.
- Pandas - For data manipulation and analysis.
- NumPy - For numerical computing.
- scikit-learn - For clustering, machine learning, dimensionality reduction, and evaluation metrics, including:
- cosine_similarity - to measure similarity between items.
- KMeans- for clustering articles.
- TfidfVectorizer - to convert text data into numerical features.
- make_pipeline - to build ML pipelines.
- Normalizer - to normalize data.
- TruncatedSVD - for dimensionality reduction.
- model_evaluation- for evaluating model performance.
- Matplotlib - For visualizing data.
- Jupyter Notebook - For running and documenting Python code interactively.
- ✅ Matrix Factorization (SVD) achieved the highest accuracy at 82%.
- ✅ Content-Based Filtering proved effective for new users with no prior interactions.
- ✅ A hybrid recommendation approach is suggested for production environments to balance personalization and generalization.
- 📈 Optimal number of latent features found to be 200, providing a good trade-off between accuracy and computational cost.
Figure: Performance across different latent features.
- 📊 +28% increase in user engagement after implementing personalized recommendations.
- 📉 15% reduction in churn rate, indicating improved user retention.
Thanks to Udacity for providing this project as part of the Data Scientist Nanodegree program.
