A production-oriented collaborative filtering system built using the Yelp dataset.
This project implements Matrix Factorization twice: once from scratch and once with an optimized library (cmfrec), to demonstrate both the underlying theory and practical scalability.
The goal is to build a personalized recommendation engine that:
- Learns latent user and business preferences
- Predicts unseen ratings
- Generates Top-K recommendations
- Evaluates ranking quality using Precision@K and Recall@K
- Handles real-world challenges like sparsity and cold start
The techniques used here belong to the same family of collaborative filtering methods commonly associated with large consumer recommendation systems (for example Swiggy, Zomato, Netflix, Amazon).
├── Recommender_System_MF.ipynb # Matrix Factorization from scratch
├── Recommender_system_cmfrec.ipynb # Optimized MF using cmfrec
Source: Yelp Academic Dataset
Files Used:
- review.json
- business.json
Key Columns:
- user_id
- business_id
- stars
The dataset is highly sparse: most users have rated only a small fraction of the businesses, as is typical of real-world recommendation data.
Notebook: Recommender_System_MF.ipynb
- Data ingestion from JSON
- ID encoding for users and businesses
- Train-test split (preventing data leakage)
- Construction of interaction matrix
- Matrix Factorization using Gradient Descent
- Regularization to prevent overfitting
- Full prediction matrix computation
- Evaluation using ranking metrics
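The encoding, splitting, and matrix-construction steps above can be sketched as follows. This is a minimal illustration on toy data, not the notebook's actual code: the function names (`encode_ids`, `train_test_split_ratings`, `build_interaction_matrix`), the 80/20 split, and the toy IDs are all assumptions. It treats 0 as "unobserved", which works because Yelp star ratings start at 1.

```python
import numpy as np

def encode_ids(raw_ids):
    """Map arbitrary string IDs to contiguous integer indices 0..n-1."""
    uniques, codes = np.unique(np.asarray(raw_ids), return_inverse=True)
    return uniques, codes

def train_test_split_ratings(n_rows, test_fraction=0.2, seed=0):
    """Randomly split row positions into train and test index arrays."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_fraction))
    return perm[n_test:], perm[:n_test]

def build_interaction_matrix(user_idx, item_idx, stars, n_users, n_items):
    """Dense user x business matrix; 0 marks an unobserved rating."""
    R = np.zeros((n_users, n_items), dtype=np.float64)
    R[user_idx, item_idx] = stars
    return R

# Toy stand-in for the user_id, business_id and stars columns (hypothetical values).
user_ids = ["u_a", "u_b", "u_a", "u_c", "u_b"]
business_ids = ["b_x", "b_x", "b_y", "b_z", "b_z"]
stars = np.array([5.0, 3.0, 4.0, 2.0, 1.0])

users, u = encode_ids(user_ids)
businesses, b = encode_ids(business_ids)

# Split BEFORE building the matrix so held-out ratings never reach training.
train_rows, test_rows = train_test_split_ratings(len(stars), test_fraction=0.2, seed=0)
R_train = build_interaction_matrix(
    u[train_rows], b[train_rows], stars[train_rows], len(users), len(businesses)
)
```

Encoding the IDs once over the full dataset, then splitting, keeps the row and column indices consistent between the train and test sets. A dense matrix like this is only practical on a subset of the data; at larger scale a sparse representation would be the more likely choice.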
We approximate the rating matrix:
R ≈ P × Qᵀ
Where:
- P → User latent matrix
- Q → Business latent matrix
- K → Number of latent dimensions
Each user and business is represented in a shared latent space capturing hidden behavioral patterns.
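As a rough illustration of how such a factorization can be learned, here is a minimal NumPy sketch using stochastic gradient descent with L2 regularization over the observed entries only. It is an assumption about the general approach, not a copy of the notebook: the function names (`train_mf`, `rmse_observed`), the hyperparameter values, and the toy matrix are all hypothetical.

```python
import numpy as np

def train_mf(R, k=2, lr=0.01, reg=0.02, epochs=200, seed=0):
    """Factorize R into P (users x k) and Q (businesses x k) with SGD.

    Entries equal to 0 are treated as unobserved and are skipped.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, k))
    Q = rng.normal(scale=0.1, size=(n_items, k))
    rows, cols = np.nonzero(R)  # positions of observed ratings
    for _ in range(epochs):
        for idx in rng.permutation(len(rows)):
            u, i = rows[idx], cols[idx]
            err = R[u, i] - P[u] @ Q[i]
            p_old = P[u].copy()  # use the pre-update user vector for Q's step
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return P, Q

def rmse_observed(R, P, Q):
    """RMSE computed over observed (non-zero) entries only."""
    mask = R > 0
    diff = (R - P @ Q.T)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy 4 x 4 rating matrix (0 = not rated).
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

P, Q = train_mf(R, k=2, lr=0.01, reg=0.02, epochs=500)
R_hat = P @ Q.T  # full prediction matrix, including unseen cells
```

The `reg` term shrinks both factor vectors toward zero on every update, which is what limits overfitting when a user or business has only a few ratings.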
Notebook: Recommender_system_cmfrec.ipynb
This implementation demonstrates:
- Efficient large-scale training
- Faster convergence
- Production-grade matrix factorization
- Better scalability for real-world deployment
Relying on a tuned library rather than hand-written training loops is closer to how recommender systems are typically built in industry.
Rather than relying on regression error alone, this system also evaluates ranking performance. The metrics reported are:
- RMSE
- Precision@K
- Recall@K
This means the model is judged on recommendation relevance, not just numerical accuracy.
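For reference, Precision@K and Recall@K for a single user can be computed as in the sketch below. The function name `precision_recall_at_k` and the toy IDs are hypothetical. How "relevant" is defined is also an assumption here: a common choice is held-out businesses the user rated at or above some threshold, such as 4 stars.

```python
def precision_recall_at_k(ranked_items, relevant_items, k):
    """Return (precision@k, recall@k) for one user.

    ranked_items: recommended item IDs, best first.
    relevant_items: set of item IDs considered relevant in the test set.
    """
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall

# Toy example: five recommendations, three relevant held-out businesses.
ranked = ["b1", "b2", "b3", "b4", "b5"]
relevant = {"b2", "b5", "b9"}
p, r = precision_recall_at_k(ranked, relevant, k=3)
```

In this toy case only `b2` appears in the top 3, so both values are 1/3. Averaging the per-user values over all test users gives the dataset-level metric.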
- Sparse data handling
- Hyperparameter tuning (K, learning rate, regularization)
- Vectorized operations for performance
- Proper encoding to maintain ID consistency
- Cold-start strategy discussion (global average, bias models)
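One way the global-average and bias ideas from the cold-start discussion could be combined is sketched below: predict `mu + b_u + b_i`, and let a missing user or business contribute a bias of zero, so a completely new pair falls back to the global average. The names `fit_baseline` and `predict_baseline`, the `damping` shrinkage constant, and the toy data are assumptions for illustration, not taken from the notebooks.

```python
import numpy as np

def fit_baseline(user_idx, item_idx, stars, n_users, n_items, damping=5.0):
    """Estimate the global mean plus damped per-business and per-user biases."""
    mu = stars.mean()
    resid = stars - mu
    # Business bias: damped mean residual per business.
    item_sum = np.bincount(item_idx, weights=resid, minlength=n_items)
    item_cnt = np.bincount(item_idx, minlength=n_items)
    b_i = item_sum / (item_cnt + damping)
    # User bias: damped mean of what remains after removing the business bias.
    resid_u = resid - b_i[item_idx]
    user_sum = np.bincount(user_idx, weights=resid_u, minlength=n_users)
    user_cnt = np.bincount(user_idx, minlength=n_users)
    b_u = user_sum / (user_cnt + damping)
    return mu, b_u, b_i

def predict_baseline(mu, b_u, b_i, user=None, item=None):
    """Pass None for a user or business that was never seen in training."""
    bu = b_u[user] if user is not None else 0.0
    bi = b_i[item] if item is not None else 0.0
    return float(np.clip(mu + bu + bi, 1.0, 5.0))  # keep within the 1-5 star range

# Toy data: user 2 exists in the ID mapping but has no training ratings.
user_idx = np.array([0, 0, 1])
item_idx = np.array([0, 1, 0])
stars = np.array([5.0, 3.0, 4.0])
mu, b_u, b_i = fit_baseline(user_idx, item_idx, stars, n_users=3, n_items=2)

new_pair = predict_baseline(mu, b_u, b_i)                   # unknown user and business
cold_user = predict_baseline(mu, b_u, b_i, user=2, item=0)  # known business only
```

The damping term pulls the bias of rarely rated users and businesses toward zero, so a single extreme rating does not dominate the fallback prediction.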
- Collaborative Filtering
- Matrix Factorization (Gradient Descent)
- Ranking Metric Evaluation
- Data Leakage Prevention
- Sparse Data Optimization
- Model Evaluation Strategy
- Production-Oriented Thinking
- Python
- NumPy
- Pandas
- Matplotlib
- cmfrec
- Google Colab
- Clone the repository
- Download the Yelp dataset
- Update dataset paths inside the notebooks
- Run notebooks sequentially
- Implement Alternating Least Squares (ALS)
- Add implicit feedback modeling
- Deploy model via FastAPI
- Add MLflow model tracking
- Scale using sparse matrix frameworks
- Implements Matrix Factorization from scratch (not just library usage)
- Uses proper ranking metrics (Precision@K, Recall@K)
- Addresses real-world challenges like sparsity and cold start
- Compares theoretical and optimized implementations
This project demonstrates strong fundamentals in recommender systems and applied machine learning.
Aasim
Focused on building scalable ML systems with a strong foundation in recommendation algorithms and behavioral data modeling.