Content-Based Restaurant Recommender: SBERT,XGBoost & MMR Pipeline

Presentation Slides

Repository Structure

restaurant-recommender/
├── src/                        # Pipeline scripts (01-09)
│   ├── config.py               # Central configuration (Not included)
│   ├── 01_yelp_etl.py          # Data extraction 
│   ├── 02_spatial_etl.py       # Spatial clustering 
│   ├── 03_sentiment.py         # Hybrid VADER + Transformer pipeline
│   ├── 04_sbert.py             # Hierarchical SBERT Chunking implementation
│   ├── 05_similarity_pop.py    # Semantic similarity matrix computation
│   ├── 06_feature_builder.py   # Final feature vector assembly (36-dim)
│   ├── 07_model_training.py    # Model training
│   ├── 08_ablation.py          # Ablation study
│   └── 09_mmr.py               # MMR evaluation
│
├── data/processed/             # Feature datasets
│   ├── features_all.parquet
│   └── restaurant_embeddings.parquet
│
├── models/                     # Trained checkpoints
│   ├── ols_model.joblib
│   ├── xgb_model.joblib
│   ├── dnn_model.keras
│   └── scaler.joblib
│
├── figures/                    # Visualizations
│
├── requirements.txt
├── run_models.py               # Quick-start (Local)
└── run_Models.ipynb            # Quick-start (Colab)

Quick Start

Option 1: Run with Pre-processed Data

# Clone repository
git clone https://github.com/YingWana/Restaurant-Recommender.git
cd restaurant-recommender

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run model training & evaluation
python run_models.py

Option 2: Google Colab (Recommended)

from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/capstone
!pip install -q xgboost sentence-transformers vaderSentiment
!python run_models.py

Note: This project was developed and tested using Google Colab with the following configuration:

Runtime type: Python 3
Hardware accelerator: NVIDIA T4 GPU
RAM: High-RAM runtime
Runtime version: Latest available at execution time

Feature Engineering

36-Feature Summary

Category	Features	Count	Description
Semantic	X_sim	1	Cosine similarity (two-phase design)
Sentiment	X_sent_mean, X_sent_std, X_sent_skew	3	Three-moment distribution
Popularity	X_pop	1	log(review_count + 1)
Amenities	TakeOut, Delivery, OutdoorSeating, etc.	8	Binary attributes
Cuisine	American, Mexican, Italian, etc.	22	One-hot encoded
Spatial	neighborhood_type	1	K-Means cluster
Total		36

Dataset

Metric	Value
Total Restaurants	10,841
Total Reviews	2,390,705
Geographic Coverage	PA, FL, TN, AZ, LA
Minimum Reviews	50 per restaurant
Train / Val / Test Split	70% / 15% / 15%
Test Set Size	n = 1,627
Reviews > 512 tokens	78,043 (2.6%)

Source: Yelp Open Dataset (2024)

Requirements

# --- Core Data Science ---
numpy==1.26.4
pandas==2.2.3
scipy==1.14.1

# --- Machine Learning ---
scikit-learn==1.5.2
xgboost==2.1.1

# --- Deep Learning ---
# For standard systems:
tensorflow>=2.16.0
# For Apple Silicon, replace with:
# tensorflow-macos==2.16.2
# tensorflow-metal==1.2.0

# --- NLP ---
nltk==3.9.1
vaderSentiment==3.3.2
sentence-transformers==3.0.1
transformers==4.45.2
torch>=2.0.0

# --- Visualization ---
matplotlib==3.9.2
seaborn==0.13.2

# --- Utilities ---
tqdm==4.66.5
joblib>=1.3.0
pyarrow>=14.0.0

Apple Silicon (M1/M2/M3) Installation

pip install tensorflow-macos tensorflow-metal

References

#	Citation
[1]	Hutto & Gilbert (2014). "VADER: A parsimonious rule-based model for sentiment analysis." ICWSM
[2]	Reimers & Gurevych (2019). "Sentence-BERT: Sentence embeddings using siamese BERT-networks." EMNLP
[3]	Chen & Guestrin (2016). "XGBoost: A scalable tree boosting system." KDD
[4]	Abadi et al. (2016). "TensorFlow: A system for large-scale machine learning." OSDI
[5]	Carbonell & Goldstein (1998). "MMR diversity-based reranking." SIGIR
[6]	Liu (2009). "Learning to rank for information retrieval." FnTIR
[7]	Yelp Open Dataset (2024). https://www.yelp.com/dataset
[8]	Ng (2021). "Data-Centric AI." DeepLearning.AI

Additional Acknowledgments

Yelp for providing the Open Dataset
Sentence-Transformers team for the SBERT implementation
NVIDIA for CUDA/GPU acceleration support

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
data/processed		data/processed
figures		figures
models		models
src		src
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
run_models.ipynb		run_models.ipynb
run_models.py		run_models.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Content-Based Restaurant Recommender: SBERT,XGBoost & MMR Pipeline

Presentation Slides

Repository Structure

Quick Start

Option 1: Run with Pre-processed Data

Option 2: Google Colab (Recommended)

Feature Engineering

36-Feature Summary

Dataset

Requirements

References

Additional Acknowledgments

📄 License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Content-Based Restaurant Recommender: SBERT,XGBoost & MMR Pipeline

Presentation Slides

Repository Structure

Quick Start

Option 1: Run with Pre-processed Data

Option 2: Google Colab (Recommended)

Feature Engineering

36-Feature Summary

Dataset

Requirements

References

Additional Acknowledgments

📄 License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages