Yelp ML Platform

A production-ready machine learning platform for business recommendations and sentiment analysis using the Yelp dataset. Built with PySpark, FastAPI, and modern MLOps practices.

Overview

This project demonstrates end-to-end ML engineering capabilities including data processing, model training, API development, containerization, and CI/CD automation. The platform processes millions of Yelp reviews to provide business recommendations and sentiment analysis through a REST API.

Key Features

Machine Learning Models

Collaborative filtering recommendation system using ALS (Alternating Least Squares)
Multi-class sentiment classification with TF-IDF and Logistic Regression
MLflow experiment tracking and model versioning
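
To make the TF-IDF step of the sentiment classifier concrete, here is a minimal plain-Python sketch of the weighting. It is not the project's code, which uses PySpark ML. The function name `tf_idf` and the whitespace tokenization are assumptions for illustration. The sketch uses raw term counts as TF and the smoothed IDF `log((N + 1) / (df + 1))` that Spark MLlib documents for its `IDF` estimator.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    TF is the raw term count; IDF is log((N + 1) / (df + 1)), where N is the
    number of documents and df is the number of documents containing the term.
    """
    # Hypothetical tokenizer: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: count each term at most once per document.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    idf = {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


# Illustrative reviews, not taken from the Yelp dataset.
reviews = ["great food great service", "terrible food", "great place"]
weights = tf_idf(reviews)
```

Terms that occur in few reviews ("terrible") receive a larger IDF than terms spread across many reviews ("great", "food"), which is what lets a linear classifier such as Logistic Regression focus on the more distinctive words.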

Data Engineering

Large-scale data processing with Apache Spark
ETL pipelines for JSON to Parquet transformation
Feature engineering for user and business analytics
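
As a rough illustration of the kind of per-user aggregation that the feature engineering step implies, the sketch below computes a review count and mean star rating per user from newline-delimited JSON review records. The platform does this at scale with Spark; this plain-Python version only shows the logic. The `user_id`, `business_id`, and `stars` fields come from the Yelp review schema, while `user_features`, `review_count`, and `avg_stars` are hypothetical names chosen for this example.

```python
import json
from collections import defaultdict


def user_features(json_lines):
    """Aggregate review count and mean star rating per user_id."""
    totals = defaultdict(lambda: [0, 0.0])  # user_id -> [count, star_sum]
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the input
        review = json.loads(line)
        entry = totals[review["user_id"]]
        entry[0] += 1
        entry[1] += float(review["stars"])
    return {
        user: {"review_count": count, "avg_stars": star_sum / count}
        for user, (count, star_sum) in totals.items()
    }


# Illustrative records in the shape of the Yelp review file.
sample = [
    '{"user_id": "u1", "business_id": "b1", "stars": 5}',
    '{"user_id": "u1", "business_id": "b2", "stars": 3}',
    '{"user_id": "u2", "business_id": "b1", "stars": 4}',
]
features = user_features(sample)
```

The same function accepts an open file handle, since iterating over a file yields one line at a time.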

API & Services

RESTful API built with FastAPI
Auto-generated OpenAPI documentation
Docker containerization with multi-service orchestration

MLOps & DevOps

CI/CD pipeline with GitHub Actions
Automated testing (24 unit and integration tests)
Code quality checks with Black, isort, and Flake8
Docker image building and deployment automation

Project Structure

yelp-ml-platform/
├── src/                    # Source code
│   ├── api/               # FastAPI application
│   ├── data/              # Data loading and preprocessing
│   ├── features/          # Feature engineering
│   ├── models/            # ML models
│   └── utils/             # Utility functions
├── tests/                 # Test suite
├── scripts/               # Execution scripts
├── configs/               # Configuration files
├── data/                  # Data storage
├── notebooks/             # Jupyter notebooks
└── .github/workflows/     # CI/CD workflows

For detailed project structure, see docs/ARCHITECTURE.md

Quick Start

Prerequisites

Python 3.11
Java 21 (for PySpark)
Docker and Docker Compose (optional)

Installation

Clone the repository

git clone https://github.com/rushirb2001/yelp-ml-platform.git
cd yelp-ml-platform

Set up environment

conda env create -f environment.yml
conda activate yelp-ml-platform

Download Yelp dataset

# Download from https://www.yelp.com/dataset
# Place JSON files in data/raw/

Run data processing pipeline

./scripts/run_pipeline.sh

For detailed setup instructions, see docs/SETUP.md

Usage

Running the API

Local development:

./scripts/run_api.sh

Using Docker:

./scripts/docker_run.sh

The API will be available at http://localhost:8000

Interactive documentation: http://localhost:8000/docs

API Examples

Sentiment Analysis:

curl -X POST "http://localhost:8000/predict/sentiment" \
  -H "Content-Type: application/json" \
  -d '{"text": "The food was amazing and the service was excellent!"}'

Business Information:

curl -X POST "http://localhost:8000/business/info" \
  -H "Content-Type: application/json" \
  -d '{"business_id": "your-business-id"}'
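
The curl calls above can also be issued from Python. The following is a minimal standard-library sketch; the helper names `build_post` and `send` are hypothetical, and the endpoints and payloads are the ones shown in the curl examples. The response schema is not assumed here, so the sketch simply decodes whatever JSON the API returns.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default address from this README


def build_post(path, payload):
    """Build a JSON POST request for the given API path without sending it."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


sentiment_request = build_post(
    "/predict/sentiment",
    {"text": "The food was amazing and the service was excellent!"},
)
business_request = build_post("/business/info", {"business_id": "your-business-id"})

# Sending requires the API to be running locally:
# print(send(sentiment_request))
```

Separating request construction from sending keeps the payload logic testable without a running server.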

For complete API documentation, see docs/API.md

Model Performance

Recommendation System (ALS)

RMSE: 3.32
MAE: 2.79
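
For readers unfamiliar with the two metrics, the sketch below shows how RMSE and MAE are defined over pairs of actual and predicted ratings. The ratings are made-up values for illustration, not project results, and the function names are hypothetical. Lower is better for both, and RMSE is never smaller than MAE because squaring penalizes large errors more heavily.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error: mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Illustrative star ratings and predictions.
actual = [5.0, 3.0, 4.0, 1.0]
predicted = [4.0, 3.0, 2.0, 2.0]

rmse_value = rmse(actual, predicted)
mae_value = mae(actual, predicted)
```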

Sentiment Classifier (Logistic Regression)

Accuracy: 82.9%
F1 Score: 82.3%
Precision: 81.9%
Recall: 82.9%
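
The figures above do not state how per-class scores are averaged. Recall equal to accuracy is what support-weighted averaging always produces, so the sketch below assumes weighted averaging; that is an inference, not something this README confirms. The function name `weighted_metrics` and the labels are hypothetical, and the data is illustrative.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)      # true examples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = count / n  # each class contributes in proportion to its support
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    return {
        "accuracy": sum(correct.values()) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative three-class labels.
y_true = ["pos", "pos", "neu", "neg", "neg", "pos"]
y_pred = ["pos", "neu", "neu", "neg", "pos", "pos"]
scores = weighted_metrics(y_true, y_pred)
```

With weighted averaging, each class's recall is multiplied by its share of the examples, so the sum collapses to the fraction of correct predictions, which is the accuracy.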

For detailed model evaluation, see docs/MODELS.md

Development

Running Tests

# All tests
./scripts/run_tests.sh all

# Unit tests only
./scripts/run_tests.sh unit

# Integration tests only
./scripts/run_tests.sh integration

# With coverage report
./scripts/run_tests.sh coverage

Code Quality

# Format code
black src/ tests/
isort src/ tests/

# Lint code
flake8 src/ tests/

MLflow Tracking

# Start MLflow UI
mlflow ui --port 5001

# View experiments at http://localhost:5001

Deployment

Docker Deployment

Build and run with Docker Compose:

./scripts/docker_build.sh
./scripts/docker_run.sh

For detailed deployment instructions, see docs/DOCKER.md

CI/CD

The project includes automated workflows for:

Running tests on every push
Code quality checks
Docker image building

All workflows are defined in .github/workflows/

Technical Stack

Core Technologies:

Python 3.11
Apache Spark 3.5.0
FastAPI 0.104.1
MLflow 2.9.2

Machine Learning:

PySpark ML (ALS, Logistic Regression)
NLTK (NLP processing)
Scikit-learn compatible APIs

Infrastructure:

Docker & Docker Compose
GitHub Actions
Pytest

Data Processing:

Pandas 2.1.0
NumPy 1.26.0

Project Timeline

This project was developed over 16 weeks (November 2024 - March 2025) following a structured development plan with distinct phases for data engineering, ML model development, API creation, testing, and deployment automation.

License

This project is licensed under a Custom Research and Educational License.

Key Points:

View and study the code freely
Use for educational purposes
Reference in academic papers
Copying/forking requires written permission
Commercial use requires written permission
Modification and redistribution require written permission

To request permission, contact rushirbhavsar@gmail.com

See the LICENSE file for complete terms.

Acknowledgments

Yelp for providing the open dataset
Apache Spark and PySpark communities
FastAPI and Uvicorn teams
MLflow for experiment tracking capabilities
Open-source contributors and maintainers

References

Koren, Y., Bell, R., & Volinsky, C. (2009). "Matrix Factorization Techniques for Recommender Systems." Computer, 42(8), 30-37.
Zaharia, M., et al. (2016). "Apache Spark: A Unified Engine for Big Data Processing." Communications of the ACM, 59(11), 56-65.
Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). "Applied Logistic Regression." Wiley Series in Probability and Statistics.
Yelp Dataset. (2024). "Yelp Open Dataset." Retrieved from https://www.yelp.com/dataset

Support

For questions, issues, or suggestions:
