End-to-end ML platform for Yelp business recommendations and sentiment analysis. Features collaborative filtering (ALS), NLP classification, FastAPI REST API, PySpark data processing, MLflow tracking, Docker deployment, and CI/CD automation. Academic/research project demonstrating production ML engineering.

rushirb2001/yelp-ml-platform

Yelp ML Platform

A production-ready machine learning platform for business recommendations and sentiment analysis using the Yelp dataset. Built with PySpark, FastAPI, and modern MLOps practices.


Overview

This project demonstrates end-to-end ML engineering capabilities including data processing, model training, API development, containerization, and CI/CD automation. The platform processes millions of Yelp reviews to provide business recommendations and sentiment analysis through a REST API.


Key Features

Machine Learning Models

  • Collaborative filtering recommendation system using ALS (Alternating Least Squares)
  • Multi-class sentiment classification with TF-IDF and Logistic Regression
  • MLflow experiment tracking and model versioning
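
To make the ALS idea concrete, here is a minimal pure-Python sketch of alternating least squares with a single latent factor. It is not the project's implementation (which relies on PySpark ML); the function names, the toy ratings, and the plain L2 penalty `lam` are assumptions chosen for illustration.

```python
import math


def als_rank1(ratings, n_iters=20, lam=0.01):
    """Fit r[u, i] ~ p[u] * q[i] with alternating closed-form updates."""
    users = {u for u, _ in ratings}
    items = {i for _, i in ratings}
    p = {u: 1.0 for u in users}
    q = {i: 1.0 for i in items}
    for _ in range(n_iters):
        # Hold item factors fixed; each user factor has a closed-form solution.
        for u in users:
            num = sum(r * q[i] for (uu, i), r in ratings.items() if uu == u)
            den = lam + sum(q[i] ** 2 for (uu, i) in ratings if uu == u)
            p[u] = num / den
        # Hold user factors fixed; solve for each item factor the same way.
        for i in items:
            num = sum(r * p[u] for (u, ii), r in ratings.items() if ii == i)
            den = lam + sum(p[u] ** 2 for (u, ii) in ratings if ii == i)
            q[i] = num / den
    return p, q


def train_rmse(ratings, p, q):
    """Root mean squared error over the observed ratings."""
    squared = [(r - p[u] * q[i]) ** 2 for (u, i), r in ratings.items()]
    return math.sqrt(sum(squared) / len(squared))


# Made-up ratings; the pair (u2, b1) is deliberately left unobserved.
toy_ratings = {
    ("u0", "b0"): 2.0, ("u0", "b1"): 1.0, ("u0", "b2"): 2.0,
    ("u1", "b0"): 4.0, ("u1", "b1"): 2.0, ("u1", "b2"): 4.0,
    ("u2", "b0"): 3.0, ("u2", "b2"): 3.0,
}
p, q = als_rank1(toy_ratings)
print("train RMSE:", train_rmse(toy_ratings, p, q))
print("predicted rating for (u2, b1):", p["u2"] * q["b1"])
```

A real model uses several latent factors per user and business, so each update becomes a small linear solve rather than a scalar division.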

Data Engineering

  • Large-scale data processing with Apache Spark
  • ETL pipelines for JSON to Parquet transformation
  • Feature engineering for user and business analytics
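
As a rough illustration of the kind of per-business feature this stage might produce, the sketch below aggregates review counts and mean star ratings from JSON Lines input using only the standard library. The project itself does this with Spark and writes Parquet; the function name and sample records here are hypothetical, and the sketch assumes the one-object-per-line layout and the `business_id` and `stars` fields of the Yelp review file.

```python
import io
import json
from collections import defaultdict


def business_features(lines):
    """Aggregate review count and mean stars per business from JSON Lines."""
    totals = defaultdict(lambda: [0, 0.0])  # business_id -> [count, star_sum]
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        review = json.loads(line)
        entry = totals[review["business_id"]]
        entry[0] += 1
        entry[1] += float(review["stars"])
    return {
        business_id: {"review_count": count, "avg_stars": star_sum / count}
        for business_id, (count, star_sum) in totals.items()
    }


# Hypothetical sample records standing in for the raw review file.
sample = io.StringIO(
    '{"business_id": "b1", "user_id": "u1", "stars": 5}\n'
    '{"business_id": "b1", "user_id": "u2", "stars": 3}\n'
    '{"business_id": "b2", "user_id": "u1", "stars": 4}\n'
)
features = business_features(sample)
print(features["b1"])
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of the in-memory sample.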

API & Services

  • RESTful API built with FastAPI
  • Auto-generated OpenAPI documentation
  • Docker containerization with multi-service orchestration

MLOps & DevOps

  • CI/CD pipeline with GitHub Actions
  • Automated testing (24 unit and integration tests)
  • Code quality checks with Black, isort, and Flake8
  • Docker image building and deployment automation

Project Structure

yelp-ml-platform/
├── src/                    # Source code
│   ├── api/               # FastAPI application
│   ├── data/              # Data loading and preprocessing
│   ├── features/          # Feature engineering
│   ├── models/            # ML models
│   └── utils/             # Utility functions
├── tests/                 # Test suite
├── scripts/               # Execution scripts
├── configs/               # Configuration files
├── data/                  # Data storage
├── notebooks/             # Jupyter notebooks
└── .github/workflows/     # CI/CD workflows

For detailed project structure, see docs/ARCHITECTURE.md


Quick Start

Prerequisites

  • Python 3.11
  • Java 21 (for PySpark)
  • Docker and Docker Compose (optional)

Installation

  1. Clone the repository
git clone https://github.com/rushirb2001/yelp-ml-platform.git
cd yelp-ml-platform
  2. Set up environment
conda env create -f environment.yml
conda activate yelp-ml-platform
  3. Download Yelp dataset
# Download from https://www.yelp.com/dataset
# Place JSON files in data/raw/
  4. Run data processing pipeline
./scripts/run_pipeline.sh
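
Before running the pipeline it can help to confirm that the dataset files are where step 3 says to put them. The check below is a small sketch, not part of the repository; the helper name is hypothetical, and the list of file names is an assumption about which Yelp files the pipeline reads.

```python
import tempfile
from pathlib import Path

# Assumed file names; adjust to whatever the pipeline actually reads.
EXPECTED = [
    "yelp_academic_dataset_business.json",
    "yelp_academic_dataset_review.json",
    "yelp_academic_dataset_user.json",
]


def missing_dataset_files(raw_dir, expected=EXPECTED):
    """Return the expected file names that are not present in raw_dir."""
    raw_dir = Path(raw_dir)
    return [name for name in expected if not (raw_dir / name).is_file()]


# Demonstration against a temporary directory holding only the first file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / EXPECTED[0]).write_text("{}\n")
    missing = missing_dataset_files(tmp)
print("missing:", missing)
```

In a checkout of the repository, calling `missing_dataset_files("data/raw")` should return an empty list once the download step is complete.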

For detailed setup instructions, see docs/SETUP.md


Usage

Running the API

Local development:

./scripts/run_api.sh

Using Docker:

./scripts/docker_run.sh

The API will be available at http://localhost:8000

Interactive documentation: http://localhost:8000/docs

API Examples

Sentiment Analysis:

curl -X POST "http://localhost:8000/predict/sentiment" \
  -H "Content-Type: application/json" \
  -d '{"text": "The food was amazing and the service was excellent!"}'

Business Information:

curl -X POST "http://localhost:8000/business/info" \
  -H "Content-Type: application/json" \
  -d '{"business_id": "your-business-id"}'
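
The same two calls can be made from Python with the standard library. This is a sketch that mirrors the curl commands above; the helper names are hypothetical, and because the response schema is not described here, the result is returned as parsed JSON without assuming any fields.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_post(path, payload, base_url=BASE_URL):
    """Build a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        base_url + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send the request and parse the JSON reply (the API must be running)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


sentiment_request = build_post(
    "/predict/sentiment",
    {"text": "The food was amazing and the service was excellent!"},
)
business_request = build_post(
    "/business/info",
    {"business_id": "your-business-id"},
)
# With the API running locally:
# print(send(sentiment_request))
```

Building and sending are kept separate so that the request can be inspected or logged before any network call is made.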

For complete API documentation, see docs/API.md

Model Performance

Recommendation System (ALS)

  • RMSE: 3.32
  • MAE: 2.79
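
For readers unfamiliar with the two metrics, the sketch below shows how RMSE and MAE are computed from paired actual and predicted ratings. The numbers are invented for illustration and are unrelated to the results reported above. Both metrics are expressed in the same units as the rating being predicted.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: the average size of the error."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Invented ratings, for illustration only.
actual = [5.0, 3.0, 4.0, 1.0]
predicted = [4.0, 3.0, 2.0, 2.0]
print("RMSE:", rmse(actual, predicted))
print("MAE:", mae(actual, predicted))
```

RMSE is never smaller than MAE on the same data, and the gap between them widens when a few predictions are far off.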

Sentiment Classifier (Logistic Regression)

  • Accuracy: 82.9%
  • F1 Score: 82.3%
  • Precision: 81.9%
  • Recall: 82.9%
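
The reported recall matches the reported accuracy, which is what support-weighted averaging over classes produces. Whether the project averages this way is an assumption; the sketch below only shows how weighted multi-class metrics are computed, using invented labels and a hypothetical helper name.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)
    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted_count = sum(1 for p in y_pred if p == label)
        actual_count = support[label]
        precision = tp / predicted_count if predicted_count else 0.0
        recall = tp / actual_count if actual_count else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = actual_count / n  # share of true examples in this class
        totals["precision"] += weight * precision
        totals["recall"] += weight * recall
        totals["f1"] += weight * f1
    totals["accuracy"] = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    return totals


# Invented labels, for illustration only.
y_true = ["pos", "pos", "pos", "neg", "neg", "neu"]
y_pred = ["pos", "pos", "neg", "neg", "neu", "neu"]
metrics = weighted_metrics(y_true, y_pred)
print(metrics)
```

Weighted recall sums the true positives of every class and divides by the number of examples, so it always coincides with accuracy; weighted precision and F1 generally do not.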

For detailed model evaluation, see docs/MODELS.md


Development

Running Tests

# All tests
./scripts/run_tests.sh all

# Unit tests only
./scripts/run_tests.sh unit

# Integration tests only
./scripts/run_tests.sh integration

# With coverage report
./scripts/run_tests.sh coverage

Code Quality

# Format code
black src/ tests/
isort src/ tests/

# Lint code
flake8 src/ tests/

MLflow Tracking

# Start MLflow UI
mlflow ui --port 5001

# View experiments at http://localhost:5001

Deployment

Docker Deployment

Build and run with Docker Compose:

./scripts/docker_build.sh
./scripts/docker_run.sh

For detailed deployment instructions, see docs/DOCKER.md

CI/CD

The project includes automated workflows for:

  • Running tests on every push
  • Code quality checks
  • Docker image building

All workflows are defined in .github/workflows/


Technical Stack

Core Technologies:

  • Python 3.11
  • Apache Spark 3.5.0
  • FastAPI 0.104.1
  • MLflow 2.9.2

Machine Learning:

  • PySpark ML (ALS, Logistic Regression)
  • NLTK (NLP processing)
  • Scikit-learn compatible APIs

Infrastructure:

  • Docker & Docker Compose
  • GitHub Actions
  • Pytest

Data Processing:

  • Pandas 2.1.0
  • NumPy 1.26.0

Project Timeline

This project was developed over 16 weeks (November 2024 - March 2025) following a structured development plan with distinct phases for data engineering, ML model development, API creation, testing, and deployment automation.


License

This project is licensed under a Custom Research and Educational License.

Key Points:

  • View and study the code freely
  • Use for educational purposes
  • Reference in academic papers
  • Copying/forking requires written permission
  • Commercial use requires written permission
  • Modification and redistribution require written permission

To request permission: Contact rushirbhavsar@gmail.com

See the LICENSE file for complete terms.


Acknowledgments

  • Yelp for providing the open dataset
  • Apache Spark and PySpark communities
  • FastAPI and Uvicorn teams
  • MLflow for experiment tracking capabilities
  • Open-source contributors and maintainers

References

  1. Koren, Y., Bell, R., & Volinsky, C. (2009). "Matrix Factorization Techniques for Recommender Systems." Computer, 42(8), 30-37.

  2. Zaharia, M., et al. (2016). "Apache Spark: A Unified Engine for Big Data Processing." Communications of the ACM, 59(11), 56-65.

  3. Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). "Applied Logistic Regression." Wiley Series in Probability and Statistics.

  4. Yelp Dataset. (2024). "Yelp Open Dataset." Retrieved from https://www.yelp.com/dataset


Support

For questions, issues, or suggestions:

