A production-ready machine learning platform for business recommendations and sentiment analysis using the Yelp dataset. Built with PySpark, FastAPI, and modern MLOps practices.
This project demonstrates end-to-end ML engineering capabilities including data processing, model training, API development, containerization, and CI/CD automation. The platform processes millions of Yelp reviews to provide business recommendations and sentiment analysis through a REST API.
Machine Learning Models
- Collaborative filtering recommendation system using ALS (Alternating Least Squares)
- Multi-class sentiment classification with TF-IDF and Logistic Regression
- MLflow experiment tracking and model versioning
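To make the ALS idea concrete, here is a minimal sketch in plain Python rather than PySpark. It is not the project's implementation: it uses a single latent factor (rank 1) and a plain L2 penalty, and the names `als_rank1`, `training_rmse`, and the toy ratings are hypothetical. It shows the alternation itself: with item factors fixed, each user factor has a closed-form ridge solution, and vice versa.

```python
from math import sqrt


def als_rank1(ratings, reg=0.1, sweeps=10):
    """Fit rating[u, i] ~ p[u] * q[i] by alternating least squares (rank 1).

    ratings: dict mapping (user, item) -> rating.
    Returns (p, q, losses); losses[k] is the regularized squared error
    after k full sweeps, with losses[0] the value at initialization.
    """
    users = {u for u, _ in ratings}
    items = {i for _, i in ratings}
    p = {u: 1.0 for u in users}  # user factors
    q = {i: 1.0 for i in items}  # item factors

    def loss():
        err = sum((r - p[u] * q[i]) ** 2 for (u, i), r in ratings.items())
        penalty = sum(v * v for v in p.values()) + sum(v * v for v in q.values())
        return err + reg * penalty

    losses = [loss()]
    for _ in range(sweeps):
        # Item factors fixed: solve a one-dimensional ridge regression
        # for each user over the items that user rated.
        for u in users:
            num = sum(r * q[i] for (uu, i), r in ratings.items() if uu == u)
            den = reg + sum(q[i] ** 2 for (uu, i) in ratings if uu == u)
            p[u] = num / den
        # User factors fixed: solve the same problem for each item.
        for i in items:
            num = sum(r * p[u] for (u, ii), r in ratings.items() if ii == i)
            den = reg + sum(p[u] ** 2 for (u, ii) in ratings if ii == i)
            q[i] = num / den
        losses.append(loss())
    return p, q, losses


def training_rmse(ratings, p, q):
    """Root mean squared error of the fitted factors on the given ratings."""
    squared = [(r - p[u] * q[i]) ** 2 for (u, i), r in ratings.items()]
    return sqrt(sum(squared) / len(squared))


# Hypothetical toy data: (user, business) -> star rating.
toy = {
    ("u1", "b1"): 5, ("u1", "b2"): 3,
    ("u2", "b1"): 4, ("u2", "b3"): 1,
    ("u3", "b2"): 2, ("u3", "b3"): 1,
}
p, q, losses = als_rank1(toy)
print(f"loss: {losses[0]:.2f} -> {losses[-1]:.2f}")
print(f"training RMSE: {training_rmse(toy, p, q):.3f}")
```

Because each half-step minimizes the objective exactly over one set of factors, the recorded loss never increases from sweep to sweep. A real model would use more latent factors and a held-out split for evaluation.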
Data Engineering
- Large-scale data processing with Apache Spark
- ETL pipelines for JSON to Parquet transformation
- Feature engineering for user and business analytics
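The extract, clean, and aggregate steps can be sketched on a handful of records in plain Python. This is an illustration only, not the project's Spark pipeline, and it stops short of writing Parquet. It assumes the review file is newline-delimited JSON with `user_id`, `business_id`, `stars`, and `text` fields, as in the public Yelp review schema; `parse_reviews` and `business_features` are hypothetical names.

```python
import json
from collections import defaultdict


def parse_reviews(lines):
    """Yield cleaned review records from newline-delimited JSON text."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed rows instead of failing the whole run
        if not isinstance(rec, dict):
            continue
        stars = rec.get("stars")
        is_number = isinstance(stars, (int, float)) and not isinstance(stars, bool)
        if not is_number or not 1 <= stars <= 5:
            continue  # assumed valid range: 1 to 5 stars
        if not rec.get("user_id") or not rec.get("business_id"):
            continue
        yield {
            "user_id": rec["user_id"],
            "business_id": rec["business_id"],
            "stars": float(stars),
            "text": rec.get("text", ""),
        }


def business_features(reviews):
    """Aggregate review count and mean star rating per business."""
    totals = defaultdict(lambda: [0, 0.0])  # business_id -> [count, star sum]
    for rev in reviews:
        entry = totals[rev["business_id"]]
        entry[0] += 1
        entry[1] += rev["stars"]
    return {
        bid: {"review_count": count, "avg_stars": star_sum / count}
        for bid, (count, star_sum) in totals.items()
    }


# Hypothetical sample rows, including one malformed and one out-of-range row.
sample_lines = [
    '{"user_id": "u1", "business_id": "b1", "stars": 5, "text": "Great"}',
    '{"user_id": "u2", "business_id": "b1", "stars": 4, "text": "Good"}',
    '{"user_id": "u3", "business_id": "b2", "stars": 2, "text": "Meh"}',
    "not valid json",
    '{"user_id": "u4", "business_id": "b2", "stars": 9, "text": "Bad row"}',
]
features = business_features(parse_reviews(sample_lines))
print(features)
```

The same per-user aggregation would follow by keying on `user_id` instead of `business_id`.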
API & Services
- RESTful API built with FastAPI
- Auto-generated OpenAPI documentation
- Docker containerization with multi-service orchestration
MLOps & DevOps
- CI/CD pipeline with GitHub Actions
- Automated testing (24 unit and integration tests)
- Code quality checks with Black, isort, and Flake8
- Docker image building and deployment automation
yelp-ml-platform/
├── src/ # Source code
│ ├── api/ # FastAPI application
│ ├── data/ # Data loading and preprocessing
│ ├── features/ # Feature engineering
│ ├── models/ # ML models
│ └── utils/ # Utility functions
├── tests/ # Test suite
├── scripts/ # Execution scripts
├── configs/ # Configuration files
├── data/ # Data storage
├── notebooks/ # Jupyter notebooks
└── .github/workflows/ # CI/CD workflows
For detailed project structure, see docs/ARCHITECTURE.md
- Python 3.11
- Java 21 (for PySpark)
- Docker and Docker Compose (optional)
- Clone the repository
git clone https://github.com/rushirb2001/yelp-ml-platform.git
cd yelp-ml-platform
- Set up environment
conda env create -f environment.yml
conda activate yelp-ml-platform
- Download Yelp dataset
# Download from https://www.yelp.com/dataset
# Place JSON files in data/raw/
- Run data processing pipeline
./scripts/run_pipeline.sh
For detailed setup instructions, see docs/SETUP.md
Local development:
./scripts/run_api.sh
Using Docker:
./scripts/docker_run.sh
The API will be available at http://localhost:8000
Interactive documentation: http://localhost:8000/docs
Sentiment Analysis:
curl -X POST "http://localhost:8000/predict/sentiment" \
-H "Content-Type: application/json" \
-d '{"text": "The food was amazing and the service was excellent!"}'
Business Information:
curl -X POST "http://localhost:8000/business/info" \
-H "Content-Type: application/json" \
-d '{"business_id": "your-business-id"}'
For complete API documentation, see docs/API.md
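The same two requests can be issued from Python with only the standard library. This is a sketch, assuming the API is running locally on port 8000 as described above. The helper names `build_request` and `call_api` are hypothetical, and because the response schema is not documented here, the parsed JSON is returned as-is.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default address from the section above


def build_request(path, payload, base_url=BASE_URL):
    """Build a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_api(path, payload, base_url=BASE_URL, timeout=10):
    """Send the request and return the decoded JSON response."""
    req = build_request(path, payload, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


sentiment_req = build_request(
    "/predict/sentiment",
    {"text": "The food was amazing and the service was excellent!"},
)
business_req = build_request(
    "/business/info",
    {"business_id": "your-business-id"},  # placeholder ID, as in the curl example
)

# With the server running, the calls would look like:
#   result = call_api("/predict/sentiment", {"text": "Great tacos!"})
#   print(result)
```

Splitting request construction from sending keeps the payload logic testable without a live server.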
Recommendation System (ALS)
- RMSE: 3.32
- MAE: 2.79
Sentiment Classifier (Logistic Regression)
- Accuracy: 82.9%
- F1 Score: 82.3%
- Precision: 81.9%
- Recall: 82.9%
For detailed model evaluation, see docs/MODELS.md
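For readers unfamiliar with the metrics above, the sketch below computes them on toy data in plain Python. It assumes the classifier scores are support-weighted averages across classes. The source does not state the averaging mode, but weighted recall always equals accuracy, which is consistent with the matching 82.9% figures. The function names and sample values are hypothetical.

```python
from collections import Counter
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the ratings."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the ratings."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)      # true examples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        prec = correct[label] / predicted[label] if predicted[label] else 0.0
        rec = correct[label] / count
        f_score = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        weight = count / n         # weight each class by its share of examples
        precision += weight * prec
        recall += weight * rec
        f1 += weight * f_score
    return {
        "accuracy": sum(correct.values()) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical star ratings and predictions.
stars_true = [5, 3, 1]
stars_pred = [4, 3, 3]
print(rmse(stars_true, stars_pred), mae(stars_true, stars_pred))

# Hypothetical sentiment labels and predictions.
labels_true = ["pos", "pos", "neg", "neu"]
labels_pred = ["pos", "neg", "neg", "neu"]
scores = weighted_scores(labels_true, labels_pred)
print(scores)
```

In a PySpark pipeline these values would more likely come from `RegressionEvaluator` and `MulticlassClassificationEvaluator`; whether this project uses them is not stated here.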
# All tests
./scripts/run_tests.sh all
# Unit tests only
./scripts/run_tests.sh unit
# Integration tests only
./scripts/run_tests.sh integration
# With coverage report
./scripts/run_tests.sh coverage
# Format code
black src/ tests/
isort src/ tests/
# Lint code
flake8 src/ tests/
# Start MLflow UI
mlflow ui --port 5001
# View experiments at http://localhost:5001
Build and run with Docker Compose:
./scripts/docker_build.sh
./scripts/docker_run.sh
For detailed deployment instructions, see docs/DOCKER.md
The project includes automated workflows for:
- Running tests on every push
- Code quality checks
- Docker image building
All workflows are defined in .github/workflows/
Core Technologies:
- Python 3.11
- Apache Spark 3.5.0
- FastAPI 0.104.1
- MLflow 2.9.2
Machine Learning:
- PySpark ML (ALS, Logistic Regression)
- NLTK (NLP processing)
- Scikit-learn compatible APIs
Infrastructure:
- Docker & Docker Compose
- GitHub Actions
- Pytest
Data Processing:
- Pandas 2.1.0
- NumPy 1.26.0
This project was developed over 16 weeks (November 2024 - March 2025) following a structured development plan with distinct phases for data engineering, ML model development, API creation, testing, and deployment automation.
This project is licensed under a Custom Research and Educational License.
Key Points:
- View and study the code freely
- Use for educational purposes
- Reference in academic papers
- Copying/forking requires written permission
- Commercial use requires written permission
- Modification and redistribution require written permission
To request permission: Contact rushirbhavsar@gmail.com
See the LICENSE file for complete terms.
- Yelp for providing the open dataset
- Apache Spark and PySpark communities
- FastAPI and Uvicorn teams
- MLflow for experiment tracking capabilities
- Open-source contributors and maintainers
- Koren, Y., Bell, R., & Volinsky, C. (2009). "Matrix Factorization Techniques for Recommender Systems." Computer, 42(8), 30-37.
- Zaharia, M., et al. (2016). "Apache Spark: A Unified Engine for Big Data Processing." Communications of the ACM, 59(11), 56-65.
- Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). "Applied Logistic Regression." Wiley Series in Probability and Statistics.
- Yelp Dataset. (2024). "Yelp Open Dataset." Retrieved from https://www.yelp.com/dataset
For questions, issues, or suggestions:
- Email: rushirbhavsar@gmail.com
- Issues: GitHub Issues
- Discussions: GitHub Discussions