Advanced regression analysis suite featuring KNN optimization, multi-algorithm comparison, hyperparameter tuning with Optuna, and production-ready ML pipelines with comprehensive model evaluation and visualization.

mwasifanwar/ml-ensemble-regression

ML Ensemble Regression Studio

A comprehensive machine learning framework for advanced regression analysis with multi-algorithm comparison and hyperparameter optimization.

Overview

ML Ensemble Regression Studio is a machine learning framework for regression analysis. It provides pipelines for model comparison, hyperparameter optimization, and production deployment, and it demonstrates how to structure machine learning projects for maintainability, reproducibility, and scalability.

The framework supports multiple regression algorithms with advanced optimization techniques, comprehensive evaluation metrics, and professional visualization capabilities. It serves as both a practical tool for data scientists and a reference implementation for ML engineering patterns.

Architecture

The system follows a component-based architecture where each module has a single responsibility and well-defined interfaces:

Data Layer → Preprocessing → Model Training → Evaluation → Visualization
    ↓             ↓              ↓             ↓            ↓
DataLoader   Preprocessor   ModelTrainer   Evaluator   Visualizer
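
The stage chain above can be illustrated with a small sketch. It is not the project's actual code: the function names (`load`, `preprocess`, `train`, `evaluate`) and the shared `context` dictionary are assumptions made for illustration, standing in for the DataLoader, Preprocessor, ModelTrainer, and Evaluator components.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import Any, Callable, Dict, List

# Each stage takes the shared context and returns an extended copy of it.
Stage = Callable[[Dict[str, Any]], Dict[str, Any]]


@dataclass
class StagePipeline:
    """Runs single-responsibility stages in order (hypothetical helper)."""
    stages: List[Stage] = field(default_factory=list)

    def run(self, context: Dict[str, Any]) -> Dict[str, Any]:
        for stage in self.stages:
            context = stage(context)
        return context


def load(context):
    # Stand-in for a data loader: a tiny in-memory (x, y) dataset.
    rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
    return {**context, "rows": rows}


def preprocess(context):
    # Center the single feature around its mean.
    xs = [x for x, _ in context["rows"]]
    ys = [y for _, y in context["rows"]]
    mean_x = sum(xs) / len(xs)
    return {**context, "xs": [x - mean_x for x in xs], "ys": ys}


def train(context):
    # Ordinary least squares for one centered feature.
    xs, ys = context["xs"], context["ys"]
    mean_y = sum(ys) / len(ys)
    slope = sum(x * (y - mean_y) for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return {**context, "slope": slope, "intercept": mean_y}


def evaluate(context):
    # Root mean squared error on the training rows.
    preds = [context["intercept"] + context["slope"] * x for x in context["xs"]]
    mse = sum((p - y) ** 2 for p, y in zip(preds, context["ys"])) / len(preds)
    return {**context, "rmse": sqrt(mse)}


result = StagePipeline([load, preprocess, train, evaluate]).run({})
print(f"slope={result['slope']:.2f}, rmse={result['rmse']:.3f}")
```

Because every stage only reads from and adds to the context, a stage can be swapped or tested in isolation without touching the others.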

Features

Core Capabilities

  • Multi-algorithm regression comparison including KNN, Linear Regression, Ridge/Lasso, Decision Trees, Random Forest, Gradient Boosting, and Support Vector Regression
  • Bayesian hyperparameter optimization using Optuna for efficient parameter search
  • Comprehensive model evaluation with cross-validation and statistical testing
  • Advanced visualization for model comparison, residual analysis, and feature importance
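
As a rough illustration of multi-algorithm comparison with cross-validation, the sketch below scores a few scikit-learn regressors on synthetic data. The dataset, the model selection, and the variable names are assumptions for this example and do not reflect the framework's internal API.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Distance- and penalty-based models are wrapped with a scaler.
candidates = {
    "knn": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    "linear": LinearRegression(),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# The same folds are used for every model so the scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

rmse_by_model = {}
for name, model in candidates.items():
    scores = cross_val_score(
        model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    rmse_by_model[name] = float(-scores.mean())  # flip sign back to RMSE

best_name = min(rmse_by_model, key=rmse_by_model.get)
for name, rmse in sorted(rmse_by_model.items(), key=lambda item: item[1]):
    print(f"{name:>14}: RMSE {rmse:.2f}")
print(f"Best model: {best_name}")
```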

Technical Features

  • Modular pipeline architecture with clean separation of concerns
  • Automated data validation and preprocessing with multiple scaling strategies
  • Polynomial feature generation and interaction terms
  • Model serialization for deployment using Joblib
  • Containerization support with Docker
  • Experiment tracking with MLflow integration

Production Ready

  • Comprehensive test suite with unit and integration tests
  • Continuous integration with GitHub Actions
  • Configuration management through YAML and dataclasses
  • Professional documentation and examples
  • Cross-platform compatibility

Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager
  • 4GB RAM minimum (8GB recommended for larger datasets)

Quick Installation

git clone https://github.com/mwasifanwar/ml-ensemble-regression
cd ml-ensemble-regression
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
pip install -e .

Development Installation

pip install -r requirements.txt
pip install pytest pytest-cov flake8 black
pytest tests/ -v  # Verify installation

Usage

Basic Pipeline Execution

python src/main.py

This executes the complete analysis pipeline:

  • Loads and validates the sample dataset
  • Performs exploratory data analysis
  • Creates features and scales data
  • Optimizes hyperparameters for all models
  • Trains and evaluates regression algorithms
  • Generates visualizations and saves artifacts

Programmatic Usage

from src import ModelPipeline, ModelConfig

config = ModelConfig(test_size=0.2, random_state=42, cv_folds=5, n_trials=100)
pipeline = ModelPipeline(config)
results = pipeline.run('your_dataset.csv')

best_model_name, best_results = pipeline.trainer.get_best_model()
print(f"Best model: {best_model_name}, RMSE: {best_results['metrics']['rmse']:.2f}")

Model Inference

import joblib
import pandas as pd

model = joblib.load('models/saved_models/best_model.pkl')
scaler = joblib.load('models/saved_models/scaler.pkl')

new_data = pd.DataFrame({'feature1': [1.5, 2.8], 'feature2': [3.2, 4.1]})
new_data_scaled = scaler.transform(new_data)
predictions = model.predict(new_data_scaled)
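
Loading the scaler and the model from separate files works as long as both come from the same training run and the new data has the same columns in the same order. One alternative, sketched below, is to persist both steps as a single scikit-learn `Pipeline` together with the training column order. This is an illustration, not the project's actual persistence format: the bundle layout, file name, and toy data are assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy training data (made up for the example).
train = pd.DataFrame({"feature1": [1.0, 2.0, 3.0, 4.0],
                      "feature2": [2.0, 3.0, 4.0, 5.0]})
target = [10.0, 20.0, 30.0, 40.0]

# Scaler and model are fitted and saved as one object.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsRegressor(n_neighbors=2)),
])
pipeline.fit(train, target)

# Store the column order next to the fitted pipeline (hypothetical layout).
bundle = {"pipeline": pipeline, "feature_order": list(train.columns)}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bundle.pkl")
    joblib.dump(bundle, path)
    restored = joblib.load(path)

# New data arrives with columns in a different order.
new_data = pd.DataFrame({"feature2": [3.2, 4.1], "feature1": [1.5, 2.8]})
new_data = new_data[restored["feature_order"]]  # restore training order

predictions = restored["pipeline"].predict(new_data)
print(predictions)
```

Keeping the scaler inside the pipeline means inference needs a single `predict` call, and the scaler cannot drift out of sync with the model.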

Configuration

Model Configuration

The framework uses a hierarchical configuration system:

from dataclasses import dataclass

@dataclass
class ModelConfig:
    test_size: float = 0.2
    random_state: int = 42
    cv_folds: int = 5
    n_trials: int = 100
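
A dataclass like this can be populated from a plain mapping, for example the result of parsing `config/config.yaml`. The sketch below shows one possible approach; the `config_from_mapping` helper and the validation rules in `__post_init__` are assumptions added for illustration, and the real YAML schema is not shown in this README.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping


@dataclass
class ModelConfig:
    test_size: float = 0.2
    random_state: int = 42
    cv_folds: int = 5
    n_trials: int = 100

    def __post_init__(self) -> None:
        # Assumed sanity checks; the project may validate differently.
        if not 0.0 < self.test_size < 1.0:
            raise ValueError("test_size must be between 0 and 1 (exclusive)")
        if self.cv_folds < 2:
            raise ValueError("cv_folds must be at least 2")
        if self.n_trials < 1:
            raise ValueError("n_trials must be at least 1")


def config_from_mapping(raw: Mapping[str, Any]) -> ModelConfig:
    """Build a ModelConfig from a dict, rejecting unknown keys (hypothetical helper)."""
    known = {f.name for f in fields(ModelConfig)}
    unknown = set(raw) - known
    if unknown:
        raise KeyError(f"Unknown config keys: {sorted(unknown)}")
    return ModelConfig(**raw)


# Keys that are not given fall back to the dataclass defaults.
config = config_from_mapping({"cv_folds": 10, "n_trials": 25})
print(config)
```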

Hyperparameter Search Spaces

Each algorithm has its own search space:

# KNN Search Space
{
    'n_neighbors': range(1, 51),
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'p': [1, 2, 3]
}

# Random Forest Search Space
{
    'n_estimators': range(50, 301, 50),
    'max_depth': range(3, 21),
    'min_samples_split': range(2, 21),
    'min_samples_leaf': range(1, 11)
}

Folder Structure

ML-Ensemble-Regression-Studio/
├── src/                          # Source code
│   ├── data_loader.py           # Data loading and validation
│   ├── preprocessor.py          # Feature engineering
│   ├── hyperparameter_tuner.py  # Optuna optimization
│   ├── model_trainer.py         # Multi-model training
│   ├── visualization.py         # Advanced plotting
│   └── main.py                  # Pipeline orchestration
├── tests/                       # Test suite
│   ├── test_data_loader.py
│   ├── test_preprocessor.py
│   └── test_model_trainer.py
├── models/                      # Generated artifacts
│   ├── saved_models/           # Serialized models
│   └── model_comparison/       # Results and plots
├── data/                       # Data directory
│   └── Salary_dataset.csv      # Example dataset
├── config/                     # Configuration files
│   └── config.yaml             # YAML configuration
├── notebooks/                  # Jupyter notebooks
│   ├── 01_eda.ipynb           # Exploratory analysis
│   └── 02_model_training.ipynb # Model experiments
├── requirements.txt            # Python dependencies
├── setup.py                   # Package installation
├── Dockerfile                 # Container configuration
└── README.md                  # Documentation

Roadmap

Short-term Enhancements

  • Integration of additional algorithms including XGBoost and LightGBM
  • Automated feature selection and engineering
  • Enhanced model interpretation with SHAP and LIME
  • Time series cross-validation support
  • Hyperparameter search space customization

Medium-term Vision

  • Distributed training support for large datasets
  • Automated model documentation generation
  • REST API for model serving and inference
  • Real-time model monitoring and drift detection
  • Automated report generation for stakeholders

Long-term Objectives

  • Federated learning capabilities for privacy preservation
  • Multi-modal data support including text and image features
  • Automated machine learning (AutoML) pipeline
  • Cloud deployment templates for major platforms
  • Advanced anomaly detection and data quality monitoring

Acknowledgments

This project builds upon established best practices in machine learning engineering and leverages the excellent open-source ecosystem around Python data science. Special thanks to the contributors of Scikit-learn, Optuna, MLflow, and the broader scientific Python community for maintaining the foundational tools that make projects like this possible.


✨ Author

M Wasif Anwar
AI/ML Engineer | Effixly AI

⭐ *"Transforming raw data into predictive insights through elegant machine learning architecture."*



⭐ Don't forget to star this repository if you find it helpful!
