hasancatalgol/linear-regression

Linear Regression Diagnostics Toolkit

This repository implements a modular Python pipeline for analyzing and validating linear regression assumptions using the Body Measurements Dataset. It focuses on statistical rigor through assumption checks, not just predictive performance.

🧠 Overview

The codebase provides tools to test and visualize the four critical assumptions of linear regression:

  1. Linearity
  2. Independence of errors
  3. Homoscedasticity (constant variance)
  4. Normality of residuals

All diagnostics are implemented as reusable Python modules; no Jupyter notebooks are involved.

📂 File Structure

.
├── main.py                    # Entry point for model fitting and diagnostics
├── utils.py                   # Helper functions (e.g., data cleaning, visualization)
├── check_linearity.py         # Linearity test via partial regression plots
├── check_normality.py         # Normality check (Q-Q plot, Shapiro-Wilk test)
├── check_homoscedasticity.py  # Breusch-Pagan & White tests
├── check_independence.py      # Durbin-Watson test and residual autocorrelation
└── data/                      # (Expected) Contains the CSV dataset

📊 Dataset

  • Source: Kaggle - Body Measurements Dataset
  • Content: Anthropometric data such as age, height, weight, and body part circumferences
  • Target Variable: Varies; commonly height, weight, or body fat percentage

🚀 How to Run

Make sure you have uv installed, then install the dependencies:

uv pip install -r pyproject.toml

Then run the diagnostics:

uv run main.py

✨ The pipeline will fit a linear regression model and sequentially check each assumption, printing results and plotting visuals using matplotlib and seaborn.


🧪 Diagnostic Modules

Each assumption test is cleanly separated into its own script:

check_linearity.py

  • Uses residual plots and added variable plots
  • Highlights multicollinearity issues via VIFs

check_independence.py

  • Calculates Durbin-Watson statistic
  • Plots residuals against time/index order

check_homoscedasticity.py

  • Breusch-Pagan and White tests
  • Residuals vs fitted plot with confidence bands

check_normality.py

  • Shapiro-Wilk test
  • Histogram and Q-Q plots

📦 Dependencies

All dependencies are defined in pyproject.toml and managed with uv.

Main packages used:

  • pandas, numpy
  • scikit-learn
  • statsmodels
  • seaborn, matplotlib
  • scipy

🎯 Key Highlights

  • 🧱 Modular design for each assumption
  • 🧪 Automated statistical testing + visual plots
  • 🧰 Ready for integration into larger ML workflows
  • 🔍 Emphasizes statistical validation before prediction
