This repository implements a modular Python pipeline for analyzing and validating linear regression assumptions using the Body Measurements Dataset. It focuses on statistical rigor through assumption checks, not just predictive performance.
The codebase provides tools to test and visualize the four critical assumptions of linear regression:
- Linearity
- Independence of errors
- Homoscedasticity (constant variance)
- Normality of residuals
All diagnostics are handled by reusable Python modules; no Jupyter notebooks are involved.
.
├── main.py                    # Entry point for model fitting and diagnostics
├── utils.py                   # Helper functions (e.g., data cleaning, visualization)
├── check_linearity.py         # Linearity test via partial regression plots
├── check_normality.py         # Normality check (Q-Q plot, Shapiro-Wilk test)
├── check_homoscedasticity.py  # Breusch-Pagan & White tests
├── check_independence.py      # Durbin-Watson test and residual autocorrelation
└── data/                      # (Expected) Contains the CSV dataset
- Source: Kaggle - Body Measurements Dataset
- Content: Anthropometric data such as age, height, weight, and body part circumferences
- Target Variable: Varies; commonly height, weight, or body fat percentage
Make sure you have uv installed, then install the dependencies:
    uv pip install -r pyproject.toml
Then run the diagnostics:
    uv run main.py
✨ The pipeline will fit a linear regression model and sequentially check each assumption, printing results and plotting visuals using matplotlib and seaborn.
Each assumption test is cleanly separated into its own script:
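- Linearity (check_linearity.py)
  - Uses residual plots and added variable plots
  - Highlights multicollinearity issues via VIFs
- Independence of errors (check_independence.py)
  - Calculates Durbin-Watson statistic
  - Plots residuals against time/index order
- Homoscedasticity (check_homoscedasticity.py)
  - Breusch-Pagan and White tests
  - Residuals vs fitted plot with confidence bands
- Normality of residuals (check_normality.py)
  - Shapiro-Wilk test
  - Histogram and Q-Q plots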
All dependencies are defined in pyproject.toml and managed with uv.
Main packages used:
- pandas, numpy
- scikit-learn
- statsmodels
- seaborn, matplotlib
- scipy
- 🧱 Modular design for each assumption
- 🧪 Automated statistical testing + visual plots
- 🧰 Ready for integration into larger ML workflows
- 🔍 Emphasizes statistical validation before prediction