A comprehensive machine learning repository dedicated to predicting the Compressive Strength of Concrete. This project goes beyond simple prediction by incorporating rigorous Hyperparameter Tuning, Model Explainability (XAI), and Uncertainty Quantification to ensure robust and reliable engineering applications.
This repository implements a full data science pipeline for concrete strength analysis. It addresses the following key challenges:
- Optimization: Finding the absolute best model configurations using state-of-the-art frameworks like Optuna.
- Transparency: Using SHAP and LIME to unbox "black-box" models and understand feature impact.
- Reliability: Estimating prediction intervals using Conformal Prediction and Probabilistic Regression to gauge model confidence.
The repository is organized into four main modules. Click on the headings to navigate to the files.
1. 🗂️ Data
Contains the raw dataset split into training and testing sets.
- train.csv: Historical data used for model training.
- test.csv: Unseen data used for final evaluation.
2. ⚙️ Hyperparameter Tuning
We employ two distinct approaches to optimization:
Hyperparameter_tuning.ipynb
- Implements Grid Search, Random Search, Bayesian Optimization, and Hyperband.
- Compares convergence speed and final model performance across these methods.
- Optuna_1/: Tests various Optuna Samplers (TPE, CmaEs) and Pruners (Hyperband, Median).
- Optuna_2/: Extended study focusing on maximizing the objective function for complex boosting models.
- Optuna_autosampler/: Investigates Optuna's automatic sampling capabilities.
- Optuna_PGBM/: Specialized tuning for Probabilistic Gradient Boosting Machines.
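To illustrate the search strategies compared in the tuning notebooks, here is a minimal sketch contrasting Grid Search with Random Search using scikit-learn on synthetic data (the parameter grid, model, and data are illustrative stand-ins, not the repository's actual configuration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in for the concrete data (8 features, as in train.csv).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

# Grid Search exhaustively tries all 8 combinations.
grid = GridSearchCV(GradientBoostingRegressor(random_state=0), param_grid,
                    scoring="neg_root_mean_squared_error", cv=3).fit(X, y)

# Random Search samples a budget of configurations from the same space.
rand = RandomizedSearchCV(GradientBoostingRegressor(random_state=0), param_grid,
                          n_iter=4, scoring="neg_root_mean_squared_error",
                          cv=3, random_state=0).fit(X, y)

print("grid best:", grid.best_params_, "RMSE:", -grid.best_score_)
print("random best:", rand.best_params_, "RMSE:", -rand.best_score_)
```

Random Search evaluates only half the budget here, which is why it is typically used first to narrow the space before a Bayesian method refines it.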
3. 🧠 Model Explanations
Model_explainations.ipynb
- SHAP: Generates summary plots and dependence plots to show global feature importance.
- LIME: Explains individual predictions to validate model behavior on specific concrete samples.
4. 📊 Uncertainty Quantification
A critical component for engineering safety.
Conformal Prediction
- Uses MAPIE and PUNCC to generate rigorous prediction intervals (e.g., 90% confidence) with guaranteed coverage properties.
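The mechanics that libraries like MAPIE automate can be sketched from scratch with split conformal prediction; this is an illustrative from-scratch version of the idea, not the repository's MAPIE/PUNCC code:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=8, noise=15.0, random_state=0)
X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X_fit, y_fit)

# Split conformal: calibrate on absolute residuals from a held-out set.
alpha = 0.1  # target 90% coverage
scores = np.abs(y_cal - model.predict(X_cal))
n = len(scores)
# Finite-sample-corrected quantile guarantees the coverage property.
level = np.ceil((n + 1) * (1 - alpha)) / n
q = np.quantile(scores, level, method="higher")

# Interval for any new mix: [prediction - q, prediction + q]
pred = model.predict(X_cal)
lower, upper = pred - q, pred + q
```

The guarantee is distribution-free: regardless of the underlying model, roughly 90% of true strengths fall inside the interval.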
Probabilistic Distribution (IBUG)
- Implements NGBoost and PGBM to output full probability distributions (mean and variance) for each prediction, rather than single point estimates.
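NGBoost and PGBM learn the distribution parameters jointly; as a minimal stand-in for that idea, the sketch below fits one model for the mean and a second for the log-variance, yielding a per-mix Normal distribution (this is an illustrative two-model approximation, not the NGBoost/PGBM implementation):

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=800, n_features=8, noise=20.0, random_state=0)

# Model 1: predicts the conditional mean.
mean_model = GradientBoostingRegressor(random_state=0).fit(X, y)
residuals = y - mean_model.predict(X)

# Model 2: predicts log-variance, so sigma can vary per sample (heteroscedastic).
var_model = GradientBoostingRegressor(random_state=0).fit(
    X, np.log(residuals**2 + 1e-6))

mu = mean_model.predict(X[:3])
sigma = np.sqrt(np.exp(var_model.predict(X[:3])))

# Each prediction is a full N(mu, sigma); e.g., a 90% interval per sample:
lo, hi = stats.norm.interval(0.9, loc=mu, scale=sigma)
```

A large predicted sigma flags concrete mixes the model is unsure about, which is exactly the signal used for engineering safety decisions.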
Quantile Regression
- Trains models to predict specific quantiles (5th and 95th percentiles) directly, providing a non-parametric way to estimate uncertainty.
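Direct quantile training is supported out of the box by gradient boosting via the pinball loss; a minimal sketch with scikit-learn (illustrative data and settings, not the repository's tuned models):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=800, n_features=8, noise=20.0, random_state=0)

# One model per quantile; the pinball (quantile) loss trains each directly.
q05 = GradientBoostingRegressor(loss="quantile", alpha=0.05, random_state=0).fit(X, y)
q95 = GradientBoostingRegressor(loss="quantile", alpha=0.95, random_state=0).fit(X, y)

lower, upper = q05.predict(X), q95.predict(X)
# Empirical coverage of the resulting 90% band (in-sample here).
coverage = np.mean((y >= lower) & (y <= upper))
```

Unlike the Normal-distribution approach, no distributional shape is assumed, which is why this is called non-parametric.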
The dataset comprises physical and chemical properties of concrete. The target variable is Compressive Strength.
| Feature | Description | Unit |
|---|---|---|
| C | Cement content | kg/m³ |
| mp | Mineral Admixtures / Slag | kg/m³ |
| FA | Fine Aggregate | kg/m³ |
| CA | Coarse Aggregate | kg/m³ |
| F | Fly Ash / Filler | kg/m³ |
| W_P | Water-to-Powder Ratio | Ratio |
| Adm | Admixture (Superplasticizer) | kg/m³ |
| str | Compressive Strength (Target) | MPa |
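A quick sketch of the schema in pandas, separating features from the target (the single row of values is purely illustrative, not taken from train.csv):

```python
import pandas as pd

# Column names follow the feature table above; values are made up.
df = pd.DataFrame({
    "C": [540.0], "mp": [0.0], "FA": [676.0], "CA": [1055.0],
    "F": [0.0], "W_P": [0.30], "Adm": [2.5], "str": [79.9],
})

# str (compressive strength, MPa) is the target; everything else is a feature.
X, y = df.drop(columns="str"), df["str"]
```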
Before finalizing a model, we perform extensive tuning to minimize RMSE.
- Model Selection: We evaluate candidate algorithms such as XGBoost, LightGBM, and CatBoost.
- Optimization:
- We start with Random Search to narrow the search space.
- We apply Bayesian Optimization and Optuna (TPE Sampler) to fine-tune learning rates, tree depths, and regularization parameters.
- Final Selection: The configuration with the best cross-validation score is saved for final training.
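The final selection step above can be sketched as a cross-validated RMSE comparison; scikit-learn estimators stand in here for XGBoost/LightGBM/CatBoost:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Stand-in candidates; in the repo these would be the tuned boosting models.
candidates = {
    "gbm": GradientBoostingRegressor(random_state=0),
    "rf": RandomForestRegressor(random_state=0),
}

# Mean cross-validated RMSE per candidate (lower is better).
rmse = {name: -cross_val_score(m, X, y, cv=5,
                               scoring="neg_root_mean_squared_error").mean()
        for name, m in candidates.items()}
best = min(rmse, key=rmse.get)
print(best, rmse[best])
```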
To ensure the model learns physics-compliant rules (e.g., more cement usually equals higher strength):
- We run SHAP analysis on the best performing model.
- We verify that features like C (Cement) and Adm (Admixture) have positive SHAP values.
- We use LIME to audit outliers where the model predicts unusually high or low strength.
We acknowledge that no model is perfect.
- Method A (Conformal): We generate a prediction interval [Lower, Upper]. If the interval is too wide, the model is uncertain about that specific concrete mix.
- Method B (Probabilistic): We model the output as a Normal distribution $\mathcal{N}(\mu, \sigma)$. A high $\sigma$ indicates high uncertainty (aleatoric uncertainty).