
DaneshSelwal/Concrete_data_analysis


🏗️ Concrete Data Analysis & Strength Prediction


A comprehensive machine learning repository dedicated to predicting the Compressive Strength of Concrete. This project goes beyond simple prediction by incorporating rigorous Hyperparameter Tuning, Model Explainability (XAI), and Uncertainty Quantification to ensure robust and reliable engineering applications.


📑 Table of Contents (Navigation)

  1. 📌 Project Overview
  2. 📂 Repository Structure
  3. 📊 Dataset Details
  4. 🛠️ Workflow & Methodology

📌 Project Overview

This repository implements a full data science pipeline for concrete strength analysis. It addresses the following key challenges:

  • Optimization: Searching for strong model configurations with dedicated tuning frameworks such as Optuna.
  • Transparency: Using SHAP and LIME to unbox "black-box" models and understand feature impact.
  • Reliability: Estimating prediction intervals using Conformal Prediction and Probabilistic Regression to gauge model confidence.

📂 Repository Structure

The repository is organized into four main modules. Click on the headings to navigate to the files.

1. 🗂️ Data

Contains the raw dataset split into training and testing sets.

  • train.csv: Historical data used for model training.
  • test.csv: Unseen data used for final evaluation.

2. 🎛️ Hyperparameter Tuning

We employ two distinct approaches for optimization:

  • Hyperparameter_tuning.ipynb
    • Implements Grid Search, Random Search, Bayesian Optimization, and Hyperband.
    • Compares convergence speed and final model performance across these methods.
  • Optuna_1/: Tests various Optuna samplers (TPE, CMA-ES) and pruners (Hyperband, Median).
  • Optuna_2/: Extended study focusing on maximizing the objective function for complex boosting models.
  • Optuna_autosampler/: Investigates Optuna's automatic sampling capabilities.
  • Optuna_PGBM/: Specialized tuning for Probabilistic Gradient Boosting Machines.
3. Model Explainability

  • Model_explainations.ipynb
    • SHAP: Generates summary plots and dependence plots to show global feature importance.
    • LIME: Explains individual predictions to validate model behavior on specific concrete samples.

4. Uncertainty Quantification

A critical component for engineering safety.

  • Conformal Predictions
    • Uses MAPIE and PUNCC to generate prediction intervals at a chosen coverage level (e.g., 90%), with marginal coverage guarantees that hold when the data are exchangeable.
  • Probabilistic Distribution (IBUG)
    • Implements NGBoost and PGBM to output a full probability distribution (mean and variance) for each prediction, rather than a single point estimate.
  • Quantile Regression
    • Trains models to predict specific quantiles (5th and 95th percentiles) directly, providing a non-parametric way to estimate uncertainty.
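
Quantile regression relies on the pinball (quantile) loss. The sketch below is only an illustration of that loss, not code from the notebooks: the strength values are made up, and `best_constant` is a hypothetical helper that picks the single constant prediction with the lowest loss, which lands on the corresponding empirical quantile.

```python
def pinball_loss(y_true, y_pred, q):
    """Average pinball (quantile) loss for a quantile level q in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)


def best_constant(y_true, q):
    """Hypothetical helper: the observed value that minimizes the loss
    when used as a constant prediction for every sample."""
    n = len(y_true)
    return min(y_true, key=lambda c: pinball_loss(y_true, [c] * n, q))


# Made-up compressive strengths in MPa: 20, 22, ..., 60 (21 values).
strengths = [float(v) for v in range(20, 61, 2)]

lower = best_constant(strengths, 0.05)  # 5th-percentile estimate
upper = best_constant(strengths, 0.95)  # 95th-percentile estimate
inside = sum(lower <= s <= upper for s in strengths)
print(lower, upper, f"{inside}/{len(strengths)} values inside")
```

A model trained with this loss at q = 0.05 and q = 0.95 yields a lower and an upper bound per sample, and the pair brackets roughly 90% of the data.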

📊 Dataset Details

The dataset comprises physical and chemical properties of concrete. The target variable is Compressive Strength.

| Feature | Description | Unit |
|---|---|---|
| C | Cement content | kg/m³ |
| mp | Mineral Admixtures / Slag | kg/m³ |
| FA | Fine Aggregate | kg/m³ |
| CA | Coarse Aggregate | kg/m³ |
| F | Fly Ash / Filler | kg/m³ |
| W_P | Water-to-Powder Ratio | Ratio (unitless) |
| Adm | Admixture (Superplasticizer) | kg/m³ |
| str | Compressive Strength (Target) | MPa |

🛠️ Workflow & Methodology

Phase 1: Hyperparameter Tuning

Before finalizing a model, we perform extensive tuning to minimize RMSE.

  1. Model selection: We select gradient-boosting algorithms such as XGBoost, LightGBM, and CatBoost.
  2. Optimization:
    • We start with Random Search to narrow the search space.
    • We apply Bayesian Optimization and Optuna (TPE Sampler) to fine-tune learning rates, tree depths, and regularization parameters.
  3. Final selection: The configuration with the best cross-validation score (lowest RMSE) is saved for training.
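
The coarse-then-fine idea in step 2 can be sketched in plain Python. Everything here is an assumption made for illustration: a one-feature ridge regression stands in for the boosting models, the data are synthetic, and the second stage is simply another random search over the narrowed range rather than Bayesian Optimization or Optuna's TPE sampler.

```python
import math
import random


def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge slope for y ~ w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cv_rmse(xs, ys, lam, k=5):
    """Mean RMSE over k contiguous folds."""
    n = len(xs)
    scores = []
    for fold in range(k):
        held_out = set(range(fold * n // k, (fold + 1) * n // k))
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in held_out]
        test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i in held_out]
        w = fit_ridge_1d([x for x, _ in train], [y for _, y in train], lam)
        mse = sum((y - w * x) ** 2 for x, y in test) / len(test)
        scores.append(math.sqrt(mse))
    return sum(scores) / k


def random_search(xs, ys, low, high, n_trials, rng):
    """Sample lam log-uniformly in [low, high]; return (rmse, lam) sorted by rmse."""
    trials = []
    for _ in range(n_trials):
        lam = 10 ** rng.uniform(math.log10(low), math.log10(high))
        trials.append((cv_rmse(xs, ys, lam), lam))
    return sorted(trials)


# Synthetic data: y = 3x plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0, 1) for _ in range(100)]
ys = [3.0 * x + rng.gauss(0, 0.5) for x in xs]

# Stage 1: coarse random search over a wide range.
coarse = random_search(xs, ys, 1e-4, 1e2, n_trials=30, rng=rng)
# Stage 2: search again inside the range spanned by the five best trials.
top = [lam for _, lam in coarse[:5]]
fine = random_search(xs, ys, min(top), max(top), n_trials=30, rng=rng)

best_rmse, best_lam = min(coarse[0], fine[0])
print(f"best lambda = {best_lam:.4g}, CV RMSE = {best_rmse:.3f}")
```

Keeping the best of both stages means the refined search can never report a worse score than the coarse one.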

Phase 2: Model Explainability

To check that the model learns physics-compliant rules (e.g., more cement usually yields higher strength):

  • We run SHAP analysis on the best performing model.
  • We verify that higher values of features like C (Cement) and Adm (Admixture) correspond to positive SHAP values, i.e., that they push the predicted strength upward.
  • We use LIME to audit outliers where the model predicts unusually high or low strength.
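
SHAP values are signed per sample, so the direction check above amounts to asking whether a feature's values and its SHAP values move together. The sketch below illustrates this under strong assumptions: it uses a toy linear model, for which SHAP values have the closed form w_j · (x_j − mean_j) when features are treated as independent, whereas boosted-tree models would need the SHAP library. The weights and mixes are made up, and `direction` is a hypothetical helper.

```python
def linear_shap(weights, X):
    """Per-sample SHAP values for f(x) = b + sum_j w_j * x_j, assuming
    independent features: phi_j = w_j * (x_j - mean_j)."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(weights))]
    return [[w * (row[j] - means[j]) for j, w in enumerate(weights)] for row in X]


def direction(feature_values, shap_values):
    """Sign of the covariance between a feature and its SHAP values:
    +1 if larger values push predictions up, -1 if down, 0 if flat."""
    n = len(feature_values)
    mf = sum(feature_values) / n
    ms = sum(shap_values) / n
    cov = sum((f - mf) * (s - ms) for f, s in zip(feature_values, shap_values))
    return (cov > 0) - (cov < 0)


# Hypothetical toy model and made-up mixes; columns are [C, W_P].
weights = [0.08, -40.0]
X = [[300, 0.50], [350, 0.45], [400, 0.40], [450, 0.42], [500, 0.35]]

phi = linear_shap(weights, X)
labels = {1: "up", -1: "down", 0: "neither"}
for j, name in enumerate(["C", "W_P"]):
    d = direction([row[j] for row in X], [p[j] for p in phi])
    print(f"{name} pushes predicted strength {labels[d]}")
```

With real SHAP output, the same covariance check can be run per feature on the explainer's value matrix.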

Phase 3: Uncertainty Analysis

We acknowledge that no model is perfect.

  • Method A (Conformal): We generate a prediction interval [Lower, Upper]. A wide interval signals that the model is uncertain about that specific concrete mix.
  • Method B (Probabilistic): We model the output as a Normal distribution $\mathcal{N}(\mu, \sigma^2)$. A high $\sigma$ indicates high (aleatoric) uncertainty.
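
Both methods reduce to a few lines of arithmetic. The sketch below is a minimal plain-Python illustration, not the MAPIE, PUNCC, NGBoost, or PGBM code used in the notebooks: Method A is shown as split conformal prediction on absolute residuals, Method B as the central interval of a Normal distribution, and the calibration data, prediction, and σ are all made up.

```python
import math
from statistics import NormalDist


def conformal_halfwidth(cal_true, cal_pred, alpha=0.1):
    """Split conformal: the ceil((n + 1) * (1 - alpha))-th smallest
    absolute residual on a held-out calibration set."""
    scores = sorted(abs(y - p) for y, p in zip(cal_true, cal_pred))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]


def gaussian_interval(mu, sigma, alpha=0.1):
    """Central (1 - alpha) interval of N(mu, sigma^2)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return mu - z * sigma, mu + z * sigma


# Made-up calibration set (MPa): absolute residuals are 0.5, 1.0, ..., 9.5.
cal_pred = [40.0] * 19
cal_true = [40.0 + 0.5 * k * (-1) ** k for k in range(1, 20)]

# Method A: the same half-width is added around every new point prediction.
q_hat = conformal_halfwidth(cal_true, cal_pred, alpha=0.1)
new_pred = 37.5
conformal = (new_pred - q_hat, new_pred + q_hat)

# Method B: the interval width scales with the predicted sigma.
gaussian = gaussian_interval(mu=37.5, sigma=3.0, alpha=0.1)

print("conformal:", conformal)
print("gaussian: ", tuple(round(v, 2) for v in gaussian))
```

In this basic form the conformal interval has the same width for every mix. Intervals that vary per sample require an adaptive variant, such as conformalizing quantile predictions.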
