A comprehensive machine learning framework designed to predict Suspended Sediment Load (
Accurate sediment load prediction is vital for reservoir management and hydraulic design. This repository provides a robust pipeline for analyzing hydrometric data using advanced Machine Learning regressors. Key features include:
- Advanced Optimization: Automated hyperparameter tuning using Optuna with various pruners and samplers.
- Interval Estimation: Quantile Regression to estimate the range of probable sediment loads.
- Full Distribution Modeling: Using NGBoost and PGBM to predict the entire probability distribution of the target.
- Time-Series Uncertainty: Specialized NEXCP (Non-Exchangeable Conformal Prediction) and Adaptive CP methods to handle data drift and temporal dependencies, ensuring valid coverage even during extreme events.
The project is organized into modular directories representing the analysis pipeline.
1. 🗂️ Data
Contains the hydrological time-series dataset.
- train.csv: Historical data with lagged features used for model training.
- test.csv: Held-out data for final model evaluation.
2. 🎛️ Hyperparameter Tuning
Before training the uncertainty quantification (UQ) models, base regressors are optimized to minimize prediction error.
- Optuna_autosampler_SEDIMENT.ipynb: Implements Bayesian Optimization via Optuna to find optimal parameters.
- Models/: Stores serialized (.pkl) model files (e.g., CatBoost_HyperbandPruner_BEST.pkl).
- test_results.xlsx: Detailed comparative scores of different pruners and samplers.
3. 📉 Quantile Regression
Focuses on predicting conditional quantiles (e.g.,
- Quantile_Regression_SEDIMENT.ipynb: Trains models to minimize Pinball Loss.
- Results/: Contains prediction files for XGBoost, LightGBM, CatBoost, HGBM, GPBoost, and PGBM.
Treats the target variable as a distribution to capture aleatoric uncertainty.
- Probabilistic__Distribution_SEDIMENT.ipynb: Implements NGBoost and PGBM to output $\mu$ (mean) and $\sigma$ (standard deviation).
- Results/: Includes CRPS (Continuous Ranked Probability Score) metrics and Calibration plots.
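As a rough illustration of how CRPS can be computed from a predicted $\mu$ and $\sigma$, the sketch below uses the closed-form expression for a Gaussian predictive distribution. The Gaussian assumption and the function name `crps_gaussian` are illustrative choices here, not taken from the notebook, which may use a different distribution or a library routine.

```python
import math

def _normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS of a N(mu, sigma^2) forecast for observation y.

    Lower is better; the score is in the same units as y.
    """
    z = (y - mu) / sigma
    return sigma * (
        z * (2.0 * _normal_cdf(z) - 1.0)
        + 2.0 * _normal_pdf(z)
        - 1.0 / math.sqrt(math.pi)
    )

# Toy forecast (hypothetical values): the observation equals the predicted mean.
score = crps_gaussian(y=5.0, mu=5.0, sigma=1.0)
print(round(score, 4))
```

Averaging this score over the test set gives a single number that rewards forecasts that are both sharp and well centered.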
Applies rigorous statistical calibration. This module is split into standard methods and advanced time-series methods.
- Conformal Predictions(MAPIE,PUNCC)_SEDIMENT.ipynb: Standard Split Conformal Prediction and CV+.
- Conformal_Predictions(NEXCP, Adaptive CP, mfcs)_SEDIMENT.ipynb: Implements NEXCP and Adaptive Conformal Prediction to handle distribution shifts in sediment data.
The analysis is based on hydrometric time-series data containing current flow and historical lags:
| Feature | Description | Role |
|---|---|---|
| Qt | Discharge (Flow) at current time | Predictor |
| Qt-1 | Discharge at previous time step | Predictor (Lag) |
| St-1 | Sediment Load at previous time step | Predictor (Lag) |
| St | Target Variable: Sediment Load at current time | Target |
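The lag structure in the table can be sketched as follows. This is a minimal illustration that assumes two equally long, time-ordered series of discharge and sediment load; the helper name `make_lagged_rows` and the toy values are hypothetical and do not come from train.csv.

```python
def make_lagged_rows(discharge, sediment):
    """Build one row per time step t >= 1.

    Predictors: Qt, Qt-1, St-1. Target: St.
    The first time step is dropped because it has no previous value.
    """
    if len(discharge) != len(sediment):
        raise ValueError("discharge and sediment must have the same length")
    rows = []
    for t in range(1, len(discharge)):
        rows.append({
            "Qt": discharge[t],
            "Qt-1": discharge[t - 1],
            "St-1": sediment[t - 1],
            "St": sediment[t],
        })
    return rows

# Toy values for illustration only (not taken from the dataset).
Q = [10.0, 12.0, 30.0, 18.0]
S = [1.0, 1.5, 6.0, 2.5]
rows = make_lagged_rows(Q, S)
for row in rows:
    print(row)
```

Because each row depends on the previous time step, any train/test split should respect time order so that lagged values in the test set never leak from the future.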
We utilize Optuna to optimize regressors (XGBoost, CatBoost, etc.).
- Search Space: Defined for Learning Rate, Depth, Regularization, and Estimators.
- Pruning: Tested various pruners (Hyperband, Median, SuccessiveHalving) to efficiently discard unpromising trials.
- Outcome: The best models are saved in the Models/ directory for subsequent UQ phases.
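To give a feel for what a pruner does, the sketch below implements a simplified median stopping rule in plain Python: a running trial is stopped when its intermediate validation error is worse than the median of earlier trials at the same step. This is an illustration of the idea only. It is not Optuna's implementation, and the function name `should_prune` and the error values are hypothetical.

```python
from statistics import median

def should_prune(current_value, step, history, min_trials=2):
    """Decide whether to stop a trial early.

    current_value: intermediate validation error of the running trial (lower is better).
    step: index of the intermediate report (e.g. boosting round checkpoint).
    history: list of lists, one list of intermediate errors per earlier trial.
    min_trials: minimum number of earlier trials needed before pruning is allowed.
    """
    peers = [trial[step] for trial in history if len(trial) > step]
    if len(peers) < min_trials:
        return False  # not enough evidence to compare against
    return current_value > median(peers)

# Hypothetical intermediate errors from three earlier trials.
history = [
    [0.90, 0.60, 0.45],
    [0.85, 0.55, 0.40],
    [0.95, 0.70, 0.65],
]

print(should_prune(0.80, 1, history))  # worse than the median 0.60 at step 1
print(should_prune(0.50, 1, history))  # better than the median, keep training
```

Hyperband and successive halving follow the same spirit but allocate budgets in rounds, keeping only the best fraction of trials at each round.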
Standard regression predicts the conditional mean
- Application: We predict the 5th and 95th percentiles to create a 90% prediction interval.
- Metric: Pinball Loss (Quantile Loss).
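A minimal sketch of the pinball loss and of the empirical coverage of the resulting 90% interval is given below. The function names and the toy numbers are assumptions for illustration; the notebook may rely on library implementations instead.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss at quantile level q, with 0 < q < 1.

    Under-predictions are weighted by q and over-predictions by (1 - q).
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)

def interval_coverage(y_true, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return hits / len(y_true)

# Toy values for illustration only.
y = [2.0, 4.0, 6.0, 8.0]
q05 = [1.0, 3.0, 6.5, 5.0]   # hypothetical 5th-percentile predictions
q95 = [3.0, 5.0, 9.0, 7.0]   # hypothetical 95th-percentile predictions

print(pinball_loss(y, q05, 0.05))      # approximately 0.18125
print(interval_coverage(y, q05, q95))  # 0.5
```

For a well-calibrated 90% interval, the coverage on held-out data should be close to 0.90; the toy interval above covers only half of the points.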
This approach assumes
- Models: NGBoost (Natural Gradient Boosting) and PGBM.
- Analysis: Evaluates the probabilistic fit using NLL (Negative Log-Likelihood) and PIT (Probability Integral Transform) histograms to ensure the model is well-calibrated.
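The two diagnostics can be sketched as follows, assuming a Gaussian predictive distribution parameterized by a mean and a standard deviation. The Gaussian choice, the function names, and the bin count are assumptions made for this example.

```python
import math

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of y under N(mu, sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2.0 * sigma ** 2)

def pit_value(y, mu, sigma):
    """Probability Integral Transform: the predictive CDF evaluated at y."""
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))

def pit_histogram(pits, bins=10):
    """Count PIT values in equal-width bins on [0, 1]."""
    counts = [0] * bins
    for u in pits:
        idx = min(int(u * bins), bins - 1)  # put u == 1.0 into the last bin
        counts[idx] += 1
    return counts

# Toy observations and forecasts (hypothetical values).
observations = [1.2, 0.4, 2.9, 1.8]
means = [1.0, 0.5, 2.0, 2.0]
sigmas = [0.5, 0.5, 1.0, 0.8]

nll = sum(gaussian_nll(y, m, s) for y, m, s in zip(observations, means, sigmas)) / len(observations)
pits = [pit_value(y, m, s) for y, m, s in zip(observations, means, sigmas)]
print(nll)
print(pit_histogram(pits, bins=5))
```

A roughly flat PIT histogram suggests good calibration. A U shape points to intervals that are too narrow, and a hump in the middle points to intervals that are too wide.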
Uses libraries like MAPIE and PUNCC for exchangeable data.
- Methods: Jackknife+, CV+, and Split Conformal.
- Guarantee: Provides marginal coverage guarantees (e.g., 90%), which hold on average over the data rather than conditionally for each individual prediction.
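The split conformal idea can be sketched in a few lines of plain Python, independent of MAPIE or PUNCC: absolute residuals on a held-out calibration set are sorted, and a finite-sample-corrected quantile of them becomes the interval half-width. The function name and the toy calibration values are hypothetical.

```python
import math

def split_conformal_halfwidth(cal_true, cal_pred, alpha=0.1):
    """Half-width of a (1 - alpha) split conformal interval.

    Uses absolute residuals as conformity scores and the
    ceil((n + 1) * (1 - alpha))-th smallest score as the quantile.
    Returns infinity when the calibration set is too small for the requested level.
    """
    scores = sorted(abs(y - p) for y, p in zip(cal_true, cal_pred))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return scores[k - 1]

# Toy calibration set: residuals are 1, 2, ..., 10 (illustration only).
cal_true = [float(i) for i in range(1, 11)]
cal_pred = [0.0] * 10

q = split_conformal_halfwidth(cal_true, cal_pred, alpha=0.2)
new_prediction = 4.0  # hypothetical point forecast from the base regressor
print((new_prediction - q, new_prediction + q))
```

The same half-width is applied to every new prediction, which is why the coverage guarantee relies on calibration and test data being exchangeable.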
Crucial for Sediment Data: Sediment load data often violates the "exchangeability" assumption due to temporal dependencies.
- NEXCP: Weights recent observations higher to account for distribution drift.
- Adaptive CP: Dynamically adjusts the prediction interval width ($C_t$) based on recent errors.
- Benefit: Maintains valid coverage even during "bursty" periods like floods or seasonal shifts.
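Both ideas can be sketched in plain Python. The first function computes a recency-weighted quantile of past absolute residuals with geometric weights, in the spirit of NEXCP. The second applies one common adaptive update, in which the working miscoverage level is lowered after a miss (widening the next interval) and raised after a hit. The decay `rho`, step size `gamma`, and function names are assumptions for illustration and may differ from the notebook's implementation.

```python
def nexcp_halfwidth(residuals, alpha=0.1, rho=0.99):
    """Weighted (1 - alpha) quantile of absolute residuals.

    residuals: ordered from oldest to newest.
    rho: decay in (0, 1]; smaller values discount old residuals faster.
    The test point carries weight 1, placed on +infinity.
    """
    n = len(residuals)
    weights = [rho ** (n - i) for i in range(n)]  # newest residual gets weight rho
    total = sum(weights) + 1.0
    pairs = sorted(zip((abs(r) for r in residuals), weights))
    cumulative = 0.0
    for score, weight in pairs:
        cumulative += weight / total
        if cumulative >= 1 - alpha:
            return score
    return float("inf")  # not enough weighted mass below the requested level

def adaptive_alpha_update(alpha_t, covered, target_alpha=0.1, gamma=0.05):
    """One adaptive step: alpha_{t+1} = alpha_t + gamma * (target_alpha - err_t)."""
    err = 0.0 if covered else 1.0
    return alpha_t + gamma * (target_alpha - err)

# Toy residual history with a recent jump in error size (illustration only).
history = [1.0, 1.0, 1.0, 10.0, 10.0, 10.0]
print(nexcp_halfwidth(history, alpha=0.6, rho=1.0))  # equal weights
print(nexcp_halfwidth(history, alpha=0.6, rho=0.5))  # recent errors dominate

alpha_t = 0.1
alpha_t = adaptive_alpha_update(alpha_t, covered=False)  # a miss lowers alpha
print(alpha_t)
```

With `rho=1.0` the weighted quantile reduces to the standard split conformal quantile, so the decay parameter controls how far the method departs from the exchangeable case.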
This repository is a collaborative project developed by Prakriti Bisht and Danesh Selwal.