This repository contains the implementation of the experiments from the paper "Uncertainty Quantification of Surrogate Models using Conformal Prediction". The code provides a framework for obtaining statistically guaranteed error bars on neural network predictions for spatio-temporal PDE systems using conformal prediction (CP). Data and trained models can be found via the Google Drive link.
Conformal prediction provides distribution-free, model-agnostic uncertainty quantification with guaranteed coverage. This repository demonstrates CP across four classical PDEs, showing how to:
- Train surrogate models (MLPs, U-Nets, FNOs) for PDE systems
- Apply three nonconformity scoring methods (CQR, AER, STD)
- Obtain calibrated prediction intervals with guaranteed coverage
- Validate coverage on independent test sets
- Handle out-of-distribution scenarios
Key Result: CP provides valid uncertainty bounds regardless of model architecture or training procedure, requiring only exchangeability between calibration and test data.
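Formally, given n exchangeable calibration points and a miscoverage level α, split conformal prediction yields prediction sets C(x) with the standard marginal guarantee P(y ∈ C(x)) ≥ 1 − α.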
All experiments follow this 6-step framework:
1. Generate/gather training data (simulation or experimental)
2. Train surrogate model (or use pre-trained model)
3. Generate/gather calibration dataset
4. Compute nonconformity scores via CP framework
5. Construct prediction sets with guaranteed coverage (1-α)
6. Validate coverage on independent test set
Code to generate data, train surrogate models, and perform CP for the base experiments is given here. For the more advanced experiments, see the repositories linked below:
Plasma Modelling, Foundation Physics Models, Neural Weather Models
👉 A live interactive demo of our method applied to the Poisson equation can be found on this server
To run the interactive demo locally, navigate to Demos/ and execute streamlit run streamlit_poisson_cp.py once the requirements are installed. A README included with the demo will help you get started.
.
├── Poisson/ # 1D Poisson equation experiments
├── Conv_Diff/ # 1D Convection-Diffusion experiments
├── Wave/ # 2D Wave equation experiments
├── Navier_Stokes/ # 2D Navier-Stokes experiments (FNO data)
└── README.md
Physics: Steady-state elliptic PDE modeling electrostatics, gravitation, and potential fields.
Domain: 1D spatial domain [0, 1] discretized into 32 points
Data Generation:
- 7,000 simulations using finite difference methods (py-pde package)
- Training: 5,000 samples
- Calibration: 1,000 samples
- Validation: 1,000 samples
- Parameter: Initial field value u_init ~ U(0, 4)
Model: 3-layer MLP (64 neurons per layer)
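For reference, a minimal PyTorch sketch of such an MLP; the 32-dimensional input/output and ReLU activations are assumptions, see Poisson_NN_CP.py for the exact architecture:

import torch.nn as nn

# 3 hidden layers of 64 neurons each; the 32-point input/output assumes the 1D grid above
poisson_mlp = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 32),
)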
Workflow:
The Poisson experiment includes data generation, model training, and CP evaluation all in a single script:
cd Poisson/
# This single script performs:
# 1. Data generation using py-pde solver
# 2. Model training for CQR (3 models), AER (1 model), STD (1 model with dropout)
# 3. Conformal prediction calibration and evaluation
python Poisson_NN_CP.py

The script will:
- Generate PDE solutions using finite difference methods
- Train separate models for each CP method
- Perform calibration on the calibration set
- Validate coverage on the test set
- Generate visualization plots
Results: All methods achieve ~90% coverage with tight error bars (AER: 0.002 normalized units)
Physics: Spatio-temporal system combining diffusion (smoothing) and convection (transport).
Domain:
- Spatial: x ∈ [0, 10] with 200 points
- Temporal: t ∈ [0, 0.1] with 100 time steps (downsampled to 20)
Data Generation:
- 5,000 simulations using forward-time centered-space finite difference
- Training: 3,000 samples (Latin hypercube sampling)
- Calibration: 1,000 samples (out-of-distribution)
- Validation: 1,000 samples (out-of-distribution)
Sampling Parameters:
Training:
| Parameter | Domain | Type |
|---|---|---|
| Diffusion Coefficient (D) | [sin(x/π), sin(x/2π)] | Continuous |
| Convection velocity (c) | [0.1, 0.5] | Continuous |
| Mean (μ) | [1.0, 8.0] | Continuous |
| Variance (σ²) | [0.25, 0.75] | Continuous |
Calibration/Validation (Out-of-Distribution):
| Parameter | Domain | Type |
|---|---|---|
| Diffusion Coefficient (D) | [sin(x/2π), sin(x/4π)] | Continuous |
| Convection velocity (c) | [0.5, 1.0] | Continuous |
| Mean (μ) | [1.0, 8.0] | Continuous |
| Variance (σ²) | [0.25, 0.75] | Continuous |
Model: 1D U-Net (4 encoder-decoder levels, batch normalization, tanh activation)
- Maps first 10 time steps → next 10 time steps
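A minimal sketch of how the input/target windows might be sliced from the saved simulations; the file name, key, and array layout are illustrative, not the exact ones used in DataGen.py:

import numpy as np

data = np.load('Data/conv_diff_train.npz')   # hypothetical file name and key
u = data['u']                                # assumed shape: (n_sims, 20, 200)
inputs  = u[:, :10, :]                       # first 10 time steps → U-Net input
targets = u[:, 10:, :]                       # next 10 time steps → U-Net target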
Workflow:
cd Conv_Diff/
# Step 1: Generate training data
# Generates 3,000 training samples with in-distribution parameters
python DataGen.py
# Step 2: Generate calibration and validation data
# Generates 2,000 OOD samples (1,000 calibration + 1,000 validation)
# Edit DataGen.py to change the parameter bounds to the OOD regime, then rerun:
python DataGen.py
# Step 3: Train model and perform CP evaluation
# This script trains the U-Net and evaluates all three CP methods
python CD_UNet_CP.py

Script Details:
DataGen.py:
- Uses RunSim.py to solve the convection-diffusion PDE
- Performs Latin hypercube sampling over parameter space
- Saves data as .npz files in the Data/ directory
CD_UNet_CP.py:
- Loads generated data
- Trains separate U-Nets for CQR (3 quantile models), AER (1 model), STD (1 model with dropout)
- Performs conformal calibration
- Validates coverage and generates plots
Results: CP successfully calibrates even with distribution shift:
- CQR: 25.53% → 93.05% coverage after calibration
- STD: 88.43% → 90.29% coverage after calibration
Physics: 2D wave equation modeling wave propagation in acoustics, optics, and quantum mechanics.
Domain:
- Spatial: [−1, 1]² with 33×33 grid
- Temporal: t ∈ [0, 1] with 80 time steps
- Wave velocity: c (constant for training, c/2 for out-of-distribution testing)
Data Generation:
- Spectral solver with leapfrog time discretization
- Chebyshev spectral method for spatial discretization
- Training: 500 simulations
- Calibration: 1,000 simulations
- Validation: 1,000 simulations
Initial Condition Parameters (Latin hypercube sampling):
| Parameter | Domain | Type |
|---|---|---|
| Amplitude (α) | [10, 50] | Continuous |
| X position (β) | [0.1, 0.5] | Continuous |
| Y position (γ) | [0.1, 0.5] | Continuous |
Models:
- U-Net (Feed-forward):
  - Maps 20 input time steps → 30 output time steps
  - Output shape: [30, 33, 33]
  - Architecture: 2 encoder-decoder levels with batch normalization
- FNO (Autoregressive):
  - Maps 20 initial time steps → recursively predicts the next 10 steps (a rollout sketch is given below)
  - Total output: 60 time steps
  - Output shape: [60, 33, 33]
  - 6 Fourier layers with width 32
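A minimal sketch of the autoregressive rollout, assuming a model that maps a (batch, T_in, 33, 33) window to the next 10 frames; the exact tensor layout in Wave_FNO.py may differ:

import torch

def autoregressive_rollout(model, init_window, n_total=60):
    # init_window: (batch, 20, 33, 33) tensor holding the initial time steps (assumed layout)
    frames = init_window
    with torch.no_grad():
        while frames.shape[1] < n_total:
            window = frames[:, -init_window.shape[1]:]   # last 20 frames as the input window
            pred = model(window)                         # predicts the next 10 frames
            frames = torch.cat([frames, pred], dim=1)
    return frames[:, :n_total]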
Workflow:
cd Wave/
# Step 1: Generate training, calibration, and validation data
# Uses spectral solver to generate wave solutions
python Spectral_Wave_Data_Gen.py
# The script includes functions:
# - wave_solution(): Solves 2D wave equation for given parameters
# - LHS_Sampling(): Generates data using Latin hypercube sampling
# - Parameter_Scan(): Alternative grid-based sampling (optional)
# Step 2a: Train U-Net model
# Trains U-Net for feed-forward prediction
python Wave_UNet.py
# Step 2b: Train FNO model
# Trains FNO for autoregressive prediction
python Wave_FNO.py
# Step 3a: Evaluate U-Net with Conformal Prediction
# Loads trained U-Net and performs CP calibration/validation
python Wave_Unet_CP.py
# Step 3b: Evaluate FNO with Conformal Prediction
# Loads trained FNO and performs CP calibration/validation
python Wave_FNO_CP.py
# Optional: Out-of-distribution testing (half wave speed)
python tests/Wave_Unet_CP_halfspeed.py
python tests/Wave_FNO_CP_halfspeed.py

Script Details:
Spectral_Wave_Data_Gen.py:
- Implements spectral solver for 2D wave equation
- wave_solution(amplitude, x_pos, y_pos): Generates a single simulation
- LHS_Sampling(): Generates the full dataset with Latin hypercube sampling
- Saves data as a .npz file in the Data/ directory
Training Scripts (Wave_UNet.py, Wave_FNO.py):
- Load generated data from the .npz file
- Train the model for 500 epochs
- Save trained model weights
CP Evaluation Scripts (Wave_Unet_CP.py, Wave_FNO_CP.py):
- Load trained model weights
- Load calibration and validation data
- Compute nonconformity scores (CQR, AER, STD)
- Perform calibration and validate coverage
- Generate visualization plots
Results:
- U-Net: ~90% coverage with all methods (AER most efficient)
- FNO: Coverage closer to the nominal level than the U-Net; maintains validity under distribution shift
Physics: Incompressible 2D Navier-Stokes equations modeling viscous fluid dynamics (vorticity formulation).
Domain:
- Spatial: (0, 1) × (0, 1) grid with 64×64 resolution
- Temporal evolution of vorticity field
Data Source: Uses pre-generated data from the FNO paper (Li et al., 2021)
Setup:
- Training: viscosity ν = 10⁻³
- Calibration/Testing: viscosity ν = 10⁻⁴ (out-of-distribution)
- Maps 10 input time steps → next 10 output time steps
Model: Fourier Neural Operator (FNO)
- Modified with dropout layers for STD-based CP
- 6 Fourier layers
- Modes: 12 (spectral truncation)
- Width: configurable
Workflow:
cd Navier_Stokes/
# Step 1: Download FNO data from original paper
# Download the Navier-Stokes dataset from:
# https://drive.google.com/drive/folders/1UnbQh2WWc6knEHbLn-ZaXrKUZhp7pjt-
# Required files:
# - ns_V1e-3_N5000_T50.mat (training data, ν = 10⁻³)
# - ns_V1e-4_N10000_T30.mat (test data, ν = 10⁻⁴)
# Or use the direct links from the neuraloperator repository:
wget "https://drive.google.com/uc?export=download&id=1r3idxpsHa21ijhlu3QQ1hVvi7SnyHLLT" \
    -O ns_V1e-3_N5000_T50.mat
wget "https://drive.google.com/uc?export=download&id=1ViYaXD8MfH48x5aYiKbEZH4JRO1YZY2f" \
    -O ns_V1e-4_N10000_T30.mat
# Convert .mat files to .npy format (if needed)
# The training script will handle loading the data
# Step 2: Train FNO model
# Trains FNO on ν = 10⁻³ data
python FNO_NS_Zongyi.py
# Step 3: Evaluate with Conformal Prediction
# Loads trained FNO and performs CP on ν = 10⁻⁴ data (out-of-distribution)
python NS_FNO_CP.py

Script Details:
FNO_NS_Zongyi.py:
- Loads Navier-Stokes data from .mat or .npy files
- Implements the FNO architecture (see utils.py for the model definition)
- Trains on ν = 10⁻³ data for 500 epochs
- Saves trained model weights
NS_FNO_CP.py:
- Loads trained FNO model
- Loads ν = 10⁻⁴ data for calibration and testing
- Performs CP using AER and STD methods
- Validates coverage on out-of-distribution regime
- Generates visualization plots
Data Information:
The FNO paper provides two datasets:
- Training data (ν = 10⁻³):
  - 5,000 initial conditions
  - 50 time steps each
  - 64×64 spatial resolution
- Test data (ν = 10⁻⁴):
  - 10,000 initial conditions
  - 30 time steps each
  - 64×64 spatial resolution
Results:
- Uncalibrated dropout: only 7.52% coverage
- After CP calibration: 90.27% coverage (STD method)
- Demonstrates CP's ability to correct severely underestimated uncertainty in OOD settings
Each experiment follows a consistent workflow pattern:
1. Data Generation → 2. Model Training → 3. CP Evaluation
| Experiment | Data Generation | Model Training | CP Evaluation |
|---|---|---|---|
| Poisson | Integrated in main script | Integrated in main script | Poisson_NN_CP.py |
| Conv-Diff | DataGen.py + RunSim.py | Integrated in CP script | CD_UNet_CP.py |
| Wave | Spectral_Wave_Data_Gen.py | Wave_UNet.py / Wave_FNO.py | Wave_Unet_CP.py / Wave_FNO_CP.py |
| Navier-Stokes | Download from FNO paper | FNO_NS_Zongyi.py | NS_FNO_CP.py |
- Poisson: All-in-one script that generates data, trains models, and evaluates CP
- Convection-Diffusion: Separate data generation, then combined training + CP evaluation
- Wave: Fully separated pipeline - data generation, training, and CP evaluation are independent
- Navier-Stokes: Uses external data source (FNO paper), then training and CP evaluation
- Data Generation Scripts (DataGen.py, Spectral_Wave_Data_Gen.py, RunSim.py):
  - Solve PDEs numerically
  - Sample parameter space (Latin hypercube or uniform)
  - Save datasets as .npz or .npy files
- Training Scripts (Wave_UNet.py, Wave_FNO.py, FNO_NS_Zongyi.py):
  - Load generated/downloaded data
  - Train neural network surrogates
  - Save model weights
- CP Evaluation Scripts (*_CP.py):
  - Load trained models and data
  - Compute nonconformity scores for CQR, AER, STD
  - Perform calibration
  - Validate coverage guarantees
  - Generate plots and metrics
The repository implements three nonconformity scoring methods:
Description: Train three separate models to predict quantiles (5th, 50th, 95th percentiles).
Nonconformity Score: s(x,y) = max{q̂_low(x) - y, y - q̂_high(x)}
Prediction Set: [q̂_low(x) - q̂, q̂_high(x) + q̂]
Training Loss: Quantile loss (pinball loss)
Requirements:
- 3 models (or 1 model with 3 outputs)
- No architectural modifications needed
Description: Train a single deterministic model, compute absolute residuals.
Nonconformity Score: s(x,y) = |y - ŷ(x)|
Prediction Set: [ŷ(x) - q̂, ŷ(x) + q̂]
Training Loss: MSE or MAE
Requirements:
- 1 model
- No architectural modifications
- Most computationally efficient
Description: Use probabilistic models that output mean and standard deviation.
Nonconformity Score: s(x,y) = |y - μ(x)| / σ(x)
Prediction Set: [μ(x) - q̂σ(x), μ(x) + q̂σ(x)]
Training Loss: MSE (with dropout) or negative log-likelihood
Requirements:
- 1 probabilistic model
- Architectural modifications: dropout layers (rate 0.1-0.2)
- Monte Carlo sampling at inference (typically 100 samples)
All experiments use similar training setups:
configuration = {
"Epochs": 500,
"Batch Size": 50,
"Optimizer": 'Adam',
"Learning Rate": 0.005,
"Scheduler Step": 100,
"Scheduler Gamma": 0.5,
"Normalisation Strategy": 'Min-Max', # Scale to [-1, 1]
}
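As a rough illustration, the configuration above maps onto a standard PyTorch training loop along these lines; model, train_loader, and loss_fn are placeholders:

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)

for epoch in range(500):
    for inputs, targets in train_loader:      # batches of size 50
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
    scheduler.step()                          # halve the learning rate every 100 epochs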
- For CQR: Train three models with quantile loss (a pinball-loss sketch is given after this list)
  # Train models for the 5th, 50th, and 95th percentiles
  loss = quantile_loss(predictions, targets, quantile=[0.05, 0.5, 0.95])
- For AER: Train one model with MSE loss
  loss = F.mse_loss(predictions, targets)
- For STD: Train one model with dropout + MSE loss
  model = UNetWithDropout(dropout_rate=0.1)
  loss = F.mse_loss(predictions, targets)
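A minimal sketch of the pinball (quantile) loss referenced above; the quantile_loss used in the training scripts may be vectorised differently:

import torch

def pinball_loss(pred, target, q):
    # Penalises under-prediction with weight q and over-prediction with weight (1 - q)
    error = target - pred
    return torch.mean(torch.maximum(q * error, (q - 1) * error))

# e.g. train three models with q = 0.05, 0.5, and 0.95 respectively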
All experiments use Min-Max normalization to scale field values to [-1, 1]:
def normalize(data):
return 2 * (data - data.min()) / (data.max() - data.min()) - 1
def denormalize(normalized_data, original_min, original_max):
return (normalized_data + 1) / 2 * (original_max - original_min) + original_min

To apply CP to a trained surrogate, first load the model and the calibration/test data:

import torch
import numpy as np
from model import UNet1d # or your model architecture
# Load model
model = UNet1d(T_in=10, step=10, width=32)
model.load_state_dict(torch.load('trained_model.pth'))
model.eval()
# Load data
cal_inputs = ... # Shape: (n_cal, T_in, spatial_dims...)
cal_targets = ... # Shape: (n_cal, T_out, spatial_dims...)
test_inputs = ... # Shape: (n_test, T_in, spatial_dims...)
test_targets = ... # Shape: (n_test, T_out, spatial_dims...)

For AER:
def compute_nonconformity_scores(predictions, targets):
"""
Compute absolute error residuals
Args:
predictions: model predictions (n_samples, T_out, ...)
targets: ground truth (n_samples, T_out, ...)
Returns:
scores: nonconformity scores (n_samples, T_out, ...)
"""
return np.abs(targets - predictions)

For CQR:
def compute_nonconformity_scores_cqr(pred_lower, pred_upper, targets):
"""
Compute CQR nonconformity scores
Args:
pred_lower: lower quantile predictions (n_samples, T_out, ...)
pred_upper: upper quantile predictions (n_samples, T_out, ...)
targets: ground truth (n_samples, T_out, ...)
Returns:
scores: nonconformity scores (n_samples, T_out, ...)
"""
return np.maximum(pred_lower - targets, targets - pred_upper)

For STD:
def compute_nonconformity_scores_std(mean, std, targets):
"""
Compute STD-normalized nonconformity scores
Args:
mean: predicted mean (n_samples, T_out, ...)
std: predicted standard deviation (n_samples, T_out, ...)
targets: ground truth (n_samples, T_out, ...)
Returns:
scores: normalized nonconformity scores (n_samples, T_out, ...)
"""
return np.abs(targets - mean) / (std + 1e-10)

def calibrate(cal_scores, n_cal, alpha):
"""
Compute the (1-alpha) quantile of calibration scores
Args:
cal_scores: nonconformity scores from calibration set
n_cal: number of calibration samples
alpha: desired miscoverage rate (e.g., 0.1 for 90% coverage)
Returns:
q_hat: calibrated quantile threshold
"""
# Compute quantile level (accounts for finite sample correction)
q_level = np.ceil((n_cal + 1) * (1 - alpha)) / n_cal
# Compute quantile across calibration set (cell-wise for spatio-temporal data)
q_hat = np.quantile(cal_scores, q_level, axis=0, method='higher')
return q_hat

# Generate calibration predictions
with torch.no_grad():
cal_predictions = model(torch.FloatTensor(cal_inputs)).numpy()
# Compute calibration scores
cal_scores = compute_nonconformity_scores(cal_predictions, cal_targets)
# Calibrate
alpha = 0.1 # 90% coverage
n_cal = cal_inputs.shape[0]
q_hat = calibrate(cal_scores, n_cal, alpha)

For AER:
# Generate test predictions
with torch.no_grad():
test_predictions = model(torch.FloatTensor(test_inputs)).numpy()
# Construct prediction intervals
prediction_sets = [
test_predictions - q_hat, # Lower bound
test_predictions + q_hat # Upper bound
]

For STD (with Monte Carlo Dropout):
def predict_with_dropout(model, inputs, n_samples=100):
"""Generate predictions with MC dropout"""
model.train() # Enable dropout
predictions = []
with torch.no_grad():
for _ in range(n_samples):
pred = model(torch.FloatTensor(inputs)).numpy()
predictions.append(pred)
predictions = np.array(predictions)
mean = predictions.mean(axis=0)
std = predictions.std(axis=0)
return mean, std
# Generate test predictions with uncertainty
test_mean, test_std = predict_with_dropout(model, test_inputs, n_samples=100)
# Construct prediction intervals
prediction_sets = [
test_mean - q_hat * test_std, # Lower bound
test_mean + q_hat * test_std # Upper bound
]

def empirical_coverage(prediction_sets, targets):
"""
Compute empirical coverage
Args:
prediction_sets: [lower_bounds, upper_bounds]
targets: ground truth
Returns:
coverage: fraction of points within prediction intervals
"""
lower, upper = prediction_sets
in_interval = (targets >= lower) & (targets <= upper)
coverage = in_interval.mean()
return coverage
# Validate
coverage = empirical_coverage(prediction_sets, test_targets)
print(f"Empirical coverage: {coverage:.4f}")
print(f"Target coverage: {1-alpha:.4f}")import torch
import numpy as np
from model import UNet1d
# Configuration
alpha = 0.1 # 90% coverage
T_in, T_out = 10, 10
width = 32
# Load model
model = UNet1d(T_in, T_out, width)
model.load_state_dict(torch.load('trained_model.pth'))
# Load data
cal_inputs = np.load('cal_inputs.npy')
cal_targets = np.load('cal_targets.npy')
test_inputs = np.load('test_inputs.npy')
test_targets = np.load('test_targets.npy')
# Step 1: Calibration predictions
model.eval()
with torch.no_grad():
cal_preds = model(torch.FloatTensor(cal_inputs)).numpy()
# Step 2: Compute nonconformity scores
cal_scores = np.abs(cal_targets - cal_preds)
# Step 3: Calibrate
n_cal = cal_inputs.shape[0]
q_level = np.ceil((n_cal + 1) * (1 - alpha)) / n_cal
q_hat = np.quantile(cal_scores, q_level, axis=0, method='higher')
# Step 4: Test predictions
with torch.no_grad():
test_preds = model(torch.FloatTensor(test_inputs)).numpy()
# Step 5: Prediction sets
lower = test_preds - q_hat
upper = test_preds + q_hat
# Step 6: Validate coverage
in_interval = (test_targets >= lower) & (test_targets <= upper)
coverage = in_interval.mean()
print(f"Target coverage: {1-alpha:.2%}")
print(f"Empirical coverage: {coverage:.2%}")
print(f"Average interval width: {(upper - lower).mean():.4f}")CP requires exchangeability between calibration and test data. This means:
✅ Valid:
- Calibration and test from same distribution
- Calibration and test drawn from a distribution that differs from the training data (out-of-distribution CP)
- Using pre-trained models with new calibration data
❌ Invalid:
- Temporal data where test comes after calibration in time series
- Calibration data from biased/selected subset
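A minimal sketch of a random split that preserves exchangeability between calibration and test sets; pool_inputs and pool_targets stand in for whatever held-out data you have:

import numpy as np

# pool_inputs, pool_targets: numpy arrays of held-out samples (placeholders)
rng = np.random.default_rng(0)
idx = rng.permutation(len(pool_inputs))          # shuffle before splitting
cal_idx, test_idx = idx[:1000], idx[1000:2000]
cal_inputs,  cal_targets  = pool_inputs[cal_idx],  pool_targets[cal_idx]
test_inputs, test_targets = pool_inputs[test_idx], pool_targets[test_idx]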
For spatio-temporal outputs, CP is applied independently to each cell:
# predictions shape: (n_samples, T_out, Nx, Ny)
# Each cell (t, x, y) gets its own calibrated quantile
cal_scores = np.abs(targets - predictions) # Shape: (n_cal, T_out, Nx, Ny)
q_hat = np.quantile(cal_scores, q_level, axis=0) # Shape: (T_out, Nx, Ny)

This provides marginal coverage at every spatio-temporal location.
The quantile level includes finite sample correction:
q_level = np.ceil((n_cal + 1) * (1 - alpha)) / n_cal

For large n_cal, this approaches (1 - alpha). For small n_cal, it provides conservative bounds.
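For example, with n_cal = 1000 and α = 0.1, q_level = ⌈1001 × 0.9⌉ / 1000 = 901 / 1000 = 0.901, slightly above the nominal 0.9.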
CP can provide valid coverage even when the test distribution differs from the training distribution (as long as calibration data is available in the new regime):
- Train model on distribution P_train
- Calibrate on distribution P_cal (can differ from P_train)
- Test on distribution P_test
Requirement: P_cal and P_test must be exchangeable (not P_train and P_test)
Example: Convection-Diffusion experiment
- Training: D ∈ [sin(x/π), sin(x/2π)], c ∈ [0.1, 0.5]
- Cal/Test: D ∈ [sin(x/2π), sin(x/4π)], c ∈ [0.5, 1.0]
For surrogate training:
- GPU recommended (experiments used NVIDIA A100)
- Training time varies by problem:
- Poisson: ~5 minutes
- Convection-Diffusion: ~30 minutes
- Wave (U-Net): ~1 hour
- Wave (FNO): ~2 hours
- Navier-Stokes: ~4 hours
For CP calibration:
- CPU sufficient (experiments used a standard laptop)
- Calibration time per method:
- AER: <10 seconds (fastest)
- CQR: <30 seconds
- STD: <500 seconds (slowest due to MC sampling)
Memory:
- Depends on dataset size and spatial resolution
- Typical requirements:
- 1D problems: 4-8 GB RAM
- 2D problems: 16-32 GB RAM
# Core dependencies
torch>=1.12.0
numpy>=1.21.0
scipy>=1.7.0
matplotlib>=3.4.0
# PDE simulation
py-pde>=0.23.0 # For Poisson equation
# Optional (for specific experiments)
h5py>=3.0.0 # For loading FNO datasets

Install all dependencies:
pip install torch numpy scipy matplotlib py-pde h5py

To reproduce the results in Table 1 of the paper, follow these steps for each experiment:
cd Poisson
# All-in-one: data generation + training + CP evaluation
python Poisson_NN_CP.py

cd Conv_Diff
# Step 1: Generate training data (in-distribution)
python DataGen.py
# Step 2: Edit DataGen.py to change parameter bounds
# Change to OOD parameters (see experiment section), then run:
python DataGen.py # Generate calibration/validation data
# Step 3: Train models and evaluate CP
python CD_UNet_CP.py

cd Wave
# Step 1: Generate data
python Spectral_Wave_Data_Gen.py
# Step 2: Train U-Net
python Wave_UNet.py
# Step 3: Evaluate with CP
python Wave_Unet_CP.py
# Optional: OOD testing at half wave speed
python tests/Wave_Unet_CP_halfspeed.py

cd Wave
# Step 1: Generate data (if not done already)
python Spectral_Wave_Data_Gen.py
# Step 2: Train FNO
python Wave_FNO.py
# Step 3: Evaluate with CP
python Wave_FNO_CP.py
# Optional: OOD testing at half wave speed
python tests/Wave_FNO_CP_halfspeed.py

cd Navier_Stokes
# Step 1: Download data from FNO paper
# See experiment section for download links
# Step 2: Train FNO
python FNO_NS_Zongyi.py
# Step 3: Evaluate with CP
python NS_FNO_CP.py

Data Generation Scripts:
- Solve PDEs numerically with varying parameters
- Sample parameter space (Latin hypercube)
- Output: .npz or .npy files in the Data/ directory
Training Scripts:
- Load generated/downloaded data
- Train neural network models (500-1000 epochs)
- Output: Model weights (.pth files) in the Models/ directory
CP Evaluation Scripts:
- Load trained models
- Load calibration and validation data
- Train separate models for each CP method (CQR, AER, STD)
- Compute nonconformity scores
- Perform calibration
- Validate coverage
- Output: Plots and coverage metrics
| Experiment | Model | Method | Uncal. Coverage | Calib. Coverage | Cal. Time (s) | Tightness |
|---|---|---|---|---|---|---|
| Poisson | MLP | AER | - | 90.05% | 0.003 | 0.002 |
| Conv-Diff* | U-Net | CQR | 25.53% | 93.05% | 19.70 | 0.314 |
| Conv-Diff* | U-Net | AER | - | 92.60% | 8.30 | 0.266 |
| Wave | U-Net | AER | - | 94.91% | 3.52 | 0.013 |
| Wave* | FNO | AER | - | 89.24% | 34.18 | 0.330 |
| Navier-Stokes* | FNO | STD | 7.52% | 90.27% | 64.75 | 0.448 |
*Out-of-distribution evaluation
Key Takeaways:
- All methods achieve ~90% target coverage after CP calibration
- AER is fastest and works well when architectures are flexible
- CP successfully handles distribution shift (marked with *)
- Uncalibrated methods can severely underestimate uncertainty (e.g., 7.52% → 90.27%)
If you use this code in your research, please cite:
@misc{gopakumar2024uncertaintyquantificationsurrogatemodels,
title={Uncertainty Quantification of Surrogate Models using Conformal Prediction},
author={Vignesh Gopakumar and Ander Gray and Joel Oskarsson and Lorenzo Zanisi and Stanislas Pamela and Daniel Giles and Matt Kusner and Marc Peter Deisenroth},
year={2024},
eprint={2408.09881},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2408.09881},
}

- Conformal Prediction:
- Neural Operators:
MIT License
For questions or issues, please open an issue on GitHub or contact v.gopakumar@ucl.ac.uk
