QU-bounds: Conformal Prediction for QSAR Models

A model-agnostic open-source pipeline that retrofits any QSAR model with statistically guaranteed uncertainty estimates using adaptive conformal prediction (CP). The framework produces prediction intervals for regression endpoints and prediction sets for classification endpoints at a user-specified nominal confidence level (e.g. 90%), without retraining the underlying model.

Overview

Traditional QSAR Applicability Domain (AD) methods output heuristic similarity scores without formal statistical guarantees. QU-bounds wraps any existing QSAR model as a calibration layer: given only the model's predictions and a calibration set, it derives intervals or label sets with guaranteed marginal coverage under the exchangeability assumption.

Key properties:

No retraining required. Only model predictions and a calibration set are needed.
Statistically guaranteed coverage. At confidence level 1−α, at least (1−α)×100% of test predictions will contain the true value, as a mathematical consequence of the rank-based calibration procedure.
Model-agnostic. Works with regression and classification QSAR models from any platform (VEGA, OPERA, OCHEM, etc.) or trained internally.
AD-aware. Auxiliary models trained on molecular fingerprints (ECFP4) serve as nonconformity measures (NCMs); most existing AD indices can play the same role. Conformal efficiency metrics (interval width, singleton rate) correlate strongly with AD indices.
Hard-label classifiers supported. A novel ordinal distance strategy generates pseudo-probabilities compatible with standard conformal methods, extending the approach to classifiers that produce only hard labels.
Large-scale ready. Validated on over 100 VEGA QSAR models spanning physicochemical, toxicity, and environmental endpoints, and demonstrated on the EPA CompTox chemical inventory.

Repository structure

qubounds_clean/
├── src/qubounds/               # Core library
│   ├── mapie_regression.py     # Conformal regression: ExternalPredictor, train/predict functions
│   ├── mapie_class_lac.py      # Conformal classification: NCMProbabilisticClassifier, Least Ambiguous Set-valued Classifier (LAC) wrapper
│   ├── mapie_class_proba.py    # Classification variant using native predict_proba()
│   ├── mapie_diagnostic.py     # Sigma model factory, exchangeability tests, diagnostics
│   ├── descriptors/            # ECFP4 fingerprint computation with SQLite cache
│   └── vega/                   # VEGA-specific data loaders and utilities
│
├── tasks/                      # Ploomber pipeline tasks (notebooks-as-scripts)
│   ├── tutorial/               # Self-contained tutorial tasks
│   │   ├── load_dataset.py             # Download/load and split a dataset
│   │   ├── mapie_native.py             # Regression CP: internal + external model
│   │   ├── mapie_classification.py     # Classification CP: Approach with A. class probabilities + B. hard classifier
│   │   ├── mapie_class_external.py     # Classification with external hard-label predictions
│   │   ├── compare_variants.py         # Compare adaptive vs plain CP variants
│   │   └── ad_comparison.py            # Correlate CP width / set size with AD indices
│   │
│   ├── mapie.py                        # VEGA regression: train conformal models
│   ├── mapiec.py                       # VEGA classification: train conformal models
│   ├── mapie_apply.py                  # Apply trained CP models to new compound sets
│   ├── mapie_apply_plot.py             # Visualise application results
│   ├── mapie_plot.py                   # Regression summary plots (Spearman vs AD)
│   ├── mapie_plot_class.py             # Classification summary plots
│   ├── mapie_regression_analysis.py    # Aggregate regression analysis
│   ├── mapie_class_analysis.py         # Aggregate classification analysis
│   ├── conformal_regression_summary.py # Cross-model regression summary table
│   ├── conformal_classification_summary.py
│   ├── make_archive.py
│   └── vega/                           # VEGA-specific preparation tasks
│
├── resources/                  # Bundled example datasets
│   ├── BCF_MEYLAN.xlsx         # Bioconcentration factor (VEGA train/test split + ADI)
│   ├── HENRY_OPERA.xlsx        # Henry's law constant
│   ├── ZEBRAFISH_CORAL.xlsx    # Zebrafish developmental toxicity
│   ├── toxicity_against_t_pyriformis.csv   # T. pyriformis IGC50 (OCHEM)
│   ├── ames_levenberg.csv      # Ames mutagenicity (OCHEM)
│   └── vega_models_withclasses.xlsx        # VEGA model registry with endpoint metadata
│
├── pipeline.tutorial.yaml      # Tutorial pipeline definition (Ploomber)
├── pipeline.mapie.yaml         # Full conformal prediction pipeline definition
├── pipeline.preparevega.yaml   # VEGA data preparation pipeline
├── pipeline.preparecomptox.yaml# CompTox data preparation pipeline
├── pipeline.archive.yaml       # Archive/export pipeline
├── env.tutorial.yaml           # Dataset and parameter configuration for pipeline.tutorial.yaml
├── env.yaml                    # Full pipeline configuration (model lists, paths)
└── pyproject.toml

Conformal prediction: key concepts

Regression

Concept	Definition
Nonconformity score	`s(x,y) = \|y − ŷ\| / σ̂(x)` (adaptive) or `\|y − ŷ\|` (plain)
σ̂(x)	Sigma model: predicts local residual magnitude from ECFP4 fingerprints
q̂	Calibration quantile at level `⌈(n+1)(1−α)⌉/n`
Prediction interval	`ŷ ± q̂·σ̂(x)` (adaptive) or `ŷ ± q̂` (plain)
Coverage guarantee	Marginal coverage ≥ 1−α; holds under exchangeability
Efficiency metric	Mean interval width; narrowed by a better sigma model

The adaptive variant achieves molecule-specific interval widths: wider for structurally novel compounds, narrower for well-represented chemical space.

Classification

Two approaches are implemented and compared in the tutorial:

Approach A — LAC with model probabilities. Requires predict_proba() output. Conformity score: s = 1 − p̂(y_true | x). Gold standard when calibrated probabilities are available.

Approach B — NCM pseudo-probabilities (hard labels). For classifiers that produce only a hard label (the standard case for VEGA models). An auxiliary NCM classifier learns P(|y − ŷ| = d | x) for ordinal distances d = 0, 1, …. These are converted to class pseudo-probabilities:

P_pseudo(class=j | x, ŷ) = P(distance = |j − ŷ| | x)

Least Ambiguous Set-valued Classifier (LAC) is then applied to the pseudo-probabilities. Coverage is guaranteed regardless of NCM quality; NCM quality determines efficiency (singleton rate and mean set size).

Concept	Definition
Prediction set	`C(x) = {y : p̂(y\|x) ≥ 1 − q̂}`
Coverage guarantee	Marginal coverage ≥ 1−α
Efficiency metric	Mean set size; singleton rate (fraction of size-1 sets)

Toxicity Classes

Class	Description
1	Toxic-1 (< 1 mg/L)
2	Toxic-2 (1–10 mg/L)
3	Toxic-3 (10–100 mg/L)
4	Non-Toxic (> 100 mg/L)

Nonconformity measures (NCM codes)

The sigma/NCM model is specified by a code string in the pipeline configuration:

Code	Model	Task
`rlgbmecfp`	LightGBM regressor on ECFP4	Regression sigma
`rfecfp`	Random Forest regressor on ECFP4	Regression sigma
`knn2ecfp`	k-NN regressor on ECFP4	Regression sigma
`crfecfp`	Random Forest classifier (ordinal distances)	Classification NCM
`cgbecfp`	Gradient Boosting classifier (ordinal distances)	Classification NCM
`cknn2jecfp`	k-NN classifier with Jaccard (ordinal distances)	Classification NCM
`gbecfp`	Gradient Boosting regressor on ECFP4	Classification NCM (regression mode)

Installation

Requires Python ≥ 3.12. Dependencies are managed with UV.

git clone <repository_url>
cd qubounds

Key dependencies: mapie, scikit-learn, lightgbm, catboost, mord, rdkit, ploomber, pandas, plotly, statsmodels.

Quick start: tutorial pipeline

The tutorial pipeline demonstrates the full CP workflow on public and bundled datasets. It is the recommended as an entry point for new readers.

Run

ploomber build --entry-point pipeline.tutorial.yaml --env-file env.tutorial.yaml

Output notebooks and Excel files are written to products/tutorial/.

Tutorial datasets (`env.tutorial.yaml`)

Key	Dataset	Task	Source
`esol`	ESOL aqueous solubility	Regression	DeepChem / Delaney
`lipo`	Lipophilicity	Regression	DeepChem
`tpyriformis`	T. pyriformis IGC50	Regression + AD comparison	OCHEM (bundled)
`bcfmeylan`	BCF (Meylan)	Regression + AD comparison	VEGA (bundled)
`zebrafish_coral`	Zebrafish AC50	Regression + AD comparison	VEGA (bundled)
`henry_opera`	Henry's law constant	Regression + AD comparison	VEGA (bundled)
`bbbp`	Blood-brain barrier permeability	Classification	DeepChem
`clintox`	Clinical toxicity (FDA approval)	Classification	DeepChem
`ames_ochem`	Ames mutagenicity	Classification + AD comparison	OCHEM (bundled)

Tutorial task descriptions

Regression datasets pass through four tasks:

tutorial_load_regr_<dataset> — reads or downloads the dataset, applies train/test splitting (using a split_col if provided, otherwise random 80/20), writes Training and Test sheets to Excel and a JSON metadata file.
tutorial_native_<dataset>_<ncm> — fits an internal LightGBM base model + sigma model and runs split conformal regression (both adaptive and plain variants). Produces:
- sigma model diagnostics (R², scatter vs true residuals)
- calibration nonconformity score histograms and q̂ annotation
- prediction interval plots on sorted test set
- exchangeability KS test (calibration vs test NCM scores; p-value uniformity)
- coverage guarantee sweep across α levels
- NCM quality comparison (simulated poor/medium/good sigma models)
- AD correlation analysis (Spearman ρ, boxplots, stratified quintile tables) when ad_cols is configured
tutorial_external_<dataset>_<ncm> — identical workflow but reads predictions from file, simulating an external QSAR tool (VEGA, OCHEM, OPERA). AD comparison is embedded here because AD indices come from the same prediction file.
tutorial_compare_<dataset>_<ncm> — side-by-side comparison of native vs external variants: coverage, mean width, width distribution, per-molecule width reduction.

Classification datasets pass through analogous tasks:

tutorial_load_class_<dataset> — as above, with class-balance reporting.
tutorial_classify_<dataset>_<ncm> — fits LightGBM + Platt calibration (Approach A) and NCM ordinal classifier (Approach B). Produces:
- LAC score distributions per class
- pseudo-probability vs real probability comparison for example molecules
- prediction set visualisations (12 example molecules)
- head-to-head A vs B: coverage, mean set size, singleton rate
- NCM quality comparison
- exchangeability KS test
- coverage guarantee sweep across α levels
- per-class coverage table
- AD correlation analysis (set size and singleton rate vs AD indices) when configured
tutorial_classify_external_<dataset>_<ncm> — external hard-label variant (Approach B only).

Adding your own dataset

Edit env.tutorial.yaml. Minimum required fields:

tutorial_regr_datasets:
  mymodel:
    path: "path/to/predictions.xlsx"
    target_col: "Exp value"
    smiles_col: "Smiles"

To add external predictions and AD comparison:

    pred_col: "Predicted value"        # column with external model predictions
    split_col: "Status"                # column distinguishing train/test rows
    split_train_value: "TRAINING"
    split_test_value: ["TEST"]
    ad_cols: ["ADI", "Accuracy index"] # AD metric columns in the same file
    ad_col_directions: ["similarity", "similarity"]  # or "distance"
    n_quantile_bins: 5
    software: "VEGA"

ad_col_directions: similarity means higher AD value = more reliable (e.g. ADI); distance is the reverse (e.g. leverage).

Full Conformal prdiction with external QSAR models pipeline

The full pipeline (pipeline.mapie.yaml + env.yaml) trains conformal models for all VEGA endpoints and applies them to an external compound inventory (e.g. EPA CompTox).

The pipeline reads VEGA prediction form files ; does not run VEGA itself.
Therefore it is applicable to arbitrary QSAR models, provided the input files are in the required format.
Look at pipeline.prepare*.yaml for examples of conversion

Prepare VEGA input files (optional)

The data files can be downloaded from Zenodo https://doi.org/10.5281/zenodo.18444068.
The pipeline is here for completeness.

Copy env.preparevega.example.yaml to env.preparevega.yaml and configure:

VEGA_TRAINING_SETS: "path/to/vega/datasets"     # VEGA-exported training set files
VEGA_RESULT_FILES:  "path/to/vega/results"      # VEGA prediction output files
VEGA_QUBOUNDS_INPUT: "products/qubounds_input"  # staging area for pipeline input

Then:

ploomber build --entry-point pipeline.preparevega.yaml --env-file env.preparevega.yaml

Configure and run full pipeline

In env.yaml, set the compound inventory paths under config:

config:
  comptox:
    VEGA_REPORTS_INPUT: "path/to/vega/epacomptox"
    SMILES_INPUT: "path/to/epacomptox.txt"
    alpha: 0.1
    id: "DTXSID"

Then:

ploomber build --entry-point pipeline.mapie.yaml --env-file env.yaml

VEGA models covered

env.yaml lists 52 regression and 53 classification VEGA endpoints, including:

Regression: BCF (multiple models), LogP, melting point, water solubility, aquatic toxicity (algae, Daphnia, fish), carcinogenicity potency, Henry's law, KOC, KOA, LD50, LOAEL, NOAEL, vapour pressure, persistence (air/soil/water/sediment), zebrafish AC50.
Classification: aromatase, Ames mutagenicity, carcinogenicity, developmental toxicity, eye/skin irritation and sensitisation, estrogen/androgen/glucocorticoid/thyroid receptor activity, hepatotoxicity (NRF2, PPAR-α/γ, PXR), micronucleus (in vitro/in vivo), P-glycoprotein, persistence (sediment/soil/water), TPO.

Exchangeability and coverage diagnostics

The coverage guarantee holds under the exchangeability assumption: calibration and test compounds are drawn from the same input space. When this is violated (structural extrapolation), coverage may fall below the nominal level. The pipeline signals this through:

widening prediction intervals or shrinking singleton rates for out-of-domain compounds,
Kolmogorov-Smirnov tests comparing calibration and test nonconformity score distributions,
conformal p-value uniformity tests (p-values should be uniform under exchangeability).

These diagnostics are computed automatically in all tutorial and full pipeline tasks.

Pipeline outputs

All tasks write to products/:

Path	Contents
`products/tutorial/<task>/`	Per-dataset notebooks (`.ipynb`), Excel with predictions and metrics, PNG diagnostic plots
`products/qubounds_output/mapie/vega/regression/<ncm>/<model>/`	Trained conformal models (`.pkl`); per-model predictions and metrics (`.xlsx`)
`products/qubounds_output/mapie/vega/regression/summary.xlsx`	Cross-model summary: coverage, mean width, Spearman ρ vs AD indices
`products/qubounds_output/mapie/vega/class_lac/<ncm>/<model>/`	Classification conformal models and per-model results
`products/qubounds_output/mapie/vega/class_lac/summary.xlsx`	Cross-model classification summary
`products/comptox/<input_key>/regression/`	CP results applied to compound inventory (regression)
`products/comptox/<input_key>/class_lac/`	CP results applied to compound inventory (classification)

Each per-prediction Excel file includes: SMILES, true value (if available), model prediction, lower/upper interval bounds (regression) or per-class pseudo-probabilities and prediction sets (classification), interval width or set size, and coverage indicators.

Acknowledgement

🇪🇺 This project has received funding from the European Union's Horizon Europe research and innovation program under grant agreements 101092164 and 101130073.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

QU-bounds: Conformal Prediction for QSAR Models

Overview

Repository structure

Conformal prediction: key concepts

Regression

Classification

Toxicity Classes

Nonconformity measures (NCM codes)

Installation

Quick start: tutorial pipeline

Run

Tutorial datasets (`env.tutorial.yaml`)

Tutorial task descriptions

Adding your own dataset

Full Conformal prdiction with external QSAR models pipeline

Prepare VEGA input files (optional)

Configure and run full pipeline

VEGA models covered

Exchangeability and coverage diagnostics

Pipeline outputs

Acknowledgement

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 174 Commits
resources		resources
src/qubounds		src/qubounds
tasks		tasks
LICENSE		LICENSE
README.md		README.md
classification_demo.png		classification_demo.png
classification_exchangeability.png		classification_exchangeability.png
classification_exchangeability_classwise.png		classification_exchangeability_classwise.png
env.preparevega.example.yaml		env.preparevega.example.yaml
env.tutorial.yaml		env.tutorial.yaml
env.yaml		env.yaml
logo.svg		logo.svg
pipeline.archive.yaml		pipeline.archive.yaml
pipeline.mapie.yaml		pipeline.mapie.yaml
pipeline.preparecomptox.yaml		pipeline.preparecomptox.yaml
pipeline.preparevega.yaml		pipeline.preparevega.yaml
pipeline.tutorial.yaml		pipeline.tutorial.yaml
pyproject.toml		pyproject.toml
regression_demo.png		regression_demo.png
regression_exchangeability.png		regression_exchangeability.png

Folders and files

Latest commit

History

Repository files navigation

QU-bounds: Conformal Prediction for QSAR Models

Overview

Repository structure

Conformal prediction: key concepts

Regression

Classification

Toxicity Classes

Nonconformity measures (NCM codes)

Installation

Quick start: tutorial pipeline

Run

Tutorial datasets (env.tutorial.yaml)

Tutorial task descriptions

Adding your own dataset

Full Conformal prdiction with external QSAR models pipeline

Prepare VEGA input files (optional)

Configure and run full pipeline

VEGA models covered

Exchangeability and coverage diagnostics

Pipeline outputs

Acknowledgement

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Tutorial datasets (`env.tutorial.yaml`)

Packages