NPstereo

Assigning the Stereochemistry of Natural Products by Machine Learning

NPstereo is a transformer-based language model that assigns stereochemistry (tetrahedral R/S, double bond E/Z) to natural products (NPs) from their SMILES representations. It was trained on referenced data from the open-access COCONUT database and can complete missing stereocenters in partially assigned or fully unassigned NP structures.

This repository contains:

Data extraction and augmentation scripts
Model training and evaluation code
Utilities for running predictions on new molecules

Overview

Nature shows strong stereochemical regularities, such as L-amino acids in proteins and D-sugars in nucleotides. NPstereo investigates whether similar patterns exist in other NPs and whether they can be machine-learned.

Key features:

Predicts stereochemistry directly from SMILES
Supports full assignment and completion of partially assigned stereochemistry
Achieves 80.2% per-stereocenter accuracy for full assignment and 85.9% for partial assignment on canonical SMILES
Works across diverse NP classes (alkaloids, polyketides, lipids, terpenes, peptides, glycosides, etc.)

For details, see the manuscript!

Installation

Clone the repository:

git clone https://github.com/reymond-group/NPstereo.git
cd NPstereo

Create and activate a conda environment:

conda create -n npstereo python=3.10
conda activate npstereo

Install dependencies:

pip install -r requirements.txt

Data

Training and evaluation data are derived from the COCONUT database, filtered for entries with at least one DOI reference.
We include:

Scripts to extract and process the dataset (data/)
Data augmentation methods:
- SMILES randomization
- Partial stereochemistry removal
Download links to the processed datasets on Zenodo. The data are too large to host on GitHub, but the dataset can be reproduced from scratch by following the code.
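To illustrate the partial stereochemistry removal listed above, here is a minimal text-level sketch using only the Python standard library. The function name `remove_stereo` and its `fraction` and `seed` parameters are hypothetical and not taken from the repository, which may well use a cheminformatics toolkit instead. The sketch drops the `@`/`@@` tag of bracket atoms and the `/` and `\` bond marks, each with a given probability.

```python
import random
import re

# Matches either a bracket atom such as [C@@H] or [NH3+] (group "atom")
# or a directional bond symbol / or \ (group "bond").
TOKEN = re.compile(r"(?P<atom>\[[^\]]+\])|(?P<bond>[/\\])")


def remove_stereo(smiles, fraction=1.0, seed=None):
    """Drop stereo marks from a SMILES string (hypothetical helper).

    Every stereo mark is removed independently with probability `fraction`.
    fraction=1.0 removes all marks, fraction=0.0 removes none.
    Caveats: the result is valid SMILES but not canonical ([C@@H] becomes
    [CH]), and removing only one mark of a / \\ pair leaves that double
    bond effectively unassigned.
    """
    rng = random.Random(seed)

    def _sub(match):
        atom = match.group("atom")
        if atom is not None:
            # Only chiral bracket atoms carry an @ tag.
            if "@" in atom and rng.random() < fraction:
                return atom.replace("@", "")
            return atom
        # Directional bond mark: delete it or keep it unchanged.
        return "" if rng.random() < fraction else match.group("bond")

    return TOKEN.sub(_sub, smiles)


if __name__ == "__main__":
    print(remove_stereo("C[C@@H](N)C(=O)O"))          # all marks removed
    print(remove_stereo("C[C@H](O)[C@@H](N)C", 0.5, seed=0))  # some removed
```

Passing a fixed `seed` makes the augmentation reproducible, which helps when regenerating the same training split.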

Model Training

Models are trained using OpenNMT-py.

Example training command:

onmt_build_vocab -config configs/npstereo.yaml -n_sample -1
onmt_train -config configs/npstereo.yaml
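OpenNMT-py reads plain-text source and target files in which tokens are separated by whitespace, so SMILES strings have to be tokenized before building the vocabulary. The sketch below uses the regular expression that is commonly used for SMILES translation models. Whether this repository uses exactly this tokenizer is an assumption, and the function names `tokenize` and `detokenize` are illustrative.

```python
import re

# Widely used SMILES tokenization pattern: bracket atoms are kept whole,
# two-letter halogens stay together, every other symbol is one token.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
)


def tokenize(smiles):
    """Return the SMILES as a space-separated token string."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Guard against characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize SMILES: {smiles!r}")
    return " ".join(tokens)


def detokenize(line):
    """Invert tokenize() by removing the separating spaces."""
    return line.replace(" ", "")


if __name__ == "__main__":
    print(tokenize("C[C@@H](N)C(=O)O"))
```

Keeping bracket atoms such as `[C@@H]` as single tokens means that each tetrahedral assignment is predicted as one vocabulary item.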

Key model variants:

C1 – baseline (canonical SMILES only)
A2–A50 – canonical + randomized SMILES
NPstereo – partially assigned → canonical
M65 – combined augmentation (randomization + partial assignment)

Running Predictions

To predict stereochemistry for new molecules:

python predict.py \
  --model seed-0/npstereo/npstereo_step_100000.pt \
  --src input_smiles.txt \
  --output predictions.txt

Input: SMILES without full stereochemistry, either with no stereochemical information ("absolute" SMILES) or partially assigned.
Output: Predicted isomeric SMILES with stereochemistry assigned.
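Because the model should only add stereo marks, a quick sanity check is to confirm that each prediction matches its input once those marks are ignored. The following is a rough, text-level sketch of such a check; `flat_key` and `same_skeleton` are hypothetical helpers that are not part of the repository. It assumes the prediction keeps the atom order of the input, so a mismatch can also mean a reordered but chemically identical SMILES, which only a cheminformatics toolkit can resolve.

```python
import re

BRACKET = re.compile(r"\[([^\]]+)\]")
# Bracket bodies that can be written without brackets once the chirality
# tag is gone, e.g. "CH" (from [C@@H]) or "C" (from [C@]).
SIMPLE = re.compile(r"(Cl|Br|[BCNOPSFI])H?")


def flat_key(smiles):
    """Return a text key for the SMILES that ignores stereo marks."""

    def _atom(match):
        body = match.group(1)
        if "@" not in body:
            return match.group(0)  # non-chiral bracket atom: keep as is
        body = body.replace("@", "")
        simple = SIMPLE.fullmatch(body)
        return simple.group(1) if simple else f"[{body}]"

    # Directional bond marks carry only E/Z information.
    no_bonds = smiles.replace("/", "").replace("\\", "")
    return BRACKET.sub(_atom, no_bonds)


def same_skeleton(source, prediction):
    """True if source and prediction differ only in stereo marks."""
    return flat_key(source) == flat_key(prediction)


if __name__ == "__main__":
    print(same_skeleton("CC(N)C(=O)O", "C[C@H](N)C(=O)O"))
```

Predictions that fail this check are worth inspecting by hand before they are used further.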

For best performance:

Use canonical SMILES as input for NPstereo
For non-canonical SMILES, use models trained with randomization (A50 or M65)

Evaluation

We provide evaluation scripts in evaluation/ to reproduce the metrics reported in the manuscript:

SMILES validity
Full-assignment accuracy
Per-stereocenter accuracy
Class-wise breakdown by NP structural category
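As a rough illustration of how the two accuracy metrics differ, the sketch below scores predictions against reference isomeric SMILES at the text level. It is not the code in evaluation/. It assumes that reference and prediction list their atoms in the same order, it scores tetrahedral centres only (E/Z bond marks are ignored), and it treats full-assignment accuracy as an exact string match. All function names are illustrative.

```python
import re

# A token is either a whole bracket atom or a single character.
TOKEN = re.compile(r"\[[^\]]+\]|.")


def chirality_tags(smiles):
    """Return one entry per token: '@', '@@' or '' (no tetrahedral tag)."""
    flat = smiles.replace("/", "").replace("\\", "")
    tags = []
    for token in TOKEN.findall(flat):
        if token.startswith("["):
            tags.append("@" * token.count("@"))
        else:
            tags.append("")
    return tags


def score(reference, prediction):
    """Return (correct, total) over the tetrahedral centres of reference."""
    ref, pred = chirality_tags(reference), chirality_tags(prediction)
    total = sum(1 for tag in ref if tag)
    if len(ref) != len(pred):
        return 0, total  # not alignable: count every centre as wrong
    correct = sum(1 for r, p in zip(ref, pred) if r and r == p)
    return correct, total


def evaluate(pairs):
    """Aggregate both metrics over (reference, prediction) pairs."""
    correct = total = exact = count = 0
    for reference, prediction in pairs:
        c, t = score(reference, prediction)
        correct += c
        total += t
        exact += int(reference == prediction)
        count += 1
    return {
        "per_stereocenter": correct / total if total else 0.0,
        "full_assignment": exact / count if count else 0.0,
    }


if __name__ == "__main__":
    print(evaluate([("C[C@H](O)[C@@H](N)C", "C[C@H](O)[C@H](N)C")]))
```

A molecule with one wrong centre out of many still contributes its correct centres to the per-stereocenter score but counts as a failure for full assignment, which is why the first metric is always at least as high as the second on the same predictions.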

License

MIT

