Skip to content

A Benchmarking Platform for Realistic And Practical Inverse Molecular Design

Notifications You must be signed in to change notification settings

aspuru-guzik-group/Tartarus

Repository files navigation

Tartarus: Practical and Realistic Benchmarks for Inverse Molecular Design

This repository contains the code and results for the paper Tartarus, an open-source collection of benchmarks for evaluation of a generative model.

Total installation time: ~15-20mins.

Installing XTB and CREST

The task of designing organic photovoltaics and emitters will require the use of XTB, a program package of semi-empirical quantum mechanical methods, and CREST, a utility of xtb used to sample molecular conformers.

The binaries are provided in here. Place in home directory, and the software can be sourced using

export XTBHOME=${HOME}/xtb
export PATH=${PATH}:${XTBHOME}/bin
export XTBPATH=${XTBHOME}/share/xtb:${XTBHOME}:${HOME}
export MANPATH=${MANPATH}:${XTBHOME}/share/man

Installing SMINA

The task of designing molecules that dock to proteins requires the use of SMINA, a method for calcualte docking scores of ligands onto solved structures (proteins). The binary file is already included in the repository, in tartarus/docking_structures/smina.static.

Packages required

Use python >= 3.8. We recommend using a conda environment for the installation of

  • rdkit >= 2021.03.3
  • xtb-python >= 20.1
  • openbabel == 3.1.1

Required packages:

  • numpy >= 1.22.3
  • pandas >= 1.4.3
  • torch == 1.12.0
  • pyscf == 2.0.1
  • morfeus-ml >= 0.7.1
  • geometric == 0.9.7.2
  • pyberny == 0.6.3
  • loguru == 0.6.0
  • geodesic-interpolate == 1.0.0
    (pip install -i https://test.pypi.org/simple/ geodesic-interpolate)
  • polanyi == 0.0.1
    (pip install git+https://github.com/kjelljorner/polanyi)

Datasets

All datasets are found in the datasets directory. The arrows indicate the goal (↑ = maximization, ↓ = minimization).

Task Dataset name # of smiles Columns in file
Designing OPV hce.csv 24,953 PCEPCBM -SAS (↑) PCEPCDTBT -SAS (↑)
Designing emitters gdb13.csv 403,947 Singlet-triplet gap (↓) Oscillator strength (↑) Multi-objective (↑)
Designing drugs docking.csv 152,296 1SYH (↓) 6Y2F (↓) 4LDE (↓)
Designing chemical reaction substrates reactivity.csv 60,828 Activation energy ΔE (↓) Reaction energy ΔEr (↓) ΔE + ΔEr (↓) - ΔE + ΔEr (↓)

Getting started

Below are some examples of how to load the datasets and use the fitness functions. For more details, you can also look at example.py.

Designing organic photovoltaics

To use the evaluation function, load either the full xtb calculation from the pce module, or use the surrogate model, with pretrained weights.

import pandas as pd
data = pd.read_csv('./datasets/hce.csv')   # or ./dataset/unbiased_hce.csv
smiles = data['smiles'].tolist()
smi = smiles[0]

## use full xtb calculation in hce module
from tartarus import pce
dipm, gap, lumo, combined, pce_pcbm_sas, pce_pcdtbt_sas = pce.get_properties(smi)

## use pretrained surrogate model
dipm, gap, lumo, combined = pce.get_surrogate_properties(smi)

Designing Organic Emitters

Load the objective functions from the tadf module. All 3 fitness functions are returned for each smiles.

import pandas as pd
data = pd.read_csv('./datasets/gdb13.csv')  
smiles = data['smiles'].tolist()
smi = smiles[0]

## use full xtb calculation in hce module
from tartarus import tadf
st, osc, combined = tadf.get_properties(smi)

Design of drug molecule

Load the docking module. There are separate functions for each of the proteins, as shown below.

import pandas as pd
data = pd.read_csv('./datasets/docking.csv')  
smiles = data['smiles'].tolist()
smi = smiles[0]

## Design of Protein Ligands 
from tartarus import docking
score_1syh = docking.get_1syh_score(smi)
score_6y2f = docking.get_6y2f_score(smi)
score_4lde = docking.get_4lde_score(smi)

Design of Chemical Reaction Substrates

Load the reactivity module. All 4 fitness functions are returned for each smiles.

import pandas as pd
data = pd.read_csv('./datasets/reactivity.csv')  
smiles = data['smiles'].tolist()
smi = smiles[0]

## calculating binding affinity for each protein
from tartarus import reactivity
Ea, Er, sum_Ea_Er, diff_Ea_Er = reactivity.get_properties(smi)

Results

Our results for running the corresponding benchmarks can be found here:

Questions, problems?

Make a github issue 😄. Please be as clear and descriptive as possible. Please feel free to reach out in person: (akshat98[AT]stanford[DOT]edu, robert[DOT]pollice[AT]gmail[DOT]com)

License

Apache License 2.0