Skip to content

Spec for undefined stereochemistry #146

@j-wags

Description

@j-wags

Purpose

Because it's a complicated topic, and to maintain the ability to restart this discussion in the future, I think it's important to record our design decisions with regard to stereochemistry in this repository.

Background

Stereochemical differences are important in the SMIRNOFF spec, as they can lead to different partial charges or partial bond orders on a molecule, and (if the project goes in that direction), different assignment of library parameters using stereochemistry-dependent SMARTS.

However, the definition of meaningful stereochemistry can change depending on experimental conditions and timescale. For example a tertiary amine will invert 10^3 to 10^5 times per second in solution. Therefore, on a wet-lab-experiment timescale, the amine does not have meaningful stereochemistry. However, if one is running a QM calculation on the tertiary amine, there may be meaningful differences in the results depending on its stereochemistry. Thus, whether or not stereochemistry is required in a tertiary amine depends on your use case.

Our definition of stereochemistry will ideally cover the fact that MD simulations can run for micro- and milli-seconds, therefore molecular features that are likely to change stereochemistry on those timescales should not be assigned stereochemistry-specific properties. It is unclear currently whether this decision should be made at the point of SMIRNOFF molecule definition (ie, we enforce that short-lived stereochemical features do not distinguish one OFF molecule from another, thereby preventing our engine from assigning them different simulation parameters), or if the SMARTS that define our parameter libraries should be policed moving forward, to ensure that no two parameters are differentiated only by low-barrier stereochemistry (eg, we will never create a SMARTS that relies on the stereochemistry of a tertiary amine).

Current implementation plans

Read below for details, but I think a good plan is to let the toolkits use their own (slightly different!) standards in choosing how to accept SMILES. While they mostly agree with what "fully specified stereochemistry" is, they differ on some borderline cases.

While this isn't ideal, ensuring the same stereochemistry standard across toolkits will require a lot of engineering, and will restrict us to the union of the most restrictive stereochemistry standards of all of our toolkits (eg. OE omega complains if bond stereo isn't defined for CC=NH, which is a bit much).

3D molecule input

We haven't run into too many problems with sdf and mol2 input. OE and RDK currently use their default stereochemistry detection in these cases and seem to rarely have problems. I haven't dug very deeply into these, so I will update this post as I look further into it.

Potentially relevant to 3D stereochemistry detection: #141, #137

SMILES input

"Common sense" atom and double bond stereochemistry is required, as is implemented by default in OE and RDK. There are, however, a few cases where disagreement occurs, both between OE and RDK, and between a toolkit and our intuition.

This table records the cases of obvious and border-case stereochemistry we've found so far, whether we think molecules should labeled as having undefined stereochemistry (on the MD timescale), and whether the toolkits define the SMILES as having undefined stereochemistry. If applicable, experimental barriers/interconversion rates from literature should be added.


SMILES Description Humans say it has undefined stereochem? OE says it has undefined stereochem? RDK says it has undefined stereochem? barrier conversion rate
C(Cl)(F)(Br)C Undefined tetrahedral center Yes Yes Yes
C(F)=C(Cl) Undefined C=C double bond Yes Yes Yes
C\C(F)=C(/F)CC(C)(Cl)Br Undefined stereocenter, defined stereobond Yes Yes Yes
CC(F)=C(F)C[C@@](C)(Cl)Br Defined stereocenter, undefined stereobond Yes Yes Yes
C\C(F)=C(/F)C[C@@](C)(Cl)Br defined stereocenter, defined stereobond No No No
CN=C(F)C Undefined imine Yes Yes Yes > 40 kcal/mol
N=C(F)C Undefined primary imine No Yes No
N(Cl)(F)C Undefined pyrimidal nitrogen No Optional [1] No 10^3 - 10^5 sec^-1
CP(O)(=O)N tetrahedral phosphate with one protonated oxygen Yes Yes Yes
CNC(C)=O keto-form of amide bond No No No
CN=C(C)O enol-form of amide bond [2] Yes Yes
C(=[O+][H])C oximium ? Yes No
N(C)(CC)(CCC)=O ? Yes? Yes No
C[C@H]1CN1C, may have tested using different method trivalent N in 3-membered ring Yes, as of 2019.3 Yes Yes
C[C@H]1CN1, may have tested using different method trivalent N in 3-membered ring (1 hydrogen) ? No? Yes
[H]c1c(c(c(c(c1[H])[H])P2C(=C3C(=C2c4c(c(c(c(n4)[H])[H])[H])[H])C(C(C(C3([H])[H])([H])[H])([H])[H])([H])[H])c5c(c(c(c(n5)[H])[H])[H])[H])[H])[H] trivalent phosphorous ? No Yes considered stereogenic starting in RDK 2020.3 release rdkit/rdkit#2949
CCS(=O)C thione ? ? ? ? ?
COP(=O)([O-])OCC deprotonated NA phosphate backbone N ? ? ? deprotonated at physiological pH (pKa=2ish)

[1] OE says "Yes" when using method A, but when "No" when method B is given enum_nitrogens=False.
[2] I'm not certain what the rotation timescale for amide bonds is, but this might just be in molecular dynamics range. Letting the enol form of the amide get stereochemistry might give users the choice of whether they want it defined.

Code to reproduce the above

from rdkit import Chem 
from rdkit.Chem import EnumerateStereoisomers

# Define the SMILESes, 
smiles_ambiguous = [
                    ('C(Cl)(F)(Br)C', True), # Undefined tetrahedral center
                    ('C(F)=C(Cl)', True), # Undefined RC=CR 
                    ('C(F)(Br)=C(Cl)(I)', True), # Undefined RC=CR 
                    ("C\C(F)=C(/F)CC(C)(Cl)Br", True), # Undefined stereocenter, defined stereobond
                    ("CC(F)=C(F)C[C@@](C)(Cl)Br", True), # Defined stereocenter, undefined stereobond
                    ("C\C(F)=C(/F)C[C@@](C)(Cl)Br", False), # defined stereocenter, defined stereobond
                    ('CN=C(F)C', True), # RC=NR
                    ('N=C(F)C', False), # RC=NH
                    ('N(Cl)(F)C', False), #Pyrimidal nitrogen center
                    ('CP(O)(=O)N', True), #phosphate
                    ('CNC(C)=O', False), # keto form of amide
                    ('CN=C(C)O', False), #enol form of peptide
                    ('C(=[O+][H])C', False), # charged oxygen
                    ('N(C)(CC)(CCC)=O', True), # ?
                  ]


def rdkit_unspec_stereo(smiles):
    from rdkit import Chem
    from rdkit.Chem import EnumerateStereoisomers

    rdmol = Chem.MolFromSmiles(smiles)

    # Adding H's can hide undefined bond stereochemistry, so we have to test for undefined stereo here
    unspec_stereo = False
    rdmol_copy = Chem.Mol(rdmol)
    enumsi_opt = EnumerateStereoisomers.StereoEnumerationOptions(maxIsomers=2, onlyUnassigned=True)
    stereoisomers = [isomer for isomer in Chem.EnumerateStereoisomers.EnumerateStereoisomers(rdmol_copy, enumsi_opt)]
    if len(stereoisomers) != 1:
        unspec_stereo = True

    if unspec_stereo:
        return True
    return False

def oe_unspec_stereo_A(smiles):
    molecule = oechem.OEMol()
    oechem.OESmilesToMol(molecule, smiles)
    atoms_to_highlight = list()
    bonds_to_highlight = list()
    for atom in molecule.GetAtoms():
        if atom.IsChiral() and not atom.HasStereoSpecified(oechem.OEAtomStereo_Tetrahedral):
            # Check if handness is specified
            v = []
            for nbr in atom.GetAtoms():
                v.append(nbr)
            stereo = atom.GetStereo(v, oechem.OEAtomStereo_Tetrahedral)
            if stereo == oechem.OEAtomStereo_Undefined:
                atoms_to_highlight.append(atom)
    for bond in molecule.GetBonds():
        if bond.IsChiral() and not bond.HasStereoSpecified(oechem.OEBondStereo_CisTrans):
            v = []
            for neigh in bond.GetBgn().GetAtoms():
                if neigh != bond.GetEnd():
                    v.append(neigh)
                    break
            for neigh in bond.GetEnd().GetAtoms():
                if neigh != bond.GetBgn():
                    v.append(neigh)
                    break
            stereo = bond.GetStereo(v, oechem.OEBondStereo_CisTrans)

            if stereo == oechem.OEBondStereo_Undefined:
                bonds_to_highlight.append(bond)
    if len(atoms_to_highlight+bonds_to_highlight) == 0:
        return False
    return True
    

def oe_unspec_stereo_B(smiles):
    from openeye import oechem
    from openeye import oeomega
    maxcenters = 5
    force_flip = False
    enum_nitrogens = False
    mol = oechem.OEMol()
    if not oechem.OESmilesToMol(mol, smiles):
        print(s, "Not Parsed")
    n_isomers = 0
    for isomer in oeomega.OEFlipper(mol, maxcenters, force_flip, enum_nitrogens):
        n_isomers += 1
    if n_isomers == 1:
        return False
    elif n_isomers > 1:
        return True


method_list = [rdkit_unspec_stereo, oe_unspec_stereo_A, oe_unspec_stereo_B]
for smiles, should_be_ambiguous in smiles_ambiguous:
    method_results = [method(smiles) for method in method_list]
    print(smiles, should_be_ambiguous, method_results[0], method_results[1], method_results[2])

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions