Ensure atom names are unique after construction #469

jchodera · 2019-11-30T19:20:20Z

This PR adds a Molecule._validate() method that is called after all molecule construction events as a way to run any code required to standardize or otherwise validate the constructed Molecule object.

This PR also fixes #460 by ensuring that unique atom names are generated at this time if the specified atom names are not unique. Note that this may cause atom names to change if some were defined, but not all were unique.
This is done via addition of Molecule.has_unique_atom_names property and a Molecule.generate_unique_atom_names() public method.

There are some alternative strategies discussed in #460, but as #460 (comment) points out, if we're going to be standardizing on SDF files (which lack atom names), we need to do something to ensure unique atom names are generated by the time we need to use them.

Additionally, two tests were added:

A SMILES round-trip test, which discovered that it is impossible to do this for molecules that are perceived as lacking stereochemistry
A test for all new added methods

Tag issue being addressed
Add tests
Update docstrings/documentation, if applicable
Update changelog

The failing test here also involves a phosphorous containing molecule ([H]C([H])([H])P(=O)([H])[O-]) suggesting #467 extends to phosphorous as well.

…lowing construction; ensure atom names are unique after construction

jchodera · 2019-11-30T19:27:03Z

Correction: The minidrugbank phosphorous-containing molecule actually has a stereocenter because its substituents are different:

With the RDKit backend:

>>> Molecule.from_smiles('[H]C([H])([H])P(=O)([H])[O-]')
---------------------------------------------------------------------------
UndefinedStereochemistryError             Traceback (most recent call last)
<ipython-input-14-9e8e428709fa> in <module>
----> 1 Molecule.from_smiles('[H]C([H])([H])P(=O)([H])[O-]')

~/miniconda/lib/python3.7/site-packages/openforcefield/topology/molecule.py in from_smiles(smiles, hydrogens_are_explicit, toolkit_registry, allow_undefined_stereo)

~/miniconda/lib/python3.7/site-packages/openforcefield/utils/toolkits.py in call(self, method_name, *args, **kwargs)

~/miniconda/lib/python3.7/site-packages/openforcefield/utils/toolkits.py in from_smiles(self, smiles, hydrogens_are_explicit, allow_undefined_stereo)

~/miniconda/lib/python3.7/site-packages/openforcefield/utils/toolkits.py in _detect_undefined_stereo(cls, rdmol, err_msg_prefix, raise_warning)

UndefinedStereochemistryError: Unable to make OFFMol from RDMol: Unable to make OFFMol from SMILES: RDMol has unspecified stereochemistry. Undefined chiral centers are:
 - Atom P (index 4)

Perhaps we shouldn't be using minidrugbank, or we should filter out the compounds that contain unspecified stereochemistry?

jchodera · 2019-12-02T00:27:02Z

The remaining test failures appear to be due to the SMILES round-trip tests failing because we have some molecules in minidrugbank that have unspecified stereochemistry, yet we load them all in for testing.

What approach should we take with these? Should we skip molecules with unspecified stereochemistry in our round-trip tests?

j-wags

Sorry for delay, this somehow fell off my radar last week.

These changes look technically solid. My one major question is whether we want unique naming to always happen, or to be something that the user chooses to apply (they could call offmol.generate_unique_atom_names manually if they need it).

I think that making the user call it manually is more in line with our philosophy, since

We will in some cases surprise users by overwriting their atom names, since we never told them they need to be unique (and who knows why they named the atoms the way they did)
Atom names have no functional effect on how we parametrize a system, so enforcing rules/behavior about them is misleading at best
A user could "invalidate" a molecule by loading it (so it has unqiue atom names), and then setting one atom's name equal to another's. Then the molecule would fail a round-trip test.

openforcefield/topology/molecule.py

docs/releasehistory.rst

openforcefield/tests/test_forcefield.py

jchodera · 2019-12-14T00:31:16Z

I think that making the user call it manually is more in line with our philosophy

I'm worried that our philosophy is trending toward "annoy the hell out of the user by preventing them from ever doing anything easily with lots of small molecules". As it is right now, the user cannot use our tools to, say,

Process a bunch of SMILES strings that might contain undefined stereochemistry, or even any trivalent sulfur atoms because of presumed stereochemistry atoms
Process any SDF files because these lack atom names, even though we have standardized on SDF files
Process a bunch of molecules from OpenEye or RDKit that may not have unique atom names

I agree that the philosophy of "raise an informative Exception subclass object for each error type that the user must correct" can be an effective way to make sure the user is informed of each and every criteria their molecules must meet to be properly processed, but I can guarantee that this is going to cause a great deal of frustration.

Is there some way we can provide both a "paranoid mode" (where we ensure that we are typing exactly the molecule the user wants) and some sort of "just process these molecules mode" where molecules that don't pass our strict criteria are either transformed into a processable form or are dropped from the list?

Even dropping may be problematic, because---as was pointed out previously---we are standardizing on SDF files, which do not have atom names, which I believe makes it impossible to transform the resulting system to AMBER/gromacs with ParmEd.

If we were following the "principle of least surprise", I'd probably suggest the behavior where missing atom names are filled in silently with unique atom names that do not collide with any existing atom names, but an exception is raised if two user-specified atom names are identical. That would allow processing of molecules without issue in a manner that would not change any atom names the user specified, but would raise an exception if the user-specified atom names caused a problem. Otherwise, they'd be filled in so things just work.

jchodera · 2019-12-14T00:34:52Z

I noted this above, but it's important enough to state again here:

Once we ditch ParmEd, it will no longer be necessary to generate unique atom names. But until we replace it, we're stick with the need to do so if we want our parameters to be deliverable to different molecular simulation packages.

That means that this needs to either happen to all molecules we load into the toolkit, or it means we need to create more code to manage the transformation to other packages that does this silently behind the scenes. That may be an alternative to this approach that is workable!

j-wags · 2019-12-14T01:47:49Z

I wrote a lot, but it boils down to this:

If we were following the "principle of least surprise", I'd probably suggest the behavior where missing atom names are filled in silently with unique atom names that do not collide with any existing atom names, but an exception is raised if two user-specified atom names are identical. That would allow processing of molecules without issue in a manner that would not change any atom names the user specified, but would raise an exception if the user-specified atom names caused a problem. Otherwise, they'd be filled in so things just work.

I think we should do this, at the time of system creation.

So, either we

make a new function, create_parmed_structure, which runs your quoted algorithm after create_openmm_system to make the resulting system parmed-friendly; or
we do this by default in create_openmm_system, since I don't think atom names have a functional effect in OpenMM, so no harm done.

j-wags · 2019-12-14T03:23:28Z

To be a little less brief on my reasoning: I think we need to keep our internal data structures clear of ParmEd's influence. The above discussion was driving us to choose a group of users who would "lose". Either:

People who want to export to ParmEd are inconvenienced by the need to run offmol.generate_atom_names; or
A person who uses OFFTK to generate conformers for alanine dipeptide puts in a molecule with two CAs and gets back conformers with C2 and C5 in their place

When really, we can just push this battle outside our borders -- Make the users identify where these parameters are going, and apply naming that works for ParmEd if that's the destination.

And maybe the second option doesn't seem too bad, but the slope toward ParmEd-ness becomes no less steep if we accept that behavior. We just get into the "if you give a moose a muffin" situation -- If we rename the alanine atoms above, users will say "well, they're just in different residues, OFFTK should allow that". Then we'll be pushed to enforce some system of rules around atom names and residue names/numbers, in a toolkit made specifically not to care about atom types or topological residues. People will make workflows assuming the atom names we assign, and contort their inputs to satisfy the residue rules we enforce, and then we'll have this large set of expected behaviors causing friction on further developments (do virtual sites also need unique names? do we support chains?).

On a separate note:

I agree that the undefined stereochemistry thing is ugly and awkward. It may not be worth it on some fronts. I'd be open to arguments that we should just drop it for trivalent nitrogen and sulfur -- We could say stereochemistry that exchanges more frequently than once per millisecond is not stereochemistry at all for the purposes of atomistic simulation. This will be made easier by breaking out a OFF-core package, where we can keep all the code that enforces our specific stereochemistry definition in one single place.

jchodera · 2019-12-18T05:17:28Z

@j-wags : I like your suggestions in #469 (comment)! I'd be fine with this, but don't have the bandwidth to re-work this PR anytime soon. Should be an easy fix, though!

The only thing I'd caution about is that the wider the separation between doing the thing that triggers an error (converting a molecule with defective atom names) and the error (creating a system), the harder it is for our users to identify where the error was actually committed! So we have to help them along as much as possible by being maximally informative in our error messages.

I agree that the undefined stereochemistry thing is ugly and awkward. It may not be worth it on some fronts. I'd be open to arguments that we should just drop it for trivalent nitrogen and sulfur -- We could say stereochemistry that exchanges more frequently than once per millisecond is not stereochemistry at all for the purposes of atomistic simulation. This will be made easier by breaking out a OFF-core package, where we can keep all the code that enforces our specific stereochemistry definition in one single place.

I still like the idea of an openforcefield-core package!

Another approach that would make our lives easier is to just say we're not going to care about stereochemistry in charge models, and that any model we ultimately adopt would have to somehow average over all possible stereochemical configurations. That would make the cheminformatics easy, but it may ultimately limit accuracy.

jchodera · 2020-01-08T21:29:06Z

@j-wags: How should we proceed here?

We're not going to be able to use SDF files and ParmEd to convert to other packages unless we implement something like this now, and I'd love to get this finished before the master branch diverges too much.

jchodera · 2020-02-03T23:38:55Z

@j-wags: How should we proceed here?

j-wags · 2020-02-04T00:12:30Z

@j-wags: How should we proceed here?

I'll:

~~change create_openmm_system to assign atom names, using your algorithm in this PR~~
disable validation for all molecules, but keep Molecule.generate_unique_atom_names and Molecule.has_unique_atom_names and their tests.
cherry pick 14c1d2f to get the SMILES roundtrip tests working

…tom-names

j-wags · 2020-02-04T03:39:08Z

Oh, OpenMM Systems don't have atom names. Only OpenMM Topologys do. So a to-do item is

add kwarg ensure_unique_atom_names to Topology.to_openmm, using your algorithm in this PR

…o diffs

j-wags

LGTM.

codecov-io · 2020-02-04T22:55:46Z

Codecov Report

Merging #469 into master will not change coverage.
The diff coverage is n/a.

jchodera added 2 commits November 30, 2019 14:06

Add Molecule._validate() to enforce any necessary standardization fol…

d790ee6

…lowing construction; ensure atom names are unique after construction

Update docs

3e45365

jchodera requested a review from j-wags November 30, 2019 19:20

Update PR number

28b595d

Fix failing test that was expecting empty atom names

347a412

j-wags reviewed Dec 14, 2019

View reviewed changes

openforcefield/topology/molecule.py Show resolved Hide resolved

docs/releasehistory.rst Outdated Show resolved Hide resolved

openforcefield/tests/test_forcefield.py Show resolved Hide resolved

j-wags added 5 commits February 3, 2020 16:17

Merge branch 'master' into unique-atom-names

bf6d71a

add ensure_unique_atom_names kwarg to Topology.to_openmm

07f3e3f

remove uses of Molecule._validate since that is not desired in all cases

aa66134

add tests for ensure_unique_atom_names kwarg in Topology.to_openmm

1a956d5

Merge remote-tracking branch 'origin/unique-atom-names' into unique-a…

0ee3dec

…tom-names

j-wags added 4 commits February 3, 2020 19:40

separate smiles roundtrip tests by toolkit, consolidate toolkit stere…

6cb9f62

…o diffs

update release notes for new behavior

278f480

update release notes PR link number

dbc1451

update test_molecule.test_unique_atom_name based on new behavior

26cfb1e

j-wags changed the title ~~[WIP] Ensure atom names are unique after construction~~ Ensure atom names are unique after construction Feb 4, 2020

j-wags self-requested a review February 4, 2020 22:50

Merge branch 'master' into unique-atom-names

d965768

j-wags approved these changes Feb 4, 2020

View reviewed changes

j-wags merged commit 417886f into master Feb 4, 2020

j-wags deleted the unique-atom-names branch February 4, 2020 22:52

SimonBoothroyd mentioned this pull request Feb 6, 2020

Fix Packmol Unique Atom Names openforcefield/openff-evaluator#192

Merged

1 task

j-wags added this to the 0.7.0 milestone Feb 14, 2020

Ensure atom names are unique after construction #469

Ensure atom names are unique after construction #469

Uh oh!

Conversation

jchodera commented Nov 30, 2019

Uh oh!

jchodera commented Nov 30, 2019

Uh oh!

jchodera commented Dec 2, 2019

Uh oh!

j-wags left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jchodera commented Dec 14, 2019

Uh oh!

jchodera commented Dec 14, 2019

Uh oh!

j-wags commented Dec 14, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

j-wags commented Dec 14, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jchodera commented Dec 18, 2019

Uh oh!

jchodera commented Jan 8, 2020

Uh oh!

jchodera commented Feb 3, 2020

Uh oh!

j-wags commented Feb 4, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

j-wags commented Feb 4, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

j-wags left a comment

Choose a reason for hiding this comment

Uh oh!

codecov-io commented Feb 4, 2020

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

j-wags commented Dec 14, 2019 •

edited

Loading

j-wags commented Dec 14, 2019 •

edited

Loading

j-wags commented Feb 4, 2020 •

edited

Loading

j-wags commented Feb 4, 2020 •

edited

Loading