-
Notifications
You must be signed in to change notification settings - Fork 249
Use rdkit for SSSR and RCs (bug fix + Python upgrade)
#2796
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
99bf00c to
8d6fe64
Compare
|
Cantera 2.6 isn't available for Python 3.12, so this PR will also need to upgrade the Cantera version to 3 as mostly completed in #2751 |
|
A note for the path forward on this PR - the |
|
I ended up moving all the functions in |
|
Addressed some of the failures, but still a few nontrivial challenges: TODO list:
The tests are just failing mainly for the aromatic compounds, indicating there's some issue with the implementation of the code. Since everything runs, the code "flow" seems clean, just need to iron out the implementation. |
|
We are now very close to passing all of the unit tests. There are just a couple major bugs to resolve, namely:
|
|
One of the last failing tests is @rwest, since you're the last remaining dev who was involved in some capacity with this part of the code, do you know why we have that assert statement? I would think it's better to just plot a "dot" than to outright crash out. While we're at it, the |
I don't. I can see that if there is no straight chain backbone that a function to
I haven't investigated the full stack trace. But some safeguard, or exception handling, sounds appropriate.
We can try it. I expect RDKit at the time wasn't being helpful. Hopefully the commit messages offer clues? (this is why we should write helpful commit messages that explain why things are being done). |
|
Is this related? |
regarding the draw unit test@rwest yes, thanks, #2744 looks relevant. The test that's failing is for this charged species: maybe failing due to the missing X-O connectivity, and so no backbone? Edit: I tried modifying the connectivity s.t. there is an apparent backbone. But, I still run into this problem even after changing the adjacency list to So it is more fundamentally a problem with the species , than a unit-test specific one. Note about sanitizationFor posterity: some molecules that fail the first kekulization step include: As implemented in this PR, the |
|
Regarding the The test case is drawn as a cyclic molecule, even though it's not truly "cyclic". So it should take the However, it thinks that So, how to proceed? a few options...
|
|
I think using RDKit for drawing as much as possible is fine, if it does a good job. Probably we ran into issues in the past that may not persist with more recent versions of RDKit, so it's fine to revisit past decisions. We should continue to put reasons for things in commit messages, to make life easier for future developers (our future selves included). |
|
OK. Maybe we can incorporate the fix into #2838 and then rebase onto |
In ReactionMechanismGenerator#2744 and ReactionMechanismGenerator#2796 it was found that charge-separated bidentate species can have issues due to ring perception conflicts. The previous implementation also by default did not use the rdkit backend for charged species, but this was decided many years ago (~10 years!) In the meantime, RDKit conformer generation has improved and likely this we can just use RDKit by default, which would avoid the pesky edge-case issues for ions/zwitterions. In case the old behavior is desired, use_rdkit can be set to False.
In ReactionMechanismGenerator#2744 and ReactionMechanismGenerator#2796 it was found that charge-separated bidentate species can have issues due to ring perception conflicts. The previous implementation also by default did not use the rdkit backend for charged species, but this was decided many years ago (~10 years!) In the meantime, RDKit conformer generation has improved and likely this we can just use RDKit by default, which would avoid the pesky edge-case issues for ions/zwitterions. In case the old behavior is desired, use_rdkit can be set to False.
7598c98 to
019362c
Compare
In ReactionMechanismGenerator#2744 and ReactionMechanismGenerator#2796 it was found that charge-separated bidentate species can have issues due to ring perception conflicts. The previous implementation also by default did not use the rdkit backend for charged species, but this was decided many years ago (~10 years!) In the meantime, RDKit conformer generation has improved and likely this we can just use RDKit by default, which would avoid the pesky edge-case issues for ions/zwitterions. In case the old behavior is desired, use_rdkit can be set to False.
6e6490a to
e92bec0
Compare
In #2744 and #2796 it was found that charge-separated bidentate species can have issues due to ring perception conflicts. The previous implementation also by default did not use the rdkit backend for charged species, but this was decided many years ago (~10 years!) In the meantime, RDKit conformer generation has improved and likely this we can just use RDKit by default, which would avoid the pesky edge-case issues for ions/zwitterions. In case the old behavior is desired, use_rdkit can be set to False.
|
With the merger of e92bec0 this PR's CI should now pass - let's see! |
|
@jonwzheng the overnight test failures look spurious - re-running them now. |
We were raising a string, not an exception.
…_cycle We already have this method. The one introduced in c49650c (then rewritten twice) doesn't seem necessarry. This seems to be twice as fast, for a certain test set of molecules. old: %%timeit for mol in mols: mol.identify_ring_membership() # via rdkit FastFindRings for a in mol.atoms: pass 4.41 ms ± 368 μs per loop (mean ± std. dev. of 7 runs, 100 loops each) new: %%timeit for mol in mols: for a in mol.atoms: a.props["inRing"] = mol.is_vertex_in_cycle(a) 2.17 ms ± 203 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
…thods. get_symmetrized_smallest_set_of_smallest_rings is now the only function that works and is appropriately named. get_smallest_set_of_smallest_rings and get_relevant_cycles now just raise NotImplementedError, because they are not actually implemented. This will break things until calls to those methods are replaced everywhere.
get_relevant_cycles and get_smallest_set_of_smallest_rings now just issue a deprecation warning, and return the result of get_symmetrized_smallest_set_of_smallest_rings.
This reverts commit 2d84bbc.
Hopefully nowhere NEEDED the "relevant cycles" which I think might sometimes be larger set than the symmetrized SSSR. Although it is still not, I think, the same as all possible cycles.
Now that this is the only ring-finding algorithm we have, we are forced to use it in bits of code that were previously using either the SSSR or the Relevant Cycles sets of rings.
Since get_smallest_set_of_smallest_rings no longer works, and get_symmetrized_smallest_set_of_smallest_rings is in most cases a more helpful set of rings anyway, we are using the new get_symmetrized_smallest_set_of_smallest_rings method. For some polycyclic species (like cubane) this set is larger. But probably better in most cases.
Because relevant_cycles no longer functions, we can no longer test that it returns atoms in a ring-like order. If we restore the relevant_cycles we should restore this test. I have left test_cycle_list_order_sssr in place, which tests the new get_symmetrized_smallest_set_of_smallest_rings
If you had remove_h = True and sanitize="partial", it would try passing that along to RDKit, which would not expect it. Now we remove H's (without sanitizing) before possibly doing the sanitizing.
So you don't need to create an instance of Fragment() just in order to call the method (seeing as that seems to be how it is mostly used.) Probably it should be a module level function, but this is a minimal change.
Now that we use RDKit to detect rings, we create a lot of rdkit molecules in which we don't care about the bond orders, sometimes we have non-integer or weird bond orders, and we also don't want to waste time sanitizing them since all we care about is connectivity of the molecular graph. So if you know you're calling to_rdkit_mol in this setting, call it with ignore_bond_orders=True and sanitize=False.
This should save complication and time. Also, don't sanitize. All we're going to ask it for is ring information. For that all bonds (that we give it) count. If we want it to ignore H-bonds, or vdW bonds, for some scenarios, then we should figure that out. But often we use this ring detection code for things like drawing, which does want to use those bonds.
This reverts commit 55f8fc4. We should, during ring detection, now be not attempting to convert bond orders into strings, so this warning should not be triggered. We can restore the debug level if this becomes too noisy.
We do a call to detect_cutting_label (which is a complicated function to find them buried inside SMILES strings) passing it the label of every atom. (Though maybe these are mostly ''?). This commit checks whether a simpler approach would work: just checking if the atom.label is either 'R' or 'L'. If that's the case, we can then make the change.
As far as I can tell, cutting labels can only be "L" or "R". I had a previous commit that checked this is equivalent, and by running the unit tests and regression tests could not find a counterexample. This check is much simpler, and about 200 times faster, than the previous call to detect_cutting_label. Also simplified the dictionary structure.
…rnings. Previously, if you call to_rdkit_mol on a Fragment, it would always return the mapping, but if you called it on a Molecule, it would by default not. Now if you call it with return_mapping=False (or not specified) then it doesn't return the mapping, and if you call it with return_mapping=True, then it does. Internally, it calls converter.to_rdkit_mol with return_mapping=True, but doesn't pass it back unless asked. Now the behavior of thing.to_rdkit_mol() is the same whether thing is a Molecule or a Fragment (which is a subclass of Molecule). Should reduce number of warnings, eg. when drawing fragments. Also cleaned up the code a bit, and added some docstrings,
…_rdkit_mol The latest change of to_rdkit_mol means that both Molecule and Fragment return the same signature, and the calling code doesn't need to cope with unexpected tuples.
The comment suggests it was going through Geometry in order to save the mapping, but this can be done with a direct call to to_rdkit_mol using return_mapping=True.
37fd69e to
8715ef1
Compare
|
@rwest I think we figured out the last little detail holding this up. First and foremost, the two tests currently failing are testing the same thing: "can these molecule be correctly detected as bicyclic?" (so the second failing test is redundant). Looking at the actual reason for the failure, it has to do with the way RDKit returns rings for ambiguous systems:
For these two molecules, there are three rings of size 6. The 'true' SSSR is two, so GetSSSR makes an arbitrary but reproducible choice of two of the rings, whereas GetSymmSSSR returns all three. This is an exact case of "The SSSR Problem" described in the RDKit docs: https://www.rdkit.org/docs/GettingStartedInPython.html#the-sssr-problem The exact solution here is not clear:
I think the first option is the best one. But let me know your thoughts. I expect that once we fix the bicyclic detection aspect of this, the numerical results of the tests will also change, but that's an easy fix. |
|
It might be as simple as detecting bicyclics using RDKit's But I think it's important to know where this ring information is used. I think it is the fused ring decomposition algorithm for predicting thermo. If we don't have an exact drop-in replacement, then I fear at least all the groups will need re-training, and in worst case the entire algorithm changing. Is there anyone still on the team who is familiar with that code / algorithm? |
|
I think that proposed solution would be fine. I can't think of many cases where this would be an issue, anyway. My understanding is that this design choice will just change which rings are picked when the decision is arbitrary. The tress were presumably trained using some equally arbitrary decision, so I don't know if they need to be refit (?) @jonwzheng do you know who might know this better? |
|
I didn't catch until just now that they changed the API and GetSSSR now actually returns a sequence of rings, not just the count. |
I don't think any current members will know. @mjohnson541 did some of the automatic thermo and kinetic re-fitting, and before that @mliu49 worked on polycyclic corrections. There are some examples here like the correction for barrelene-related groups that seem to have some hints. |
|
@rwest in that case, can we just go ahead and switch to |
|
Worth a try? I'm a bit unsure how/if they solved the inconsistency problem that led them to earlier only report the number of SSSSR and not the set. |


Currently we use
RingDecomposerLibfor finding the Smallest Set of Smallest Rings and getting the Relevant Cycles. This package does not support Python 3.10+ and is thus blocking further upgrades to RMG.@KnathanM in particular is looking to get RMG to Python 3.11 so as to add support for ChemProp v2.
I believe we can just use RDKit to do these operations instead. The original paper mentions that the functionality was being moved upstream to RDKit. With the help of AI I've taken just a first pass at reimplementing, with the special note that:
get_deterministic_sssris not really deterministic #2562This PR will be a draft for now, as it is predicated on Python 3.9 already being available (which it nearly is in #2741)
Motivation or Problem
A clear and concise description of what what you're trying to fix or improve. Please reference any issues that this addresses.
Description of Changes
A clear and concise description of what what you've changed or added.
Testing
A clear and concise description of testing that you've done or plan to do.
Reviewer Tips
Suggestions for verifying that this PR works or other notes for the reviewer.