-
Notifications
You must be signed in to change notification settings - Fork 249
Modify node regularization in kinetic families. #2856
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Modify node regularization in kinetic families. #2856
Conversation
|
Use testing_regularization.ipynb for testing. |
|
As a sanity check to see if this would work on a family other than Retroene, I ran this fix on 1+2_Cycloaddition family. Tree is generated and regularized without errors and passes final isomorphism check. Here's a file that shares the nodes that were unregularized, and the resulting group after regularization with my fix: 1+2_Cycloaddition.txt |
|
New groups and their complements are regularized at the same time in the same way: RMG-Py/rmgpy/data/kinetics/family.py Line 3005 in d65c5f3
I was able to identify an issue with leaf nodes though (and Root_N-4R!H->C is a leaf node). Since regularization dimensions are computed when we extend a group, leaf nodes where we didn't need to extend seem to only have the copied regularization dimensions of their parent. I added a step in PySIDT regularization where it runs extension expansion (without recursive extension expansion for the split exiting before collecting groups to be expanded, which is quite quick) on each leaf node before regularizing it and that seemed to fix things. |
|
Another comment, mostly to keep track of my thinking. There is definitely a more sophisticated way to fix this issue that doesn't involve reading in atomtypes directly from the training reactions of that node. It's probably better to keep with the architecture that Matt already has. Similar to the fix with the leaf nodes Matt mentioned, I see now that the problem should be solved if the problematic node then acts as a parent node and generates a further extension that is non-splitting on the atom of interest (even if that isn't the extension that is ultimately selected to extend the tree). This would update the reg_dict to include regularization information on the atom of interest in the parent (our problematic node). i.e. our problematic node can be passed regularization info for *5 if a non-splitting extension could be found on that atom. However, some debugging is showing me that attempts to extend this node doesn't propose any non-splitting extensions on *5 (only on *2 and *6). Hence why there's no regularization information for that atom in the group, and why it isn't regularized. I'm still figuring out how exactly Looks like all the training reactions at this node have their *5 atom as N or C, so I would think there should be a proposed extension on *5. Going to focus now on figuring out 1) whether or not there should be a non-splitting extension on *5 and 2) if there shouldn't be, how can we still pass regularization information to *5? |
|
Looks like this fix works on small families, however, I'm not seeing regularization on families that use the cascade algorithm (i.e. R_Addition_MultipleBond). For example, this node here remains unregularized (*3 should only have F as an atomtype). I can load in the tree, remove the children from this node (to simulate first time tree generation) and get the extension edge, and I get the correct regularization dimensions and can run simple regularization so that the node regularizes to: I think regularization dimensions clear between cascade algorithm steps (commit 5e753c9). Working on handling this. |
|
To save time, above commit was tested by regenerating tree of Retroene family instead of a bulky family. This family only has ~70 training reactions and doesn't require the cascade algorithm, but we can build the tree with cascading if we lower max_batch_size to 20 when generating tree. Confirmed that nodes in Retroene family are regularized when cascading is not used, but regularization is not performed on non-leaf nodes when cascading is used on this family. |
|
Used this fix to regenerate the R_Addition_MultipleBond family tree, and looks like regularization is now being performed on most nodes in this family. There are some that remain unregularized, but I think these are the nodes that have >400 training reactions. |
|
Hopefully the CI is now fixed, so please could you rebase onto the latest |
…tead of using their atom symbol to make an atomtype
…ows problematic nodes to generate atomExt extensions that aren't node splitting if the optimization dimension of the regularization dictionary is more specific than the atomtype at the atom of interest being extended. For example, if the atomtype of an atom labeled *5 is [Si, F, Li, N, C, P, S] and the regulatization dictionary has an optimization dimension that narrows down these atomtypes (i.e. reg_dim_atm[0] = <N,C>), then we can allow for atomExt extensions that change *5's atomtype to be [N,C] (rather than just [N] or just [C]). This way, we have an extension that narrows down *5 to <N,C> from [Si, F, Li, N, C, P, S] but also matches all of the training reactions at the node, so the regularization information (reg_dim_atm{1]) is passed to the group.
… regularization info is passed (but not actually extending them).
…cade algorithm. Allows optimization dimensions of reg_dim_atm dictionary to clear (first list in dictionary), but preserves the regularization dimension in reg_dim_atm (second list in dictionary).
214865a to
be27bb8
Compare




Motivation or Problem
Noticed that the nodes of some families were not being regularized. For example, in groups.py of the Retroene family, the following entry (below) specifies that the *5 atom can be a fluorine, but the entire training set of the family includes zero fluorinated reactions.
This indicates that node regularization was not carried out on this node. My understanding of node regularization is that it is performed to limit how "general" a node is and ensures that the node fits it's reaction data "tightly." Successful regularization on this node would mean that this node should not explicitly specify atoms like I, P, Br, Cl, Si, S, F, or Li if they are not included in the training reaction set. There are numerous other nodes in this family that are not regularized, and this is also occurring in other families. This is a problem because, if you were descending a fluorinated reaction down this tree, it would fall to a deeper depth down the tree to a node with fewer training reactions compared to the Root, but those fewer training reactions are not at all representative of the fluorinated reaction that we want to estimate a rate for. I could see this being an issue for other chemistries involving Br, Cl, Si, S, Li, etc.
After some digging, I think regularization is being skipped because:
Description of Changes
Different approach to passing the regularization information to each node. After an extension is selected, if that extension is of the type
atomExt(i.e. changing an atom's atomtype R -> C) that splits the training reactions of the parent node into an extension node and complement node, then regularization info is passed to the extension node AND the compliment node. Regularization info of compliment is found by analyzing all of the reactions that fit the complimentary node and determining which atomtypes are in those reactions. After this information is found, regularization info is also passed to the parent node (and this information is no longer an empty list).Testing
Simplest testing is with Retroene family.
[Si,Li,S,N,P,F,Br,I,Cl,C])