Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add comments to ff fitting example, + some function signatures/docstrings #2

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

lilyminium
Copy link

Hi @trevorgokey, I was going through the 07_besmarts_fit example and making some notes for myself on what each line does / the Python object structures, as the tutorial tends to show a large portion of code at once and pull out some highlights to explain. Just to check my understanding, do you reckon the comments are accurate or have I completely misunderstood some of what's going on? I do have some questions remaining on the data structures that I'll add as comments.

In case you think the notes will be useful to anyone else, I've opened this PR just to suggest adding them in.


# 2. Configure the objective tiers
final = fits.objective_tier()
# set the objectives for the final tier
# each value is an ``objective_config``
final.objectives = {
0: fits.objective_config_position(
Copy link
Author

@lilyminium lilyminium Aug 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too sure what the keys final.objectives represent -- do they represent order, or do they maybe correspond with the assignments.POSITIONS index? If the latter, constructing the dict with {assignments.POSITIONS: fits.objective_config_position(... might be clearer.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The objective keys are completely arbitrary ints; they represent order. I did it more for the case of when the the objective list gets split up they maintain their "index"

Copy link
Author

@lilyminium lilyminium Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thank you! So is it more of an administrative/logging device to track each objective individually, or does it affect how the objective is computed? If it's just a naming thing, are integers the only keys accepted and does it expect them to be contiguous?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They have no almost no impact on how the objective is computed, other than the order they are added (FIFO; put expensive things first). I haven't put much thought into the objective keys, but I've been assuming ints. Looking through the code, I think the only part that uses the objective key is when I break up the particular computation into pieces, I keep track of ownership using the key. So, I would guess anything hashable is usable as a key. For example I have this in optimizers_scipy:

tasks[(n, (i, (0,)))] = task

where n is the iteration number, i is the objective key in question, and (0,) refers to no parameter deltas (e.g. the f(x0) in the numerical derivatives. This is mostly useful for Hessians, where a molecule can have 100s of deltas to compute, and I break them into pieces and distribute them for computation. I collect the results based on the objective key (i above). That said, they shouldn't need to be continuous, just something hashable to organize computations with.

# create a graph and add it to the graph_db
g: graphs.graph = gcd.smiles_decode(smi)
# molecule_graph_id is the graph ID of the graph in the graph_db
molecule_graph_id: int = assignments.graph_db_add_graph(gdb, smi, g)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to use besmarts with multi-molecule data (e.g. dimers) at all?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stuff like dimers is the reason I made the much more complicated graph_db data structure.. however right now nearly everything assumes a single molecule, so it would need some work/checking to get going correctly. Each molecule coordinates would be a graph_db_column and the graph_db_row could be considered a single snapshot/frame, where each of these would be associated to the graph_db_graph of the same graph ID.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh I think I understand. So can a graph.graph contain multiple molecules? IIRC the graph_same check only checks encoded primitives so coordinates wouldn't differentiate the molecules. So I'm wondering if a dimer energy of identical molecules would need to reference the same graph ID twice, and how that would interact with graph_db_tables.gid.

(Just to check my understanding -- does the below sound right to you?)
Each graph_db_entry in the gdb contains graph_db_tables, which contains graph_db_graphs that are referenced by gid. Coordinates are stored separately by atom in graph_db_columns, which are collected in graph_db_rows. In the case of using an interaction energy between two different molecules, I think the value at gde.tables[assignments.ENERGIES].values[snapshot_i] would be the interaction energy at gde.tables[assignments.POSITIONS].graphs[gid].rows[snapshot_i]?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been treating all graphs as fully connected; I haven't gone into much thinking/testing whether disconnected graphs should be allowed as of yet. There are pros and cons to doing so, but right now I'm leaning towards having everything connected (and maybe adding empty bonds for specific problems; these would encode to . in the string representation). I believe everything in graphs is strictly 2D based, so yes the graph operations won't pick up on different conformations. The dimer would most certainly reference the same graph ID; the underlying graph is how selections are referenced in the graph_db_columns (if you have IC data; perhaps less relevant here). As an example you can query bonds of a graph, say (1, 2) by calling graphs.graph_bonds(gdb.graphs[gid]), which would be a selection in the graph_db_column, such as gde.tables[BONDS][gid][rid][cid][(1,2)] (if you put data there). There is no way to specify conformations to a graph object; think of them as part of a regular FF topology. The easiest way to do so would be a assignments.graph_assignment object, which are used in most places needing positions.

And yes, that is a quick way to make that work and probably the only sane way to do so, showing the usefulness of having a dumb values member :) I read that as the total energy for a particular snapshot. There may be two gids each with one column for snapshot_i, or the same gid with two columns with different sets of coordinates.

@trevorgokey
Copy link
Owner

Hi @lilyminium, I appreciate your effort here. I think the intention here is to leave this up for some time as a means to take notes, or are you done? In general I would be happy to merge your input as part of improving the example's clarity.

energy_data = f.read().split("\n")
# assume energy data is a list of floats with one energy per line
energy_values = [*map(float, [x for x in energy_data if x])]
energy_gdt.values.extend(energy_values)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this example contained multiple observations to fit to (so multiple gids I assume), how would besmarts map each energy value to the corresponding coordinates? I could be missing something here but I don't see any modification of energy_gdt.graphs (although I guess energies aren't actually part of this example!)

Copy link
Owner

@trevorgokey trevorgokey Aug 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I never really intended to use energies in this example, and this snippet is more destructive than constructive. The value member field was just a lazy way to append data without having to deal with the data structure as I worked on various objective functions (hessian freqs, DLC gradients, etc), so I consider the values member to be "user defined" :). To answer your question generally, the (proper/designed) way to do this would be to set up a graph_db table (with a null topology since it is per graph/molecule/system) and then you would assign energies as you would like. I note that a weakness here is that I don't have a great way to set a total energy anywhere, it would have to be a sum over all graph_db_graph objects for a particular row_id. This would give you a bunch of graph_db_columns, and for null topologies I reserve the selection (0,) to refer to the entire molecule, remembering that I use 1-index for selections for atoms so 0 is never used anywhere else. (I haven't actually done this anywhere, but this is how I will do it once I start fitting energies). This would be mostly for QM energies, but it should be possible to do IC energies instead for MM. One could set up a custom addressing to store the totals in the values member; this greatly simplifies computing the objective at the end, which is essentially why I added it in the first place.

@lilyminium
Copy link
Author

Thanks Trevor, I still have a couple questions that might help with the notes but in general happy for you to use this however you like, and merge if you think it'll be helpful! I can let you know when I'm done updating.

@trevorgokey
Copy link
Owner

I've updated this example and refactored it to use stuff that is in core.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants