Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added and integrated C++ graphium_cpp library, a Python module implem… #510

Merged
merged 54 commits into from
Jul 9, 2024

Conversation

DomInvivo
Copy link
Collaborator

@DomInvivo DomInvivo commented Apr 16, 2024

Implented in C++ for featurization and preprocessing optimizations, along with a few other optimizations, significantly reducing memory usage, disk usage, and processing time for large datasets.

Changelogs

  • Move all the molecular featurization (atoms, bonds, positional encodings, etc.) to C++
  • Enable dataloading directly from Smiles during runtime
  • Improve memory + speed of dataloading by >10X

Authors: Most changes from @ndickson-nvidia , with some minor adjustment from @DomInvivo

discussion related to that PR

That PR will allow Graphium to perform much much faster, and unlock a new usage of positional encodings since they won't be a bottleneck anymore. Smiles -> pyg graph + pos encodings will now be done directly during dataloading.

…ented in C++ for featurization and preprocessing optimizations, along with a few other optimizations, significantly reducing memory usage, disk usage, and processing time for large datasets.
Copy link
Collaborator Author

@DomInvivo DomInvivo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the work there! It's pretty impressive!

I tried to check the code as thoroughly as possible, but I will go through it again once these first comments are addressed. I haven't checked the accuracy of the C++ code for positional encodings.

One recurrent comment is that we could potentially discard any possibility of featurizing with Python, and only support C++ featurization. Otherwise, everything will be much slower since dataloading and transformation is now done in real-time, and any option to parallelize and cache the featurization is now gone.

env.yml Show resolved Hide resolved
expts/hydra-configs/architecture/toymix.yaml Outdated Show resolved Hide resolved
graphium/config/_loader.py Show resolved Hide resolved
Comment on lines 100 to 105
if "features" in elem:
num_nodes = [d["features"].num_nodes for d in elements]
num_edges = [d["features"].num_edges for d in elements]
else:
num_nodes = [d["num_nodes"] for d in elements]
num_edges = [d["num_edges"] for d in elements]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be moved outside of the for key in elem:

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Because we know that the only possible keys in MultitaskDataset are now "labels", "features", "num_nodes", and "num_edges", is it safe to remove support for other keys and just access these 4 directly, instead of looping over the keys of one item?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to keep support for any key, in case a user wants to use other custom keys. Also, aren't some positional encodings using other keys?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All feature data's under "features" and all label data's under "labels". MultitaskDataset::__getitem__ does:

        if self.mol_file_data_offsets is None:
            datum = { "features": self.featurize_smiles(smiles_str) }
        else:
            datum = {
                "labels": self.load_graph_from_index(idx),
                "features": self.featurize_smiles(smiles_str),
            }

and returns datum after an error check.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh yeah, you're right. Still, I think collation should be flexible, and call back to default_collate for anything not features or labels

graphium/data/datamodule.py Show resolved Hide resolved
graphium/graphium_cpp/features.cpp Show resolved Hide resolved
graphium/graphium_cpp/random_walk.cpp Show resolved Hide resolved
Comment on lines 52 to 56
// TODO: Decide what to do about legitimately complex eigenvalues.
// This should only occur with Normalization::INVERSE, because real, symmetric
// matrices have real eigenvalues.
// For now, just assume that they're supposed to be real and were only complex
// due to roundoff.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eigenvectors of every graph Laplacian are never complex, even if D-1 L is non-symmetric.

The normalized graph Laplacian matrix, typically denoted as ( L_{\text{norm}} ), is defined as:

[ L_{\text{norm}} = D^{-1} L ]

where ( D ) is the degree matrix (a diagonal matrix with the degree of each vertex on the diagonal), ( L ) is the Laplacian matrix defined as ( L = D - A ) (with ( A ) being the adjacency matrix of the graph), and ( D^{-1} ) is the inverse of the degree matrix.

For an undirected graph, the matrix ( L_{\text{norm}} = D^{-1} L ) can be written as:

[ L_{\text{norm}} = I - D^{-1} A ]

where ( I ) is the identity matrix. Here, ( D^{-1} A ) can be thought of as the matrix of transition probabilities for a random walk on the graph, and this matrix is not necessarily symmetric. However, ( L_{\text{norm}} ) is similar to the symmetric matrix ( D^{-1/2} L D^{-1/2} ) (which is another common form of the normalized Laplacian often denoted as ( \mathcal{L} )). The similarity transformation is given by:

[ L_{\text{norm}} = D^{-1/2} \mathcal{L} D^{1/2} ]

This implies that ( L_{\text{norm}} ) and ( \mathcal{L} ) have the same eigenvalues, and the eigenvectors of ( L_{\text{norm}} ) are related to the eigenvectors of ( \mathcal{L} ) by a scaling of ( D^{-1/2} ). Since ( \mathcal{L} ) is symmetric, its eigenvectors can be chosen to be real and orthogonal.

Therefore, the eigenvectors of ( L_{\text{norm}} ) can also be chosen to be real. This conclusion holds as long as the graph is undirected and the degree matrix ( D ) does not contain any zeros (i.e., there are no isolated vertices). Thus, in typical scenarios where the graph is undirected and connected, ( L_{\text{norm}} ) does not have complex eigenvectors.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also use this to our advantage. If inverse is asked, we could just compute the regular eigvecs/eigvals, then rescale the eigvecs, thus allowing us to use eigh

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation! Yeah, it would be good if the code can exclusively use eigh, to avoid the issues I was hitting with eig sometimes producing complex eigenvectors with non-zero imaginary components.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've now implemented the inverse case in terms of eigh and removed the complex eigenvector and eigenvalue cases, because the documentation for eigh says that the types should be the same as the input types. I haven't tested it, but the idea is in there.

}
}
else if (eigenvector_tensor.scalar_type() == c10::ScalarType::ComplexDouble) {
// TODO: Decide what to do about legitimately complex eigenvectors.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

tests/test_collate.py Show resolved Hide resolved
…kDataset is still used in test_dataset.py, but won't after it's changed in a later commit.
…ng from the Python featurization yet), removed option to featurize to GraphData class instead of PyG Data class, added deprecation warnings to datamodule.py for parameters that are now unused, some cleanup in MultitaskFromSmilesDataModule::__init__, changed tensor index variables to properties, added preprocessing_n_jobs (not yet used), etc.
… symmetric diagonalization, avoiding the need to handle complex eigenvectors and eigenvalues
@DomInvivo DomInvivo changed the base branch from main to graphium_2.0 April 22, 2024 14:55
… match behaviour from get_simple_mol_conformer Python code, but adding Hs, as recommended for conformer generation.
…catenate_strings function, though it's not used yet.
Copy link
Collaborator Author

@DomInvivo DomInvivo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments left. Sorry that I contradicted myself on the deprecation, I think we can just say that this release of graphium 2.0 is not backward compatible, and avoid making the code ugly.

@@ -790,8 +788,7 @@ class MultitaskFromSmilesDataModule(BaseDataModule, IPUDataModuleModifier):
def __init__(
self,
task_specific_args: Union[Dict[str, DatasetProcessingParams], Dict[str, Any]],
processed_graph_data_path: Optional[Union[str, os.PathLike]] = None,
dataloading_from: str = "ram",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hate to contradict myself, but if we release a version 2.0, with the expectations that things break, can we just remove this options without the deprecation warning? I would keep a note in the docstring.

graphium/data/datamodule.py Outdated Show resolved Hide resolved
MaskNaNStyle mask_nan_style = MaskNaNStyle(mask_nan_style_int);
int64_t num_nans = 0;
int64_t nan_tensor_index = -1;
if (dtype == c10::ScalarType::Half) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to repeat the exact same if/else if statement 3 times, except for the <int16_t> or <float> or <double>? Can you use a variable for that?

My C++ is a bit rusty, it has been years.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++ templates (e.g. the <int16_t> or <float> or <double>) need to be able to be evaluated at compile-time, whereas the if statements are evaluated at runtime. These are the if statements that check the type at runtime once, so that it can be encoded in the template, and then it doesn't need to be evaluated at runtime again.

@DomInvivo DomInvivo marked this pull request as ready for review June 5, 2024 04:18
@DomInvivo DomInvivo deleted the branch datamol-io:graphium_3.0 June 27, 2024 21:21
@DomInvivo DomInvivo closed this Jun 27, 2024
@DomInvivo DomInvivo reopened this Jun 27, 2024
@DomInvivo DomInvivo changed the base branch from graphium_2.0 to graphium_3.0 June 27, 2024 21:23
Copy link
Collaborator Author

@DomInvivo DomInvivo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all files. Ready to merge!

@DomInvivo DomInvivo merged commit 7f933b7 into datamol-io:graphium_3.0 Jul 9, 2024
3 of 5 checks passed
@DomInvivo DomInvivo mentioned this pull request Jul 15, 2024
28 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants