Added and integrated C++ graphium_cpp library, a Python module implem… #510

DomInvivo · 2024-04-16T14:10:29Z

Implemented in C++ for featurization and preprocessing optimizations, along with a few other optimizations, significantly reducing memory usage, disk usage, and processing time for large datasets.

Changelogs

Move all the molecular featurization (atoms, bonds, positional encodings, etc.) to C++
Enable dataloading directly from SMILES at runtime
Improve memory + speed of dataloading by >10X

Authors: Most changes from @ndickson-nvidia, with some minor adjustments from @DomInvivo

discussion related to that PR

That PR will make Graphium much faster and unlock new uses of positional encodings, since they will no longer be a bottleneck. SMILES -> PyG graph + positional encodings will now be computed directly during dataloading.

DomInvivo

Thanks a lot for the work there! It's pretty impressive!

I tried to check the code as thoroughly as possible, but I will go through it again once these first comments are addressed. I haven't checked the accuracy of the C++ code for positional encodings.

One recurrent comment is that we could potentially discard any possibility of featurizing with Python, and only support C++ featurization. Otherwise, everything will be much slower, since dataloading and transformation are now done in real time, and any option to parallelize and cache the featurization is now gone.

env.yml

expts/hydra-configs/architecture/toymix.yaml

graphium/config/_loader.py

DomInvivo · 2024-04-16T14:24:03Z

graphium/data/collate.py

+                if "features" in elem:
+                    num_nodes = [d["features"].num_nodes for d in elements]
+                    num_edges = [d["features"].num_edges for d in elements]
+                else:
+                    num_nodes = [d["num_nodes"] for d in elements]
+                    num_edges = [d["num_edges"] for d in elements]


I think this can be moved outside of the `for key in elem:` loop.

Done. Because we know that the only possible keys in MultitaskDataset are now "labels", "features", "num_nodes", and "num_edges", is it safe to remove support for other keys and just access these 4 directly, instead of looping over the keys of one item?

I think it's better to keep support for any key, in case a user wants to use other custom keys. Also, aren't some positional encodings using other keys?

All feature data's under "features" and all label data's under "labels". MultitaskDataset::__getitem__ does:

    if self.mol_file_data_offsets is None:
        datum = {"features": self.featurize_smiles(smiles_str)}
    else:
        datum = {
            "labels": self.load_graph_from_index(idx),
            "features": self.featurize_smiles(smiles_str),
        }

and returns datum after an error check.

Ahh yeah, you're right. Still, I think collation should be flexible, and fall back to default_collate for anything that is not features or labels.
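A minimal sketch of the fallback idea discussed in this thread, not the code that was merged. The function name `collate_with_fallback` and its parameters are hypothetical; in Graphium the fallback would presumably be `torch.utils.data.default_collate`, which is replaced here by a plain callable so the example stays dependency-free.

```python
from typing import Any, Callable, Dict, List


def collate_with_fallback(
    elements: List[Dict[str, Any]],
    special_collators: Dict[str, Callable[[List[Any]], Any]],
    fallback: Callable[[List[Any]], Any],
) -> Dict[str, Any]:
    """Collate a list of per-item dicts into a single batch dict.

    Keys present in `special_collators` (e.g. "features", "labels") use their
    dedicated handler; every other key is passed to `fallback`.
    Hypothetical helper for illustration only.
    """
    batch: Dict[str, Any] = {}
    # All items are assumed to share the keys of the first item.
    for key in elements[0]:
        values = [elem[key] for elem in elements]
        collator = special_collators.get(key, fallback)
        batch[key] = collator(values)
    return batch


items = [
    {"features": "graph_a", "labels": [1.0], "mol_id": 7},
    {"features": "graph_b", "labels": [0.0], "mol_id": 9},
]
batch = collate_with_fallback(
    items,
    special_collators={
        "features": tuple,                        # stand-in for PyG graph batching
        "labels": lambda ys: [y[0] for y in ys],  # stand-in for label stacking
    },
    fallback=list,  # stand-in for torch.utils.data.default_collate
)
```

With this shape, a custom key such as `mol_id` is still collated even though only "features" and "labels" have dedicated handlers.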

graphium/data/datamodule.py

graphium/graphium_cpp/features.cpp

graphium/graphium_cpp/random_walk.cpp

DomInvivo · 2024-04-16T21:07:06Z

graphium/graphium_cpp/spectral.cpp

+        // TODO: Decide what to do about legitimately complex eigenvalues.
+        // This should only occur with Normalization::INVERSE, because real, symmetric
+        // matrices have real eigenvalues.
+        // For now, just assume that they're supposed to be real and were only complex
+        // due to roundoff.


Eigenvectors of a graph Laplacian never need to be complex, even if D^{-1} L is non-symmetric.

The normalized graph Laplacian matrix, typically denoted as $L_{\text{norm}}$, is defined as:

$$L_{\text{norm}} = D^{-1} L$$

where $D$ is the degree matrix (a diagonal matrix with the degree of each vertex on the diagonal), $L$ is the Laplacian matrix defined as $L = D - A$ (with $A$ being the adjacency matrix of the graph), and $D^{-1}$ is the inverse of the degree matrix.

For an undirected graph, the matrix $L_{\text{norm}} = D^{-1} L$ can be written as:

$$L_{\text{norm}} = I - D^{-1} A$$

where $I$ is the identity matrix. Here, $D^{-1} A$ can be thought of as the matrix of transition probabilities for a random walk on the graph, and this matrix is not necessarily symmetric. However, $L_{\text{norm}}$ is similar to the symmetric matrix $D^{-1/2} L D^{-1/2}$ (which is another common form of the normalized Laplacian, often denoted as $\mathcal{L}$). The similarity transformation is given by:

$$L_{\text{norm}} = D^{-1/2} \mathcal{L} D^{1/2}$$

This implies that $L_{\text{norm}}$ and $\mathcal{L}$ have the same eigenvalues, and the eigenvectors of $L_{\text{norm}}$ are related to the eigenvectors of $\mathcal{L}$ by a scaling of $D^{-1/2}$. Since $\mathcal{L}$ is symmetric, its eigenvectors can be chosen to be real and orthogonal.

Therefore, the eigenvectors of $L_{\text{norm}}$ can also be chosen to be real. This conclusion holds as long as the graph is undirected and the degree matrix $D$ does not contain any zeros (i.e., there are no isolated vertices). Thus, in typical scenarios where the graph is undirected and connected, $L_{\text{norm}}$ does not have complex eigenvectors.

We could also use this to our advantage. If the inverse normalization is requested, we could just compute the regular eigenvectors/eigenvalues and then rescale the eigenvectors, which would allow us to use eigh.

Thanks for the explanation! Yeah, it would be good if the code can exclusively use eigh, to avoid the issues I was hitting with eig sometimes producing complex eigenvectors with non-zero imaginary components.

I've now implemented the inverse case in terms of eigh and removed the complex eigenvector and eigenvalue cases, because the documentation for eigh says that the types should be the same as the input types. I haven't tested it, but the idea is in there.
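The rescaling idea from this thread can be sketched in NumPy as follows. This is an illustration under the assumptions stated above (undirected graph, no isolated vertices), not the C++ code in spectral.cpp; the function name `random_walk_laplacian_eig` is hypothetical.

```python
import numpy as np


def random_walk_laplacian_eig(adjacency):
    """Eigen-decomposition of D^{-1} L using only the symmetric solver.

    Assumes a symmetric adjacency matrix with no zero-degree nodes.
    Hypothetical helper for illustration only.
    """
    adjacency = np.asarray(adjacency, dtype=np.float64)
    degree = adjacency.sum(axis=1)
    if np.any(degree <= 0):
        raise ValueError("isolated nodes make D non-invertible")

    inv_sqrt_deg = 1.0 / np.sqrt(degree)
    laplacian = np.diag(degree) - adjacency
    # Symmetric normalized Laplacian: D^{-1/2} L D^{-1/2}
    sym_laplacian = inv_sqrt_deg[:, None] * laplacian * inv_sqrt_deg[None, :]

    # eigh targets symmetric matrices, so eigenvalues and eigenvectors are real.
    eigenvalues, sym_eigenvectors = np.linalg.eigh(sym_laplacian)

    # Eigenvectors of D^{-1} L are D^{-1/2} times those of the symmetric form;
    # the eigenvalues are shared between the two matrices.
    eigenvectors = inv_sqrt_deg[:, None] * sym_eigenvectors
    return eigenvalues, eigenvectors


# Path graph on 3 nodes: check that (D^{-1} L) V = V diag(eigenvalues).
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=np.float64)
vals, vecs = random_walk_laplacian_eig(A)
D_inv_L = np.diag(1.0 / A.sum(axis=1)) @ (np.diag(A.sum(axis=1)) - A)
residual = np.abs(D_inv_L @ vecs - vecs * vals).max()
```

Note that the rescaled eigenvectors are no longer orthonormal in the usual sense; whether they should be re-normalized afterwards is a design choice this sketch leaves open.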

DomInvivo · 2024-04-16T21:07:51Z

graphium/graphium_cpp/spectral.cpp

+        }
+    }
+    else if (eigenvector_tensor.scalar_type() == c10::ScalarType::ComplexDouble) {
+        // TODO: Decide what to do about legitimately complex eigenvectors.


See comment above

tests/test_collate.py

…ng from the Python featurization yet), removed option to featurize to GraphData class instead of PyG Data class, added deprecation warnings to datamodule.py for parameters that are now unused, some cleanup in MultitaskFromSmilesDataModule::__init__, changed tensor index variables to properties, added preprocessing_n_jobs (not yet used), etc.

DomInvivo

A few minor comments left. Sorry that I contradicted myself on the deprecation; I think we can just say that this release of graphium 2.0 is not backward compatible, and avoid making the code ugly.

DomInvivo · 2024-05-03T02:21:13Z

graphium/data/datamodule.py

@@ -790,8 +788,7 @@ class MultitaskFromSmilesDataModule(BaseDataModule, IPUDataModuleModifier):
    def __init__(
        self,
        task_specific_args: Union[Dict[str, DatasetProcessingParams], Dict[str, Any]],
-        processed_graph_data_path: Optional[Union[str, os.PathLike]] = None,
-        dataloading_from: str = "ram",


I hate to contradict myself, but if we release a version 2.0, with the expectation that things break, can we just remove these options without the deprecation warning? I would keep a note in the docstring.

graphium/data/datamodule.py

DomInvivo · 2024-05-03T02:38:21Z

graphium/graphium_cpp/features.cpp

+    MaskNaNStyle mask_nan_style = MaskNaNStyle(mask_nan_style_int);
+    int64_t num_nans = 0;
+    int64_t nan_tensor_index = -1;
+    if (dtype == c10::ScalarType::Half) {


Is it necessary to repeat the exact same if/else if statement 3 times, except for the <int16_t> or <float> or <double>? Can you use a variable for that?

My C++ is a bit rusty, it has been years.

C++ templates (e.g. the <int16_t> or <float> or <double>) need to be able to be evaluated at compile-time, whereas the if statements are evaluated at runtime. These are the if statements that check the type at runtime once, so that it can be encoded in the template, and then it doesn't need to be evaluated at runtime again.

…changed create_all_features to create all tensors even if there are nans, so that the number of atoms can still be determined from the shape of the atom features tensor. Changed parse_mol to default to not reordering atoms, to match test order.

DomInvivo

Reviewed all files. Ready to merge!

Added and integrated C++ graphium_cpp library, a Python module implem…

5ffe261

…ented in C++ for featurization and preprocessing optimizations, along with a few other optimizations, significantly reducing memory usage, disk usage, and processing time for large datasets.

DomInvivo commented Apr 16, 2024

ndickson-nvidia added 11 commits April 17, 2024 11:17

Small changes to support not needing label data during data loading

8286383

Removed FakeDataset, FakeDataModule, and SingleTaskDataset. SingleTas…

dca9b2b

…kDataset is still used in test_dataset.py, but won't after it's changed in a later commit.

Removed newly deprecated options from yaml files

4ee35d4

Added support for limiting the number of threads used by prepare_and_…

cf23e37

…save_data

Fixed compiler warning about signed vs. unsigned comparison

5db0e2a

Fixed Python syntax issues

c75a452

Changed asymmetric inverse normalization type to be implemented using…

4aa1f85

… symmetric diagonalization, avoiding the need to handle complex eigenvectors and eigenvalues

Fixed compile errors

c53451a

Some simplification in collate.py

268e245

Deleting most of the Python featurization code

e032e8e

DomInvivo changed the base branch from main to graphium_2.0 April 22, 2024 14:55

ndickson-nvidia added 4 commits April 23, 2024 16:04

Implemented conformer generation in get_conformer_features, trying to…

bdefe89

… match behaviour from get_simple_mol_conformer Python code, but adding Hs, as recommended for conformer generation.

Deleted deprecated properties.py

5298444

Handle case of no label data in prepare_and_save_data. Also added con…

c38aa06

…catenate_strings function, though it's not used yet.

Changed prepare_data to support having no label data

86abf21

DomInvivo mentioned this pull request May 1, 2024

Remove forced constraint to cuda 11.2 with Graphium 3.0 #512

Open

Updated license passed to setup call in setup.py

80276da

DomInvivo commented May 3, 2024

ndickson-nvidia added 9 commits May 6, 2024 16:43

Changes to get test_dataset.py and test_multitask_datamodule.py passing

9492e62

Removed load_type option from test_training.py, because it's no longe…

d94097c

…r used

Updated comment in setup.py about how to build graphium_cpp package

11e6935

Removed deprecation warnings and deprecated parameters from datamodul…

a892068

…e.py, keeping the notes in the function comments

Recommended tweaks to extract_labels in multilevel_utils.py

38a5510

Fixed "else if"->"elif"

f7771b3

Rewrote test_pe_nodepair.py to use graphium_cpp

4256839

Rewrote test_pe_rw.py to use graphium_cpp. Comment update in test_pe_…

91c37a3

…nodepair.py

DomInvivo marked this pull request as ready for review June 5, 2024 04:18

ndickson-nvidia added 9 commits June 5, 2024 12:22

Hopefully explicitly installing graphium_cpp fixes the automated test…

77d27b5

…ing for now

Test fix

cb1df19

Another test fix

f3f6a0d

Another test fix

c5c0085

Make sure RDKit can find Boost headers

6dd827f

Reimplemented test_pos_transfer_funcs.py to test all supported conver…

59c84a2

…sions

Linting fixes

7bc8ade

Fixed collections.abs.Callable to typing.Callable for type hint

6903243

Removed file_opener and its test

9f38afb

DomInvivo deleted the branch datamol-io:graphium_3.0 June 27, 2024 21:21

DomInvivo closed this Jun 27, 2024

DomInvivo reopened this Jun 27, 2024

DomInvivo changed the base branch from graphium_2.0 to graphium_3.0 June 27, 2024 21:23

DomInvivo and others added 12 commits July 9, 2024 14:10

Fixed the issue with boolean masking, introduced by `F._canonical_mas…

5ab9ca9

…k` in `TransformerEncoder`

Fixed the float vs double issue in laplacian pos encoding

9c7504f

Added comment

f8358f3

Fixed the ipu tests by making sure that IPUStrategy is not imported…

692decc

… if not using IPUs

Update test.yml to only test python 3.10

8891e66

Removed positional encodings from the docs

c2d3c87

Merge remote-tracking branch 'origin/dom_unittest' into dom_unittest

d3d19d7

Upgraded python versions in the tests

0a1696f

Removed reference to old files now in C++

50265df

Downgraded python version

58fc2aa

Fixed other docs broken references

5852467

Merge pull request #1 from ndickson-nvidia/dom_unittest

ea9a775

Fixing unittest + mkdocs errors

DomInvivo assigned ndickson-nvidia Jul 9, 2024

DomInvivo commented Jul 9, 2024

DomInvivo merged commit 7f933b7 into datamol-io:graphium_3.0 Jul 9, 2024
3 of 5 checks passed

DomInvivo mentioned this pull request Jul 15, 2024

Graphium 3.0 #519

Draft

28 tasks
