-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added and integrated C++ graphium_cpp library, a Python module implem… #510
Added and integrated C++ graphium_cpp library, a Python module implem… #510
Conversation
…ented in C++ for featurization and preprocessing optimizations, along with a few other optimizations, significantly reducing memory usage, disk usage, and processing time for large datasets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the work there! It's pretty impressive!
I tried to check the code as thoroughly as possible, but I will go through it again once these first comments are addressed. I haven't checked the accuracy of the C++ code for positional encodings.
One recurrent comment is that we could potentially discard any possibility of featurizing with Python, and only support C++ featurization. Otherwise, everything will be much slower since dataloading and transformation is now done in real-time, and any option to parallelize and cache the featurization is now gone.
graphium/data/collate.py
Outdated
if "features" in elem: | ||
num_nodes = [d["features"].num_nodes for d in elements] | ||
num_edges = [d["features"].num_edges for d in elements] | ||
else: | ||
num_nodes = [d["num_nodes"] for d in elements] | ||
num_edges = [d["num_edges"] for d in elements] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be moved outside of the for key in elem:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Because we know that the only possible keys in MultitaskDataset
are now "labels"
, "features"
, "num_nodes"
, and "num_edges"
, is it safe to remove support for other keys and just access these 4 directly, instead of looping over the keys of one item?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to keep support for any key, in case a user wants to use other custom keys. Also, aren't some positional encodings using other keys?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All feature data's under "features"
and all label data's under "labels"
. MultitaskDataset::__getitem__
does:
if self.mol_file_data_offsets is None:
datum = { "features": self.featurize_smiles(smiles_str) }
else:
datum = {
"labels": self.load_graph_from_index(idx),
"features": self.featurize_smiles(smiles_str),
}
and returns datum
after an error check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh yeah, you're right. Still, I think collation should be flexible, and call back to default_collate
for anything not features
or labels
graphium/graphium_cpp/spectral.cpp
Outdated
// TODO: Decide what to do about legitimately complex eigenvalues. | ||
// This should only occur with Normalization::INVERSE, because real, symmetric | ||
// matrices have real eigenvalues. | ||
// For now, just assume that they're supposed to be real and were only complex | ||
// due to roundoff. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eigenvectors of every graph Laplacian are never complex, even if D-1 L is non-symmetric.
The normalized graph Laplacian matrix, typically denoted as ( L_{\text{norm}} ), is defined as:
[ L_{\text{norm}} = D^{-1} L ]
where ( D ) is the degree matrix (a diagonal matrix with the degree of each vertex on the diagonal), ( L ) is the Laplacian matrix defined as ( L = D - A ) (with ( A ) being the adjacency matrix of the graph), and ( D^{-1} ) is the inverse of the degree matrix.
For an undirected graph, the matrix ( L_{\text{norm}} = D^{-1} L ) can be written as:
[ L_{\text{norm}} = I - D^{-1} A ]
where ( I ) is the identity matrix. Here, ( D^{-1} A ) can be thought of as the matrix of transition probabilities for a random walk on the graph, and this matrix is not necessarily symmetric. However, ( L_{\text{norm}} ) is similar to the symmetric matrix ( D^{-1/2} L D^{-1/2} ) (which is another common form of the normalized Laplacian often denoted as ( \mathcal{L} )). The similarity transformation is given by:
[ L_{\text{norm}} = D^{-1/2} \mathcal{L} D^{1/2} ]
This implies that ( L_{\text{norm}} ) and ( \mathcal{L} ) have the same eigenvalues, and the eigenvectors of ( L_{\text{norm}} ) are related to the eigenvectors of ( \mathcal{L} ) by a scaling of ( D^{-1/2} ). Since ( \mathcal{L} ) is symmetric, its eigenvectors can be chosen to be real and orthogonal.
Therefore, the eigenvectors of ( L_{\text{norm}} ) can also be chosen to be real. This conclusion holds as long as the graph is undirected and the degree matrix ( D ) does not contain any zeros (i.e., there are no isolated vertices). Thus, in typical scenarios where the graph is undirected and connected, ( L_{\text{norm}} ) does not have complex eigenvectors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also use this to our advantage. If inverse
is asked, we could just compute the regular eigvecs/eigvals, then rescale the eigvecs, thus allowing us to use eigh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation! Yeah, it would be good if the code can exclusively use eigh
, to avoid the issues I was hitting with eig
sometimes producing complex eigenvectors with non-zero imaginary components.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've now implemented the inverse case in terms of eigh
and removed the complex eigenvector and eigenvalue cases, because the documentation for eigh says that the types should be the same as the input types. I haven't tested it, but the idea is in there.
graphium/graphium_cpp/spectral.cpp
Outdated
} | ||
} | ||
else if (eigenvector_tensor.scalar_type() == c10::ScalarType::ComplexDouble) { | ||
// TODO: Decide what to do about legitimately complex eigenvectors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
…kDataset is still used in test_dataset.py, but won't after it's changed in a later commit.
…ng from the Python featurization yet), removed option to featurize to GraphData class instead of PyG Data class, added deprecation warnings to datamodule.py for parameters that are now unused, some cleanup in MultitaskFromSmilesDataModule::__init__, changed tensor index variables to properties, added preprocessing_n_jobs (not yet used), etc.
… symmetric diagonalization, avoiding the need to handle complex eigenvectors and eigenvalues
… match behaviour from get_simple_mol_conformer Python code, but adding Hs, as recommended for conformer generation.
…catenate_strings function, though it's not used yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor comments left. Sorry that I contradicted myself on the deprecation, I think we can just say that this release of graphium 2.0 is not backward compatible, and avoid making the code ugly.
@@ -790,8 +788,7 @@ class MultitaskFromSmilesDataModule(BaseDataModule, IPUDataModuleModifier): | |||
def __init__( | |||
self, | |||
task_specific_args: Union[Dict[str, DatasetProcessingParams], Dict[str, Any]], | |||
processed_graph_data_path: Optional[Union[str, os.PathLike]] = None, | |||
dataloading_from: str = "ram", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hate to contradict myself, but if we release a version 2.0, with the expectations that things break, can we just remove this options without the deprecation warning? I would keep a note in the docstring.
MaskNaNStyle mask_nan_style = MaskNaNStyle(mask_nan_style_int); | ||
int64_t num_nans = 0; | ||
int64_t nan_tensor_index = -1; | ||
if (dtype == c10::ScalarType::Half) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to repeat the exact same if/else if statement 3 times, except for the <int16_t>
or <float>
or <double>
? Can you use a variable for that?
My C++ is a bit rusty, it has been years.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C++ templates (e.g. the <int16_t>
or <float>
or <double>
) need to be able to be evaluated at compile-time, whereas the if statements are evaluated at runtime. These are the if statements that check the type at runtime once, so that it can be encoded in the template, and then it doesn't need to be evaluated at runtime again.
…changed create_all_features to create all tensors even if there are nans, so that the number of atoms can still be determined from the shape of the atom features tensor. Changed parse_mol to default to not reordering atoms, to match test order.
…e.py, keeping the notes in the function comments
…k` in `TransformerEncoder`
… if not using IPUs
Fixing unittest + mkdocs errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed all files. Ready to merge!
Implented in C++ for featurization and preprocessing optimizations, along with a few other optimizations, significantly reducing memory usage, disk usage, and processing time for large datasets.
Changelogs
Authors: Most changes from @ndickson-nvidia , with some minor adjustment from @DomInvivo
discussion related to that PR
That PR will allow Graphium to perform much much faster, and unlock a new usage of positional encodings since they won't be a bottleneck anymore.
Smiles -> pyg graph + pos encodings
will now be done directly during dataloading.