-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Graphium 3.0 #519
Draft
DomInvivo
wants to merge
189
commits into
main
Choose a base branch
from
graphium_3.0
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Graphium 3.0 #519
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…ented in C++ for featurization and preprocessing optimizations, along with a few other optimizations, significantly reducing memory usage, disk usage, and processing time for large datasets.
…kDataset is still used in test_dataset.py, but won't after it's changed in a later commit.
…ng from the Python featurization yet), removed option to featurize to GraphData class instead of PyG Data class, added deprecation warnings to datamodule.py for parameters that are now unused, some cleanup in MultitaskFromSmilesDataModule::__init__, changed tensor index variables to properties, added preprocessing_n_jobs (not yet used), etc.
… symmetric diagonalization, avoiding the need to handle complex eigenvectors and eigenvalues
… match behaviour from get_simple_mol_conformer Python code, but adding Hs, as recommended for conformer generation.
…catenate_strings function, though it's not used yet.
…le with `torchmetrics`
…ents on the `Predictor` for Todos
…changed create_all_features to create all tensors even if there are nans, so that the number of atoms can still be determined from the shape of the atom features tensor. Changed parse_mol to default to not reordering atoms, to match test order.
…e.py, keeping the notes in the function comments
…_laplacian_eigendecomp to accept `const int32_t*` instead of `const std::vector<int32_t>*`
This was referenced Sep 23, 2024
…m into atom_order
Reorder atoms in label data (Fixes 502) + documenting C++
Remove IPU (again)
Integrate finetuning & fingerprinting
PR for packaging tasks here: #531 |
Added C++ file description comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Changelogs
Moving to Graphium 3.0! This will be a large PR that basically regroups many other PRs there. So for specific changes, please consult the original PRs.
discussion related to that PR
Todo's before merging
Before merging Graphium 3.0, we need to do the following tests.
Validating the C++
Assigned to @WenkelF and @AnujaSomthankar, with @ndickson-nvidia to help fix issues if any
The C++ changes were brought by the PR #510 , and there were a lot of unit-tests and validation using the ToyMix dataset. What's left to be done is validating that we can reproduce the experimental results of pre-training a MolGPS model.
Validate that training is fasterTraining is not faster on our cluster, but expected to be faster where disk reading is limiting the speed.Validating the torchmetrics
Assigned to @WenkelF and @AnujaSomthankar, with @DomInvivo to help fix issues if any
Improving the cuda support
Assigned to @WenkelF and @AnujaSomthankar
cuda-version
and remove restriction to 11.2. Close Remove forced constraint to cuda 11.2 with Graphium 3.0 #512Fixing the node ordering issue
Assigned to @ndickson-nvidia , see PR #521
Support for Mixed precision
Supporting mixed precision should be easy with lightning. However, we face issues that some of the tasks are very sparse and require float32. What we suggest is to have a custom mixed-precision that doesn't affect the task heads, but only the body of the GNN network.
[ ] Implement custom mixed precision.Removing the IPU support #525
Assigned to @DomInvivo
Since Graphcore is no longer maintaining IPU support in lightning, it is best to remove it from Graphium 3.0. It will stay compatible with 2.0, and can be brought back if necessary, afterwards. (We got the approval from GraphCore for this)
Command line
Assigned to @WenkelF
Packaging
Assigned to @Andrewq11
[ ] Make sure that the package can be installed via pipthe pip package doesn't have headers, so we can't release there.graphium
andgraphium_cpp
as 2 different packages[ ] Support numpy >= 2.0Need to wait for Pyg to supportnumpy>=2.0
For now, we are constrained by the rdkit version == 2024.03.4 due to missing headers in more recent releases. And also can only support conda due to missing headers for pip.
Polaris
[ ] Add data download from PolarisNot for initial releaseLinting
black
linting on the code. Wait for last-minute to avoid cluttering the PR.2024-11-14 update of what's missing