Added and integrated C++ graphium_cpp library, a Python module implem… #510

Merged · 54 commits · Jul 9, 2024
Changes from 1 commit

Commits
5ffe261
Added and integrated C++ graphium_cpp library, a Python module implem…
ndickson-nvidia Apr 13, 2024
8286383
Small changes to support not needing label data during data loading
ndickson-nvidia Apr 17, 2024
dca9b2b
Removed FakeDataset, FakeDataModule, and SingleTaskDataset. SingleTa…
ndickson-nvidia Apr 17, 2024
8304210
Removed option to featurize using Python, (but didn't delete everythi…
ndickson-nvidia Apr 17, 2024
4ee35d4
Removed newly deprecated options from yaml files
ndickson-nvidia Apr 18, 2024
cf23e37
Added support for limiting the number of threads used by prepare_and_…
ndickson-nvidia Apr 18, 2024
5db0e2a
Fixed compiler warning about signed vs. unsigned comparison
ndickson-nvidia Apr 18, 2024
c75a452
Fixed Python syntax issues
ndickson-nvidia Apr 18, 2024
4aa1f85
Changed asymmetric inverse normalization type to be implemented using…
ndickson-nvidia Apr 18, 2024
c53451a
Fixed compile errors
ndickson-nvidia Apr 18, 2024
268e245
Some simplification in collate.py
ndickson-nvidia Apr 19, 2024
e032e8e
Deleting most of the Python featurization code
ndickson-nvidia Apr 19, 2024
bdefe89
Implemented conformer generation in get_conformer_features, trying to…
ndickson-nvidia Apr 23, 2024
5298444
Deleted deprecated properties.py
ndickson-nvidia Apr 23, 2024
c38aa06
Handle case of no label data in prepare_and_save_data. Also added con…
ndickson-nvidia Apr 25, 2024
86abf21
Changed prepare_data to support having no label data
ndickson-nvidia Apr 25, 2024
80276da
Updated license passed to setup call in setup.py
ndickson-nvidia May 2, 2024
9492e62
Changes to get test_dataset.py and test_multitask_datamodule.py passing
ndickson-nvidia May 6, 2024
d94097c
Removed load_type option from test_training.py, because it's no longe…
ndickson-nvidia May 6, 2024
11e6935
Updated comment in setup.py about how to build graphium_cpp package
ndickson-nvidia May 14, 2024
ff93c2d
Rewrote test_featurizer.py. Fixed bug in mask_nans C++ function, and …
ndickson-nvidia May 14, 2024
a892068
Removed deprecation warnings and deprecated parameters from datamodul…
ndickson-nvidia May 23, 2024
38a5510
Recommended tweaks to extract_labels in multilevel_utils.py
ndickson-nvidia May 23, 2024
f7771b3
Fixed "else if"->"elif"
ndickson-nvidia May 23, 2024
4256839
Rewrote test_pe_nodepair.py to use graphium_cpp
ndickson-nvidia May 24, 2024
91c37a3
Rewrote test_pe_rw.py to use graphium_cpp. Comment update in test_pe_…
ndickson-nvidia May 24, 2024
f347a0d
Rewrote test_pe_spectral.py to use graphium_cpp
ndickson-nvidia May 24, 2024
26b5531
Removed tests/test_positional_encodings.py, because it's a duplicate …
ndickson-nvidia May 24, 2024
1ded38b
Fixed handling of disconnected components vs. single component for la…
ndickson-nvidia May 28, 2024
314d636
Fixed compile warnings in one_hot.cpp
ndickson-nvidia May 28, 2024
e49b4da
Rewrote test_positional_encoders.py, though it's still failing the te…
ndickson-nvidia May 28, 2024
f001464
Removed commented out lines from setup.py
ndickson-nvidia Jun 4, 2024
2782fbc
Ran linting on Python files
ndickson-nvidia Jun 4, 2024
77d27b5
Hopefully explicitly installing graphium_cpp fixes the automated test…
ndickson-nvidia Jun 5, 2024
cb1df19
Test fix
ndickson-nvidia Jun 5, 2024
f3f6a0d
Another test fix
ndickson-nvidia Jun 5, 2024
c5c0085
Another test fix
ndickson-nvidia Jun 5, 2024
6dd827f
Make sure RDKit can find Boost headers
ndickson-nvidia Jun 5, 2024
59c84a2
Reimplemented test_pos_transfer_funcs.py to test all supported conver…
ndickson-nvidia Jun 12, 2024
7bc8ade
Linting fixes
ndickson-nvidia Jun 12, 2024
6903243
Fixed collections.abs.Callable to typing.Callable for type hint
ndickson-nvidia Jun 12, 2024
9f38afb
Removed file_opener and its test
ndickson-nvidia Jun 17, 2024
5ab9ca9
Fixed the issue with boolean masking, introduced by `F._canonical_mas…
DomInvivo Jul 9, 2024
9c7504f
Fixed the float vs double issue in laplacian pos encoding
DomInvivo Jul 9, 2024
f8358f3
Added comment
DomInvivo Jul 9, 2024
692decc
Fixed the ipu tests by making sure that `IPUStrategy` is not imported…
DomInvivo Jul 9, 2024
8891e66
Update test.yml to only test python 3.10
DomInvivo Jul 9, 2024
c2d3c87
Removed positional encodings from the docs
DomInvivo Jul 9, 2024
d3d19d7
Merge remote-tracking branch 'origin/dom_unittest' into dom_unittest
DomInvivo Jul 9, 2024
0a1696f
Upgraded python versions in the tests
DomInvivo Jul 9, 2024
50265df
Removed reference to old files now in C++
DomInvivo Jul 9, 2024
58fc2aa
Downgraded python version
DomInvivo Jul 9, 2024
5852467
Fixed other docs broken references
DomInvivo Jul 9, 2024
ea9a775
Merge pull request #1 from ndickson-nvidia/dom_unittest
ndickson-nvidia Jul 9, 2024
1 change: 1 addition & 0 deletions LICENSE
@@ -189,6 +189,7 @@
Copyright 2023 Valence Labs
Copyright 2023 Recursion Pharmaceuticals
Copyright 2023 Graphcore Limited
Copyright 2024 NVIDIA CORPORATION & AFFILIATES

Various Academic groups have also contributed to this software under
the given license. These include, but are not limited, to the following
2 changes: 1 addition & 1 deletion env.yml
@@ -28,7 +28,7 @@ dependencies:
- gcsfs >=2021.6

# ML packages
- cuda-version # works also with CPU-only system.
- cuda-version == 11.2 # works also with CPU-only system.
- pytorch >=1.12
- lightning >=2.0
- torchmetrics >=0.7.0,<0.11
1 change: 0 additions & 1 deletion expts/configs/config_mpnn_10M_b3lyp.yaml
@@ -93,7 +93,6 @@ datamodule:
featurization_progress: True
featurization_backend: "loky"
processed_graph_data_path: "../datacache/b3lyp/"
dataloading_from: ram
featurization:
# OGB: ['atomic_num', 'degree', 'possible_formal_charge', 'possible_numH' (total-valence),
# 'possible_number_radical_e', 'possible_is_aromatic', 'possible_is_in_ring',
1 change: 0 additions & 1 deletion expts/configs/config_mpnn_pcqm4m.yaml
@@ -31,7 +31,6 @@ datamodule:
featurization_progress: True
featurization_backend: "loky"
processed_graph_data_path: "graphium/data/PCQM4Mv2/"
dataloading_from: ram
featurization:
# OGB: ['atomic_num', 'degree', 'possible_formal_charge', 'possible_numH' (total-valence),
# 'possible_number_radical_e', 'possible_is_aromatic', 'possible_is_in_ring',
1 change: 0 additions & 1 deletion expts/hydra-configs/architecture/largemix.yaml
@@ -88,7 +88,6 @@ datamodule:
featurization_progress: True
featurization_backend: "loky"
processed_graph_data_path: ${constants.datacache_path}
dataloading_from: "disk"
num_workers: 20 # -1 to use all
persistent_workers: True
featurization:
2 changes: 1 addition & 1 deletion expts/hydra-configs/architecture/toymix.yaml
@@ -79,10 +79,10 @@ datamodule:
featurization_progress: True
featurization_backend: "loky"
processed_graph_data_path: ${constants.datacache_path}
dataloading_from: ram
num_workers: 30 # -1 to use all
persistent_workers: False
featurization:
use_graphium_cpp: True
atom_property_list_onehot: [atomic-number, group, period, total-valence]
atom_property_list_float: [degree, formal-charge, radical-electron, aromatic, in-ring]
edge_property_list: [bond-type-onehot, stereo, in-ring]
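For readers comparing configs, the new featurization block in this file maps onto a plain Python dict like the one below. This is an illustrative restatement of the YAML above, not code from the PR, and how the datamodule consumes it is unchanged by this file:

    featurization = {
        "use_graphium_cpp": True,
        "atom_property_list_onehot": ["atomic-number", "group", "period", "total-valence"],
        "atom_property_list_float": ["degree", "formal-charge", "radical-electron", "aromatic", "in-ring"],
        "edge_property_list": ["bond-type-onehot", "stereo", "in-ring"],
    }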
1 change: 0 additions & 1 deletion expts/hydra-configs/finetuning/admet_baseline.yaml
@@ -20,7 +20,6 @@ constants:
datamodule:
args:
batch_size_training: 32
dataloading_from: ram
persistent_workers: true
num_workers: 4

1 change: 0 additions & 1 deletion expts/neurips2023_configs/base_config/large.yaml
@@ -137,7 +137,6 @@ datamodule:
featurization_n_jobs: 30
featurization_progress: True
featurization_backend: "loky"
dataloading_from: disk
processed_graph_data_path: ${constants.datacache_path}
featurization:
# OGB: ['atomic_num', 'degree', 'possible_formal_charge', 'possible_numH' (total-valence),
1 change: 0 additions & 1 deletion expts/neurips2023_configs/base_config/large_pcba.yaml
@@ -136,7 +136,6 @@ datamodule:
featurization_n_jobs: 30
featurization_progress: True
featurization_backend: "loky"
dataloading_from: disk
processed_graph_data_path: ${constants.datacache_path}
featurization:
# OGB: ['atomic_num', 'degree', 'possible_formal_charge', 'possible_numH' (total-valence),
@@ -136,7 +136,6 @@ datamodule:
featurization_n_jobs: 30
featurization_progress: True
featurization_backend: "loky"
dataloading_from: disk
processed_graph_data_path: ${constants.datacache_path}
featurization:
# OGB: ['atomic_num', 'degree', 'possible_formal_charge', 'possible_numH' (total-valence),
1 change: 0 additions & 1 deletion expts/neurips2023_configs/base_config/large_pcqm_n4.yaml
@@ -136,7 +136,6 @@ datamodule:
featurization_n_jobs: 30
featurization_progress: True
featurization_backend: "loky"
dataloading_from: disk
processed_graph_data_path: ${constants.datacache_path}
featurization:
# OGB: ['atomic_num', 'degree', 'possible_formal_charge', 'possible_numH' (total-valence),
6 changes: 0 additions & 6 deletions graphium/config/_loader.py
@@ -203,8 +203,6 @@ def load_architecture(
architecture: The datamodule used to process and load the data
"""

if isinstance(config, dict) and "finetuning" not in config:
config = omegaconf.OmegaConf.create(config)
cfg_arch = config["architecture"]

# Select the architecture
@@ -262,10 +260,6 @@ def load_architecture(
else:
gnn_kwargs.setdefault("in_dim", edge_in_dim)

# Set the parameters for the full network
if "finetuning" not in config:
task_heads_kwargs = omegaconf.OmegaConf.to_object(task_heads_kwargs)

# Set all the input arguments for the model
model_kwargs = dict(
gnn_kwargs=gnn_kwargs,
88 changes: 51 additions & 37 deletions graphium/data/collate.py
@@ -1,12 +1,12 @@
"""
--------------------------------------------------------------------------------
Copyright (c) 2023 Valence Labs, Recursion Pharmaceuticals and Graphcore.
Copyright (c) 2023 Valence Labs, Recursion Pharmaceuticals, Graphcore, and NVIDIA Corporation & Affiliates.

Use of this software is subject to the terms and conditions outlined in the LICENSE file.
Unauthorized modification, distribution, or use is prohibited. Provided 'as is' without
warranties of any kind.

Valence Labs, Recursion Pharmaceuticals and Graphcore are not liable for any damages arising from its use.
Valence Labs, Recursion Pharmaceuticals, Graphcore, and NVIDIA Corporation & Affiliates are not liable for any damages arising from its use.
Refer to the LICENSE file for the full terms and conditions.
--------------------------------------------------------------------------------
"""
@@ -26,11 +26,11 @@
from graphium.utils.packing import fast_packing, get_pack_sizes, node_to_pack_indices_mask
from loguru import logger
from graphium.data.utils import get_keys

from graphium.data.dataset import torch_enum_to_dtype

def graphium_collate_fn(
elements: Union[List[Any], Dict[str, List[Any]]],
labels_size_dict: Optional[Dict[str, Any]] = None,
labels_num_cols_dict: Optional[Dict[str, Any]] = None,
labels_dtype_dict: Optional[Dict[str, Any]] = None,
mask_nan: Union[str, float, Type[None]] = "raise",
do_not_collate_keys: List[str] = [],
@@ -52,7 +52,7 @@ def graphium_collate_fn(
elements:
The elements to batch. See `torch.utils.data.dataloader.default_collate`.

labels_size_dict:
labels_num_cols_dict:
(Note): This is an attribute of the `MultitaskDataset`.
A dictionary of the form Dict[tasks, sizes] which has task names as keys
and the size of the label tensor as value. The size of the tensor corresponds to how many
@@ -86,14 +86,26 @@
The batched elements. See `torch.utils.data.dataloader.default_collate`.
"""

# Skip any elements that failed
if None in elements:
elements = [e for e in elements if e is not None]

elem = elements[0]
if isinstance(elem, Mapping):
batch = {}
for key in elem:
# Multitask setting: We have to pad the missing labels
if key == "labels":
labels = [d[key] for d in elements]
batch[key] = collate_labels(labels, labels_size_dict, labels_dtype_dict)
if "features" in elem:
num_nodes = [d["features"].num_nodes for d in elements]
num_edges = [d["features"].num_edges for d in elements]
else:
num_nodes = [d["num_nodes"] for d in elements]
num_edges = [d["num_edges"] for d in elements]
Review thread on this block:

Collaborator (Author): I think this can be moved outside of the `for key in elem:` loop.

Collaborator: Done. Because we know that the only possible keys in MultitaskDataset are now "labels", "features", "num_nodes", and "num_edges", is it safe to remove support for other keys and just access these 4 directly, instead of looping over the keys of one item?

Collaborator (Author): I think it's better to keep support for any key, in case a user wants to use other custom keys. Also, aren't some positional encodings using other keys?

Collaborator: All feature data is under "features" and all label data is under "labels". MultitaskDataset::__getitem__ does:

        if self.mol_file_data_offsets is None:
            datum = { "features": self.featurize_smiles(smiles_str) }
        else:
            datum = {
                "labels": self.load_graph_from_index(idx),
                "features": self.featurize_smiles(smiles_str),
            }

and returns datum after an error check.

Collaborator (Author): Ah yeah, you're right. Still, I think collation should be flexible, and fall back to default_collate for anything that isn't features or labels.

batch[key] = collate_labels(labels, labels_num_cols_dict, labels_dtype_dict, num_nodes, num_edges)
elif key == "num_nodes" or key == "num_edges":
continue

# If the features are a dictionary containing GraphDict elements,
# Convert to pyg graphs and use the pyg batching.
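The hunk above renames the collation arguments and derives per-graph node/edge counts before delegating to collate_labels. As a reviewer-style illustration (not part of the diff), here is a minimal sketch of wiring the renamed arguments into a torch DataLoader; train_ds stands in for a MultitaskDataset, and accessing the two dicts as attributes of it is an assumption based on the docstring above, not an API documented in this PR:

    from functools import partial
    from torch.utils.data import DataLoader
    from graphium.data.collate import graphium_collate_fn

    def make_loader(train_ds, batch_size: int = 16) -> DataLoader:
        # labels_num_cols_dict / labels_dtype_dict come from the MultitaskDataset,
        # as noted in the docstring above (assumed attribute names).
        collate_fn = partial(
            graphium_collate_fn,
            labels_num_cols_dict=train_ds.labels_num_cols_dict,
            labels_dtype_dict=train_ds.labels_dtype_dict,
            mask_nan="raise",
        )
        # Keys other than "labels", "features", "num_nodes" and "num_edges" still fall
        # through to torch's default collation, per the review thread above.
        return DataLoader(train_ds, batch_size=batch_size, collate_fn=collate_fn)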
Expand Down Expand Up @@ -182,23 +194,21 @@ def collage_pyg_graph(pyg_graphs: Iterable[Union[Data, Dict]], batch_size_per_pa
return Batch.from_data_list(pyg_batch)


def pad_to_expected_label_size(labels: torch.Tensor, label_size: List[int]):
def pad_to_expected_label_size(labels: torch.Tensor, label_rows: int, label_cols: int):
"""Determine difference of ``labels`` shape to expected shape `label_size` and pad
with ``torch.nan`` accordingly.
"""
if label_size == list(labels.shape):
if len(labels.shape) == 2 and label_rows == labels.shape[0] and label_cols == labels.shape[1]:
return labels

missing_dims = len(label_size) - len(labels.shape)
missing_dims = 2 - len(labels.shape)
for _ in range(missing_dims):
labels.unsqueeze(-1)

pad_sizes = [(0, expected - actual) for expected, actual in zip(label_size, labels.shape)]
pad_sizes = [item for before_after in pad_sizes for item in before_after]
pad_sizes.reverse()
pad_sizes = [label_cols - labels.shape[1], 0, label_rows - labels.shape[0], 0]

if any([s < 0 for s in pad_sizes]):
logger.warning(f"More labels available than expected. Will remove data to fit expected size.")
logger.warning(f"More labels available than expected. Will remove data to fit expected size. cols: {labels.shape[1]}->{label_cols}, rows: {labels.shape[0]}->{label_rows}")

return torch.nn.functional.pad(labels, pad_sizes, value=torch.nan)
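For concreteness, a small worked example (not in the diff) of the new pad_sizes ordering used by pad_to_expected_label_size above, growing a (2, 1) label tensor to label_rows=3, label_cols=2 with NaN fill:

    import torch

    labels = torch.tensor([[1.0], [2.0]])
    label_rows, label_cols = 3, 2
    # Same ordering as in the function above; note that F.pad with this ordering
    # prepends the padding (left of the columns, top of the rows).
    pad_sizes = [label_cols - labels.shape[1], 0, label_rows - labels.shape[0], 0]
    padded = torch.nn.functional.pad(labels, pad_sizes, value=torch.nan)
    assert padded.shape == (3, 2)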

@@ -226,31 +236,41 @@ def collate_pyg_graph_labels(pyg_labels: List[Data]):
return Batch.from_data_list(pyg_batch)


def get_expected_label_size(label_data: Data, task: str, label_size: List[int]):
def get_expected_label_rows(
label_data: Data,
task: str,
num_nodes: int,
num_edges: int
):
"""Determines expected label size based on the specfic graph properties
and the number of targets in the task-dataset.
"""
if task.startswith("graph_"):
num_labels = 1
elif task.startswith("node_"):
num_labels = label_data.x.size(0)
num_labels = num_nodes
elif task.startswith("edge_"):
num_labels = label_data.edge_index.size(1)
num_labels = num_edges
elif task.startswith("nodepair_"):
raise NotImplementedError()
return [num_labels] + label_size
else:
print("Task name "+task+" in get_expected_label_rows")
raise NotImplementedError()
return num_labels


def collate_labels(
labels: List[Data],
labels_size_dict: Optional[Dict[str, Any]] = None,
labels_num_cols_dict: Optional[Dict[str, Any]] = None,
labels_dtype_dict: Optional[Dict[str, Any]] = None,
num_nodes: List[int] = None,
num_edges: List[int] = None
):
"""Collate labels for multitask learning.

Parameters:
labels: List of labels
labels_size_dict: Dict of the form Dict[tasks, sizes] which has task names as keys
labels_num_cols_dict: Dict of the form Dict[tasks, sizes] which has task names as keys
and the size of the label tensor as value. The size of the tensor corresponds to how many
labels/values there are to predict for that task.
labels_dtype_dict:
@@ -260,45 +280,39 @@ def collate_labels(

Returns:
A dictionary of the form Dict[tasks, labels] where tasks is the name of the task and labels
is a tensor of shape (batch_size, *labels_size_dict[task]).
is a tensor of shape (batch_size, *labels_num_cols_dict[task]).
"""
if labels_size_dict is not None:
for this_label in labels:
for task in labels_size_dict.keys():
labels_size_dict[task] = list(labels_size_dict[task])
if len(labels_size_dict[task]) >= 2:
labels_size_dict[task] = labels_size_dict[task][1:]
elif not task.startswith("graph_"):
labels_size_dict[task] = [1]
if labels_num_cols_dict is not None:
for index, this_label in enumerate(labels):
label_keys_set = set(get_keys(this_label))
empty_task_labels = set(labels_size_dict.keys()) - label_keys_set
empty_task_labels = set(labels_num_cols_dict.keys()) - label_keys_set
for task in empty_task_labels:
labels_size_dict[task] = get_expected_label_size(this_label, task, labels_size_dict[task])
dtype = labels_dtype_dict[task]
this_label[task] = torch.full([*labels_size_dict[task]], torch.nan, dtype=dtype)
label_rows = get_expected_label_rows(this_label, task, num_nodes[index], num_edges[index])
dtype = torch_enum_to_dtype(labels_dtype_dict[task])
this_label[task] = torch.full((label_rows, labels_num_cols_dict[task]), fill_value=torch.nan, dtype=dtype)

for task in label_keys_set - set(["x", "edge_index"]) - empty_task_labels:
labels_size_dict[task] = get_expected_label_size(this_label, task, labels_size_dict[task])
label_rows = get_expected_label_rows(this_label, task, num_nodes[index], num_edges[index])

if not isinstance(this_label[task], (torch.Tensor)):
this_label[task] = torch.as_tensor(this_label[task])

# Ensure explicit task dimension also for single task labels
if len(this_label[task].shape) == 1:
# Distinguish whether target dim or entity dim is missing
if labels_size_dict[task][0] == this_label[task].shape[0]:
if label_rows == this_label[task].shape[0]:
# num graphs/nodes/edges/nodepairs already matching
this_label[task] = this_label[task].unsqueeze(1)
else:
# data lost unless entity dim is supposed to be 1
if labels_size_dict[task][0] == 1:
if label_rows == 1:
this_label[task] = this_label[task].unsqueeze(0)
else:
raise ValueError(
f"Labels for {labels_size_dict[task][0]} nodes/edges/nodepairs expected, got 1."
f"Labels for {label_rows} nodes/edges/nodepairs expected, got 1."
)

this_label[task] = pad_to_expected_label_size(this_label[task], labels_size_dict[task])
this_label[task] = pad_to_expected_label_size(this_label[task], label_rows, labels_num_cols_dict[task])

return collate_pyg_graph_labels(labels)

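Finally, a sketch (not taken from the diff) of the placeholder shapes collate_labels now creates when a molecule has no labels for a task: one NaN row for "graph_*" tasks, one row per node for "node_*" tasks, and one row per edge for "edge_*" tasks, each with labels_num_cols_dict[task] columns. The task name used below is hypothetical:

    import torch

    def missing_label_placeholder(task: str, num_cols: int, num_nodes: int, num_edges: int) -> torch.Tensor:
        # Mirrors get_expected_label_rows plus the torch.full call in collate_labels above.
        if task.startswith("graph_"):
            rows = 1
        elif task.startswith("node_"):
            rows = num_nodes
        elif task.startswith("edge_"):
            rows = num_edges
        else:
            raise NotImplementedError(task)
        return torch.full((rows, num_cols), torch.nan)

    # A graph with 5 nodes and 8 edges, missing labels for a hypothetical "node_charge" task:
    print(missing_label_placeholder("node_charge", num_cols=1, num_nodes=5, num_edges=8).shape)
    # torch.Size([5, 1])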