Various updates to graphium #449
Merged
Commits (showing changes from 27 of 31 commits)
99ffea5  Adding largemix configs
c59bd63  Further improving caching
290e3da  Computing val/test metrics on cpu to save gpu memory
ea9ff5d  Implementing testing for a model checkpoint
fd6e932  Adding single dataset configs for LargeMix to hydra
WenkelF 77a28a0  Adding script for test sweep
af948e1  scripts for sbatch
7b0fdf3  correction
17a92fe  Switching to V100
00cedf9  Train script
fe4ead7  Changing back to cudatoolkit in env.yml
d6f4ae0  Updating test sweep
2a4a129  Updating test sweep
9ba7cae  Adding single run scripts
b5d897e  Updating pdba runs
952a145  Updating pcba runs
38e7667  Changing split_names to test-seen
914cb08  Updating split_paths to test_seen
bd3784c  Updating test_splits to test_seen
2ed5f1e  Removing scripts
eb037de  Minor reorganization
cf88c3f  Enabling to resume training
5781dcb  Finalizing largemix and large single dataset configs for hydra
93d12be  Temporary removing code for resuming training in favor of dedicated pr
8a80c52  Cleaning up
67e4134  Reformatting with black
53d10aa  Merge branch 'main' into caching
DomInvivo da31797  Added the date-time in the model_checkpoint paths
DomInvivo 44d241f  Removing graphium/cli/test.py in favor of graphium/cli/train_finetune…
8f1ddfb  Minor fix
3cf2fb5  Updating get_checkpoint_path
Files changed

New file, +118 lines:
```yaml
# @package _global_

architecture:
  model_type: FullGraphMultiTaskNetwork
  mup_base_path: null
  pre_nn:   # Set as null to avoid a pre-nn network
    out_dim: 64
    hidden_dims: 256
    depth: 2
    activation: relu
    last_activation: none
    dropout: &dropout 0.1
    normalization: &normalization layer_norm
    last_normalization: *normalization
    residual_type: none

  pre_nn_edges: null

  pe_encoders:
    out_dim: 32
    pool: "sum"      # "mean" "max"
    last_norm: None  # "batch_norm", "layer_norm"
    encoders:  # la_pos | rw_pos
      la_pos:  # Set as null to avoid a pre-nn network
        encoder_type: "laplacian_pe"
        input_keys: ["laplacian_eigvec", "laplacian_eigval"]
        output_keys: ["feat"]
        hidden_dim: 64
        out_dim: 32
        model_type: 'DeepSet'  # 'Transformer' or 'DeepSet'
        num_layers: 2
        num_layers_post: 1  # Num. layers to apply after pooling
        dropout: 0.1
        first_normalization: "none"  # "batch_norm" or "layer_norm"
      rw_pos:
        encoder_type: "mlp"
        input_keys: ["rw_return_probs"]
        output_keys: ["feat"]
        hidden_dim: 64
        out_dim: 32
        num_layers: 2
        dropout: 0.1
        normalization: "layer_norm"        # "batch_norm" or "layer_norm"
        first_normalization: "layer_norm"  # "batch_norm" or "layer_norm"

  gnn:  # Set as null to avoid a post-nn network
    in_dim: 64  # or otherwise the correct value
    out_dim: &gnn_dim 768
    hidden_dims: *gnn_dim
    depth: 4
    activation: gelu
    last_activation: none
    dropout: 0.1
    normalization: "layer_norm"
    last_normalization: *normalization
    residual_type: simple
    virtual_node: 'none'

  graph_output_nn:
    graph:
      pooling: [sum]
      out_dim: *gnn_dim
      hidden_dims: *gnn_dim
      depth: 1
      activation: relu
      last_activation: none
      dropout: *dropout
      normalization: *normalization
      last_normalization: "none"
      residual_type: none
    node:
      pooling: [sum]
      out_dim: *gnn_dim
      hidden_dims: *gnn_dim
      depth: 1
      activation: relu
      last_activation: none
      dropout: *dropout
      normalization: *normalization
      last_normalization: "none"
      residual_type: none

datamodule:
  module_type: "MultitaskFromSmilesDataModule"
  args:
    prepare_dict_or_graph: pyg:graph
    featurization_n_jobs: 20
    featurization_progress: True
    featurization_backend: "loky"
    processed_graph_data_path: "../datacache/large-dataset/"
    dataloading_from: "disk"
    num_workers: 20  # -1 to use all
    persistent_workers: True
    featurization:
      atom_property_list_onehot: [atomic-number, group, period, total-valence]
      atom_property_list_float: [degree, formal-charge, radical-electron, aromatic, in-ring]
      edge_property_list: [bond-type-onehot, stereo, in-ring]
      add_self_loop: False
      explicit_H: False  # whether explicit hydrogens are included
      use_bonds_weights: False
      pos_encoding_as_features:  # encoder dropout 0.18
        pos_types:
          lap_eigvec:
            pos_level: node
            pos_type: laplacian_eigvec
            num_pos: 8
            normalization: "none"    # normalization already applied to the eigenvectors
            disconnected_comp: True  # whether eigenvalues/vectors of disconnected components are included
          lap_eigval:
            pos_level: node
            pos_type: laplacian_eigval
            num_pos: 8
            normalization: "none"    # normalization already applied to the eigenvectors
            disconnected_comp: True  # whether eigenvalues/vectors of disconnected components are included
          rw_pos:  # use same name as pe_encoder
            pos_level: node
            pos_type: rw_return_probs
            ksteps: 16
```
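Since the file begins with `# @package _global_`, Hydra merges its keys at the root of the composed config whenever the file is selected from a defaults list. A minimal sketch of how an experiment config could pull it in (the group and option names below are illustrative, not taken from this PR):

```yaml
# Hypothetical experiment config composing the architecture file above.
# Hydra resolves each entry to a YAML file in the matching config group.
defaults:
  - architecture: largemix   # assumed name for the file shown above
  - tasks: largemix          # task heads + loss/metrics/datamodule (see below)
  - _self_                   # let this file override the composed values
```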
New file, +26 lines:
```yaml
# @package _global_

architecture:
  pre_nn_edges:  # Set as null to avoid a pre-nn network
    out_dim: 32
    hidden_dims: 128
    depth: 2
    activation: relu
    last_activation: none
    dropout: ${architecture.pre_nn.dropout}
    normalization: ${architecture.pre_nn.normalization}
    last_normalization: ${architecture.pre_nn.normalization}
    residual_type: none

  gnn:
    out_dim: &gnn_dim 704
    hidden_dims: *gnn_dim
    layer_type: 'pyg:gine'

  graph_output_nn:
    graph:
      out_dim: *gnn_dim
      hidden_dims: *gnn_dim
    node:
      out_dim: *gnn_dim
      hidden_dims: *gnn_dim
```
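The `${architecture.pre_nn.dropout}` entries are OmegaConf interpolations: the value is looked up in another node of the composed config at resolution time, so the edge pre-network always inherits whatever dropout and normalization the node pre-network ends up with. A minimal standalone sketch of the mechanism (illustrative values, not part of the PR):

```yaml
# OmegaConf resolves ${...} references against the final merged config:
architecture:
  pre_nn:
    dropout: 0.1                              # single source of truth
  pre_nn_edges:
    dropout: ${architecture.pre_nn.dropout}   # reads 0.1 at access time
```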
New file, +7 lines:
```yaml
# NOTE: We cannot have a single config, since for fine-tuning we will
# only want to override the loss_metrics_datamodule, whereas for training
# we will want to override both.

defaults:
  - task_heads: l1000_mcf7
  - loss_metrics_datamodule: l1000_mcf7
```
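Given the split described in the comment, a fine-tuning experiment would keep the pretrained `task_heads` entry and re-point only the data side. A hypothetical sketch (the downstream option name is invented for illustration):

```yaml
# Hypothetical fine-tuning variant of the task config above:
defaults:
  - task_heads: l1000_mcf7                      # unchanged from pretraining
  - loss_metrics_datamodule: my_downstream_set  # hypothetical downstream data/loss config
```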
New file, +7 lines:
```yaml
# NOTE: We cannot have a single config, since for fine-tuning we will
# only want to override the loss_metrics_datamodule, whereas for training
# we will want to override both.

defaults:
  - task_heads: l1000_vcap
  - loss_metrics_datamodule: l1000_vcap
```
New file, +7 lines:
```yaml
# NOTE: We cannot have a single config, since for fine-tuning we will
# only want to override the loss_metrics_datamodule, whereas for training
# we will want to override both.

defaults:
  - task_heads: largemix
  - loss_metrics_datamodule: largemix
```
expts/hydra-configs/tasks/loss_metrics_datamodule/l1000_mcf7.yaml (new file, +49 lines):
```yaml
# @package _global_

predictor:
  metrics_on_progress_bar:
    l1000_mcf7: []
  metrics_on_training_set:
    l1000_mcf7: []
  loss_fun:
    l1000_mcf7:
      name: hybrid_ce_ipu
      n_brackets: 3
      alpha: 0.5

metrics:
  l1000_mcf7:
    - name: auroc
      metric: auroc
      num_classes: 3
      task: multiclass
      target_to_int: True
      target_nan_mask: -1000
      ignore_index: -1000
      multitask_handling: mean-per-label
      threshold_kwargs: null
    - name: avpr
      metric: averageprecision
      num_classes: 3
      task: multiclass
      target_to_int: True
      target_nan_mask: -1000
      ignore_index: -1000
      multitask_handling: mean-per-label
      threshold_kwargs: null

datamodule:
  args:  # Matches that in the test_multitask_datamodule.py case.
    task_specific_args:  # To be replaced by a new class "DatasetParams"
      l1000_mcf7:
        df: null
        df_path: ../data/graphium/large-dataset/LINCS_L1000_MCF7_0-2_th2.csv.gz
        # wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Large-dataset/LINCS_L1000_MCF7_0-4.csv.gz
        # or set the path as the URL directly
        smiles_col: "SMILES"
        label_cols: geneID-*  # geneID-* means all columns starting with "geneID-"
        # sample_size: 2000   # use sample_size for a quick test
        task_level: graph
        splits_path: ../data/graphium/large-dataset/l1000_mcf7_random_splits.pt
        # Download with `wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Large-dataset/l1000_mcf7_random_splits.pt`
        # split_names: [train, val, test_seen]
        epoch_sampling_fraction: 1.0
```
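For a quick pipeline check, the commented-out `sample_size` can be enabled to subsample the dataset. A hedged sketch of such an override, assuming it is merged on top of the file above (not part of this PR):

```yaml
# Hypothetical smoke-test override: re-declare only the field to change.
datamodule:
  args:
    task_specific_args:
      l1000_mcf7:
        sample_size: 2000   # subsample rows for a fast end-to-end test
```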
expts/hydra-configs/tasks/loss_metrics_datamodule/l1000_vcap.yaml (new file, +49 lines):
```yaml
# @package _global_

predictor:
  metrics_on_progress_bar:
    l1000_vcap: []
  metrics_on_training_set:
    l1000_vcap: []
  loss_fun:
    l1000_vcap:
      name: hybrid_ce_ipu
      n_brackets: 3
      alpha: 0.5

metrics:
  l1000_vcap:
    - name: auroc
      metric: auroc
      num_classes: 3
      task: multiclass
      target_to_int: True
      target_nan_mask: -1000
      ignore_index: -1000
      multitask_handling: mean-per-label
      threshold_kwargs: null
    - name: avpr
      metric: averageprecision
      num_classes: 3
      task: multiclass
      target_to_int: True
      target_nan_mask: -1000
      ignore_index: -1000
      multitask_handling: mean-per-label
      threshold_kwargs: null

datamodule:
  args:  # Matches that in the test_multitask_datamodule.py case.
    task_specific_args:  # To be replaced by a new class "DatasetParams"
      l1000_vcap:
        df: null
        df_path: ../data/graphium/large-dataset/LINCS_L1000_VCAP_0-2_th2.csv.gz
        # wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Large-dataset/LINCS_L1000_VCAP_0-4.csv.gz
        # or set the path as the URL directly
        smiles_col: "SMILES"
        label_cols: geneID-*  # geneID-* means all columns starting with "geneID-"
        # sample_size: 2000   # use sample_size for a quick test
        task_level: graph
        splits_path: ../data/graphium/large-dataset/l1000_vcap_random_splits.pt
        # Download with `wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Large-dataset/l1000_vcap_random_splits.pt`
        # split_names: [train, val, test_seen]
        epoch_sampling_fraction: 1.0
```
Review discussion
All those YAML files seem like duplicates of the files in the neurips2023 folder. What do you think? Is there a reason to have them separately?
I curated the configs to make the LargeMix configs easy to use within our hydra logic. In particular, the modularity of the added files will make it easier to fine-tune models pretrained on LargeMix. As far as I can see, the neurips2023 configs can only be used for training on LargeMix; one would need to write new configs for, e.g., fine-tuning or data preparation.