Merge pull request #432 from datamol-io/caching
Caching logic improvement
DomInvivo authored Aug 18, 2023
2 parents beaf954 + cc91bfa commit 4adaaf7
Showing 43 changed files with 361 additions and 220 deletions.
17 changes: 17 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -97,6 +97,23 @@ graphium-train --config-path [PATH] --config-name [CONFIG]
```
Thanks to the modular nature of `hydra` you can reuse many of our config settings for your own experiments with Graphium.

## Preparing the data in advance
Data preparation, including featurization (e.g., converting molecules from SMILES to a PyG-compatible format), is embedded in the pipeline and is performed when executing `graphium-train [...]`.

However, when working with larger datasets, it is recommended to prepare the data in advance on a machine with sufficient memory (e.g., ~400GB in the case of `LargeMix`). Preparing data in advance is also beneficial when running many concurrent jobs with identical molecular featurization, so that resources aren't wasted and processes don't conflict when reading/writing to the same directory.

The following commands first prepare the data and cache it, then use the cache to train a model.
```bash
# First prepare the data and cache it in `path_to_cached_data`
graphium-prepare-data datamodule.args.processed_graph_data_path=[path_to_cached_data]

# Then train the model on the prepared data
graphium-train [...] datamodule.args.processed_graph_data_path=[path_to_cached_data]
```

**Note** that `datamodule.args.processed_graph_data_path` can also be specified in the configs under `expts/hydra-configs/`.

**Note** that whenever the `datamodule.args.featurization` config changes, you will need to re-run the data preparation. The new cache is automatically saved in a separate directory named with a hash unique to the configs.
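To illustrate the idea behind the hash-keyed cache, here is a minimal sketch. The `config_cache_dir` helper is hypothetical and for illustration only; Graphium's actual hashing logic is internal and may differ.

```python
import hashlib
import json

def config_cache_dir(base_path: str, featurization_config: dict) -> str:
    """Map a featurization config to a cache subdirectory (illustrative only).

    Identical configs always map to the same directory; any change to the
    config yields a new directory, so stale caches are never reused.
    """
    # Serialize deterministically so key order does not affect the hash.
    serialized = json.dumps(featurization_config, sort_keys=True)
    digest = hashlib.sha256(serialized.encode("utf-8")).hexdigest()[:16]
    return f"{base_path}/{digest}"

# Same config -> same cache directory; changed config -> new directory.
base = "../datacache/example"
cfg_a = {"atom_property_list_onehot": ["atomic-number"], "add_self_loop": False}
cfg_b = {"atom_property_list_onehot": ["atomic-number"], "add_self_loop": True}
print(config_cache_dir(base, cfg_a) == config_cache_dir(base, cfg_a))  # True
print(config_cache_dir(base, cfg_a) == config_cache_dir(base, cfg_b))  # False
```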

## License

22 changes: 11 additions & 11 deletions docs/tutorials/feature_processing/choosing_parallelization.ipynb
@@ -14,7 +14,7 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 3,
"id": "b5df2ac6-2ded-4597-a445-f2b5fb106330",
"metadata": {
"tags": []
@@ -24,8 +24,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
"INFO: Pandarallel will run on 240 workers.\n",
"INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.\n"
"The autoreload extension is already loaded. To reload it, use:\n",
" %reload_ext autoreload\n"
]
}
],
@@ -39,9 +39,9 @@
"import datamol as dm\n",
"import pandas as pd\n",
"\n",
"from pandarallel import pandarallel\n",
"# from pandarallel import pandarallel\n",
"\n",
"pandarallel.initialize(progress_bar=True, nb_workers=joblib.cpu_count())"
"# pandarallel.initialize(progress_bar=True, nb_workers=joblib.cpu_count())"
]
},
{
@@ -54,7 +54,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 4,
"id": "0f31e18d-bdd9-4d9b-8ba5-81e5887b857e",
"metadata": {
"tags": []
@@ -70,7 +70,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 7,
"id": "a1197c31-7dbc-4fd7-a69a-5215e1a96b8e",
"metadata": {
"tags": []
@@ -109,7 +109,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 10,
"id": "2f8ce5c3-4232-4279-8ea3-7a74832303be",
"metadata": {
"tags": []
@@ -129,7 +129,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 11,
"id": "a246cdcf-b5ea-4c9e-9ccc-dd3c544587bb",
"metadata": {
"tags": []
@@ -138,7 +138,7 @@
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "3e939cd3a24742038b804bbfd961377d",
"model_id": "cc396220c7144c8d8b195fb87694bbfe",
"version_major": 2,
"version_minor": 0
},
@@ -489,7 +489,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
"version": "3.10.12"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
1 change: 0 additions & 1 deletion expts/configs/config_gps_10M_pcqm4m.yaml
@@ -112,7 +112,6 @@ datamodule:
pos_type: rw_return_probs
ksteps: 16

# cache_data_path: .
num_workers: 0 # -1 to use all
persistent_workers: False # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
1 change: 0 additions & 1 deletion expts/configs/config_gps_10M_pcqm4m_mod.yaml
@@ -81,7 +81,6 @@ datamodule:
# Data handling-related
batch_size_training: 64
batch_size_inference: 16
# cache_data_path: .
num_workers: 0 # -1 to use all
persistent_workers: False # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
2 changes: 1 addition & 1 deletion expts/configs/config_mpnn_10M_b3lyp.yaml
@@ -93,6 +93,7 @@ datamodule:
featurization_progress: True
featurization_backend: "loky"
processed_graph_data_path: "../datacache/b3lyp/"
dataloading_from: ram
featurization:
# OGB: ['atomic_num', 'degree', 'possible_formal_charge', 'possible_numH' (total-valence),
# 'possible_number_radical_e', 'possible_is_aromatic', 'possible_is_in_ring',
@@ -123,7 +124,6 @@ datamodule:
pos_type: rw_return_probs
ksteps: 16

# cache_data_path: .
num_workers: 0 # -1 to use all
persistent_workers: False # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
3 changes: 1 addition & 2 deletions expts/configs/config_mpnn_pcqm4m.yaml
@@ -30,8 +30,8 @@ datamodule:
featurization_n_jobs: 20
featurization_progress: True
featurization_backend: "loky"
cache_data_path: "./datacache"
processed_graph_data_path: "graphium/data/PCQM4Mv2/"
dataloading_from: ram
featurization:
# OGB: ['atomic_num', 'degree', 'possible_formal_charge', 'possible_numH' (total-valence),
# 'possible_number_radical_e', 'possible_is_aromatic', 'possible_is_in_ring',
@@ -58,7 +58,6 @@
# Data handling-related
batch_size_training: 64
batch_size_inference: 16
# cache_data_path: .
num_workers: 40 # -1 to use all
persistent_workers: False # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
1 change: 1 addition & 0 deletions expts/hydra-configs/architecture/toymix.yaml
@@ -79,6 +79,7 @@ datamodule:
featurization_progress: True
featurization_backend: "loky"
processed_graph_data_path: "../datacache/neurips2023-small/"
dataloading_from: ram
num_workers: 30 # -1 to use all
persistent_workers: False
featurization:
1 change: 0 additions & 1 deletion expts/neurips2023_configs/base_config/large.yaml
@@ -168,7 +168,6 @@ datamodule:
pos_type: rw_return_probs
ksteps: 16

# cache_data_path: .
num_workers: 32 # -1 to use all
persistent_workers: True # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
1 change: 0 additions & 1 deletion expts/neurips2023_configs/base_config/small.yaml
@@ -132,7 +132,6 @@ datamodule:
pos_type: rw_return_probs
ksteps: 16

# cache_data_path: .
num_workers: 30 # -1 to use all
persistent_workers: False # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
Original file line number Diff line number Diff line change
@@ -131,7 +131,6 @@ datamodule:
pos_type: rw_return_probs
ksteps: 16

# cache_data_path: .
num_workers: 30 # -1 to use all
persistent_workers: False # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
Original file line number Diff line number Diff line change
@@ -111,7 +111,6 @@ datamodule:
pos_type: rw_return_probs
ksteps: 16

# cache_data_path: .
num_workers: 5 # -1 to use all
persistent_workers: False # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
1 change: 0 additions & 1 deletion expts/neurips2023_configs/config_luis_jama.yaml
@@ -119,7 +119,6 @@ datamodule:
pos_type: rw_return_probs
ksteps: 16

# cache_data_path: .
num_workers: 4 # -1 to use all
persistent_workers: False # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
1 change: 0 additions & 1 deletion expts/neurips2023_configs/debug/config_debug.yaml
@@ -105,7 +105,6 @@ datamodule:
pos_type: rw_return_probs
ksteps: 16

# cache_data_path: .
num_workers: 0 # -1 to use all
persistent_workers: False # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
3 changes: 1 addition & 2 deletions expts/neurips2023_configs/debug/config_large_gcn_debug.yaml
@@ -166,7 +166,6 @@ datamodule:
pos_type: rw_return_probs
ksteps: 16

# cache_data_path: .
num_workers: 30 # -1 to use all
persistent_workers: False # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
@@ -327,7 +326,7 @@ predictor:
l1000_mcf7: []
pcba_1328: []
pcqm4m_g25: []
pcqm4m_n4: []
pcqm4m_n4: []
loss_fun:
l1000_vcap:
name: hybrid_ce_ipu
Original file line number Diff line number Diff line change
@@ -119,7 +119,6 @@ datamodule:
pos_type: rw_return_probs
ksteps: 16

# cache_data_path: .
num_workers: 30 # -1 to use all
persistent_workers: False # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
Original file line number Diff line number Diff line change
@@ -100,7 +100,6 @@ datamodule:
pos_type: rw_return_probs
ksteps: 16

# cache_data_path: .
num_workers: 30 # -1 to use all
persistent_workers: False # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
Original file line number Diff line number Diff line change
@@ -100,7 +100,6 @@ datamodule:
pos_type: rw_return_probs
ksteps: 16

# cache_data_path: .
num_workers: 30 # -1 to use all
persistent_workers: False # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
Original file line number Diff line number Diff line change
@@ -100,7 +100,6 @@ datamodule:
pos_type: rw_return_probs
ksteps: 16

# cache_data_path: .
num_workers: 30 # -1 to use all
persistent_workers: False # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
Original file line number Diff line number Diff line change
@@ -103,7 +103,6 @@ datamodule:
pos_type: rw_return_probs
ksteps: 16

# cache_data_path: .
num_workers: 30 # -1 to use all
persistent_workers: False # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
Original file line number Diff line number Diff line change
@@ -100,7 +100,6 @@ datamodule:
pos_type: rw_return_probs
ksteps: 16

# cache_data_path: .
num_workers: 30 # -1 to use all
persistent_workers: False # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
Original file line number Diff line number Diff line change
@@ -104,7 +104,6 @@ datamodule:
pos_type: rw_return_probs
ksteps: 16

# cache_data_path: .
num_workers: 30 # -1 to use all
persistent_workers: False # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
Original file line number Diff line number Diff line change
@@ -100,7 +100,6 @@ datamodule:
pos_type: rw_return_probs
ksteps: 16

# cache_data_path: .
num_workers: 30 # -1 to use all
persistent_workers: False # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
Original file line number Diff line number Diff line change
@@ -118,7 +118,6 @@ datamodule:
pos_type: rw_return_probs
ksteps: 16

# cache_data_path: .
num_workers: 30 # -1 to use all
persistent_workers: False # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
Original file line number Diff line number Diff line change
@@ -100,7 +100,6 @@ datamodule:
pos_type: rw_return_probs
ksteps: 16

# cache_data_path: .
num_workers: 30 # -1 to use all
persistent_workers: False # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
Original file line number Diff line number Diff line change
@@ -103,7 +103,6 @@ datamodule:
pos_type: rw_return_probs
ksteps: 16

# cache_data_path: .
num_workers: 30 # -1 to use all
persistent_workers: False # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
Original file line number Diff line number Diff line change
@@ -100,7 +100,6 @@ datamodule:
pos_type: rw_return_probs
ksteps: 16

# cache_data_path: .
num_workers: 30 # -1 to use all
persistent_workers: False # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
Original file line number Diff line number Diff line change
@@ -104,7 +104,6 @@ datamodule:
pos_type: rw_return_probs
ksteps: 16

# cache_data_path: .
num_workers: 30 # -1 to use all
persistent_workers: False # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
Original file line number Diff line number Diff line change
@@ -100,7 +100,6 @@ datamodule:
pos_type: rw_return_probs
ksteps: 16

# cache_data_path: .
num_workers: 30 # -1 to use all
persistent_workers: False # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
Original file line number Diff line number Diff line change
@@ -118,7 +118,6 @@ datamodule:
pos_type: rw_return_probs
ksteps: 16

# cache_data_path: .
num_workers: 30 # -1 to use all
persistent_workers: False # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
Original file line number Diff line number Diff line change
@@ -100,7 +100,6 @@ datamodule:
pos_type: rw_return_probs
ksteps: 16

# cache_data_path: .
num_workers: 30 # -1 to use all
persistent_workers: False # if use persistent worker at the start of each epoch.
# Using persistent_workers false might make the start of each epoch very long.
