Merge pull request #441 from cwognum/main

WIP: Various improvements to the CLI endpoints
datamol-io · Aug 29, 2023 · 3aa098b
2 parents 0fdd458 + 3c9e327
commit 3aa098b

Showing 18 changed files with 671 additions and 107 deletions.
diff --git a/.github/workflows/doc.yml b/.github/workflows/doc.yml
@@ -31,7 +31,9 @@ jobs:
           cache-downloads: true
 
       - name: Install library
-        run: python -m pip install --no-deps .
+        run: |
+          python -m pip install --no-deps .
+          pip install typer-cli
 
       - name: Configure git
         run: |
@@ -40,6 +42,10 @@ jobs:
 
       - name: Deploy the doc
         run: |
+
+          echo "Auto-generating typer docs"
+          typer graphium.cli.__main__ utils docs --name graphium --output docs/cli/graphium.md
+          
           echo "Get the gh-pages branch"
           git fetch origin gh-pages
 

diff --git a/README.md b/README.md
@@ -65,55 +65,9 @@ The above step needs to be done once. After that, enable the SDK and the environ
 source enable_ipu.sh .graphium_ipu
 ```
 
-## Training a model
+## The Graphium CLI
 
-To learn how to train a model, we invite you to look at the documentation, or the jupyter notebooks available [here](https://github.com/datamol-io/graphium/tree/master/docs/tutorials/model_training).
-
-If you are not familiar with [PyTorch](https://pytorch.org/docs) or [PyTorch-Lightning](https://pytorch-lightning.readthedocs.io/en/latest/), we highly recommend going through their tutorial first.
-
-## Running an experiment
-We have setup Graphium with `hydra` for managing config files. To run an experiment go to the `expts/` folder. For example, to benchmark a GCN on the ToyMix dataset run
-```bash
-graphium-train dataset=toymix model=gcn
-```
-To change parameters specific to this experiment like switching from `fp16` to `fp32` precision, you can either override them directly in the CLI via
-```bash
-graphium-train dataset=toymix model=gcn trainer.trainer.precision=32
-```
-or change them permamently in the dedicated experiment config under `expts/hydra-configs/toymix_gcn.yaml`.
-Integrating `hydra` also allows you to quickly switch between accelerators. E.g., running
-```bash
-graphium-train dataset=toymix model=gcn accelerator=gpu
-```
-automatically selects the correct configs to run the experiment on GPU.
-Finally, you can also run a fine-tuning loop: 
-```bash
-graphium-train +finetuning=admet
-```
-
-To use a config file you built from scratch you can run
-```bash
-graphium-train --config-path [PATH] --config-name [CONFIG]
-```
-Thanks to the modular nature of `hydra` you can reuse many of our config settings for your own experiments with Graphium.
-
-## Preparing the data in advance
-The data preparation including the featurization (e.g., of molecules from smiles to pyg-compatible format) is embedded in the pipeline and will be performed when executing `graphium-train [...]`.
-
-However, when working with larger datasets, it is recommended to perform data preparation in advance using a machine with sufficient allocated memory (e.g., ~400GB in the case of `LargeMix`). Preparing data in advance is also beneficial when running lots of concurrent jobs with identical molecular featurization, so that resources aren't wasted and processes don't conflict reading/writing in the same directory.
-
-The following command-line will prepare the data and cache it, then use it to train a model.
-```bash
-# First prepare the data and cache it in `path_to_cached_data`
-graphium data prepare ++datamodule.args.processed_graph_data_path=[path_to_cached_data]
-
-# Then train the model on the prepared data
-graphium-train [...] datamodule.args.processed_graph_data_path=[path_to_cached_data]
-```
-
-**Note** that `datamodule.args.processed_graph_data_path` can also be specified at `expts/hydra_configs/`.
-
-**Note** that, every time the configs of `datamodule.args.featurization` changes, you will need to run a new data preparation, which will automatically be saved in a separate directory that uses a hash unique to the configs.
+Installing `graphium` makes two CLI tools available: `graphium` and `graphium-train`. These CLI tools make it easy to access advanced functionality, such as _training a model_, _extracting fingerprints from a pre-trained model_ or _precomputing the dataset_. For more information, visit [the documentation](https://graphium-docs.datamol.io/stable/cli/reference.html).
 
 ## License
 

diff --git a/docs/api/graphium.finetuning.md b/docs/api/graphium.finetuning.md
@@ -0,0 +1,8 @@
+::: graphium.finetuning.fingerprinting.Fingerprinter
+    options: 
+        filters: ["!^_"]
+        separate_signature: true
+        show_signature_annotations: true
+        line_length: 80
+        merge_init_into_class: true 
+        members_order: source
diff --git a/docs/cli/graphium-train.md b/docs/cli/graphium-train.md
@@ -0,0 +1,54 @@
+# `graphium-train`
+
+To support advanced configuration, Graphium uses [`hydra`](https://hydra.cc/) to manage and write config files. A limitation of `hydra` is that it is designed to function as the main entrypoint for a CLI application and does not easily support subcommands. For that reason, we introduced the `graphium-train` command in addition to the [`graphium`](./graphium.md) command.
+
+!!! info "Curious about the configs?"
+    If you would like to learn more about the configs, please visit the docs [here](https://github.com/datamol-io/graphium/tree/main/expts/hydra-configs).
+
+This page documents `graphium-train`.
+
+## Running an experiment
+To run an experiment, see the `expts/hydra-configs` folder for all available configurations. For example, to benchmark a GCN on the ToyMix dataset, run
+```bash
+graphium-train dataset=toymix model=gcn
+```
+To change parameters specific to this experiment, such as switching from `fp16` to `fp32` precision, you can either override them directly in the CLI via
+```bash
+graphium-train dataset=toymix model=gcn trainer.trainer.precision=32
+```
+or change them permanently in the dedicated experiment config under `expts/hydra-configs/toymix_gcn.yaml`.
+Integrating `hydra` also allows you to quickly switch between accelerators. E.g., running
+```bash
+graphium-train dataset=toymix model=gcn accelerator=gpu
+```
+automatically selects the correct configs to run the experiment on GPU.
+Finally, you can also run a fine-tuning loop: 
+```bash
+graphium-train +finetuning=admet
+```
+
+To use a config file you built from scratch you can run
+```bash
+graphium-train --config-path [PATH] --config-name [CONFIG]
+```
+Thanks to the modular nature of `hydra` you can reuse many of our config settings for your own experiments with Graphium.
+
+### Preparing the data in advance
+Data preparation, including featurization (e.g., converting molecules from SMILES to a PyG-compatible format), is embedded in the pipeline and is performed when executing `graphium-train [...]`.
+
+However, when working with larger datasets, it is recommended to perform data preparation in advance using a machine with sufficient allocated memory (e.g., ~400GB in the case of `LargeMix`). Preparing data in advance is also beneficial when running lots of concurrent jobs with identical molecular featurization, so that resources aren't wasted and processes don't conflict reading/writing in the same directory.
+
+The following commands prepare the data and cache it, then use it to train a model.
+```bash
+# First prepare the data and cache it in `path_to_cached_data`
+graphium data prepare ++datamodule.args.processed_graph_data_path=[path_to_cached_data]
+
+# Then train the model on the prepared data
+graphium-train [...] datamodule.args.processed_graph_data_path=[path_to_cached_data]
+```
+
+??? note "Config vs. Override"
+    As with any configuration, note that `datamodule.args.processed_graph_data_path` can also be specified in the configs at `expts/hydra-configs/`.
+
+??? note "Featurization" 
+    Every time the configs of `datamodule.args.featurization` change, you will need to run a new data preparation, which will automatically be saved in a separate directory that uses a hash unique to the configs.
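The commands in `docs/cli/graphium-train.md` mix three override forms: plain `key=value` (e.g. `model=gcn`), `+key=value` (e.g. `+finetuning=admet`), and `++key=value` (e.g. `++datamodule.args.processed_graph_data_path=...`). As a rough illustration of how these prefixes differ, the sketch below classifies an override string by its prefix. It is not Hydra's parser: the names `Override` and `classify_override` and the path `/tmp/cache` are hypothetical, and the mode labels reflect Hydra's documented basic override syntax as I understand it (plain sets an existing value, `+` adds a new one, `++` adds or overrides).

```python
from typing import NamedTuple


class Override(NamedTuple):
    mode: str   # "set", "add", or "add_or_set" (labels chosen for this sketch)
    key: str
    value: str


def classify_override(text: str) -> Override:
    """Split one Hydra-style override into its prefix mode, key, and value."""
    # Check the longer prefix first, otherwise "++" would match "+".
    if text.startswith("++"):
        mode, body = "add_or_set", text[2:]
    elif text.startswith("+"):
        mode, body = "add", text[1:]
    else:
        mode, body = "set", text
    key, sep, value = body.partition("=")
    if not sep or not key:
        raise ValueError(f"expected KEY=VALUE, got {text!r}")
    return Override(mode, key, value)


if __name__ == "__main__":
    examples = [
        "model=gcn",
        "+finetuning=admet",
        # Hypothetical cache path, used only for illustration.
        "++datamodule.args.processed_graph_data_path=/tmp/cache",
    ]
    for arg in examples:
        print(classify_override(arg))
```

Under that reading, using `++` in the `graphium data prepare` example means the command works whether or not the key is already present in the config.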
diff --git a/docs/cli/graphium.md b/docs/cli/graphium.md
@@ -0,0 +1,156 @@
+# `graphium`
+
+**Usage**:
+
+```console
+$ graphium [OPTIONS] COMMAND [ARGS]...
+```
+
+**Options**:
+
+* `--help`: Show this message and exit.
+
+**Commands**:
+
+* `data`: Graphium datasets.
+* `finetune`: Utility CLI for extra fine-tuning utilities.
+
+## `graphium data`
+
+Graphium datasets.
+
+**Usage**:
+
+```console
+$ graphium data [OPTIONS] COMMAND [ARGS]...
+```
+
+**Options**:
+
+* `--help`: Show this message and exit.
+
+**Commands**:
+
+* `download`: Download a Graphium dataset.
+* `list`: List available Graphium datasets.
+* `prepare`: Prepare a Graphium dataset.
+
+### `graphium data download`
+
+Download a Graphium dataset.
+
+**Usage**:
+
+```console
+$ graphium data download [OPTIONS] NAME OUTPUT
+```
+
+**Arguments**:
+
+* `NAME`: [required]
+* `OUTPUT`: [required]
+
+**Options**:
+
+* `--progress / --no-progress`: [default: progress]
+* `--help`: Show this message and exit.
+
+### `graphium data list`
+
+List available Graphium datasets.
+
+**Usage**:
+
+```console
+$ graphium data list [OPTIONS]
+```
+
+**Options**:
+
+* `--help`: Show this message and exit.
+
+### `graphium data prepare`
+
+Prepare a Graphium dataset.
+
+**Usage**:
+
+```console
+$ graphium data prepare [OPTIONS] OVERRIDES...
+```
+
+**Arguments**:
+
+* `OVERRIDES...`: [required]
+
+**Options**:
+
+* `--help`: Show this message and exit.
+
+## `graphium finetune`
+
+Utility CLI for extra fine-tuning utilities.
+
+**Usage**:
+
+```console
+$ graphium finetune [OPTIONS] COMMAND [ARGS]...
+```
+
+**Options**:
+
+* `--help`: Show this message and exit.
+
+**Commands**:
+
+* `admet`: Utility CLI to easily fine-tune a model on...
+* `fingerprint`: Endpoint for getting fingerprints from a...
+
+### `graphium finetune admet`
+
+Utility CLI to easily fine-tune a model on (a subset of) the benchmarks in the TDC ADMET group.
+
+A major limitation is that we cannot use all features of the Hydra CLI, such as multiruns.
+
+**Usage**:
+
+```console
+$ graphium finetune admet [OPTIONS] OVERRIDES...
+```
+
+**Arguments**:
+
+* `OVERRIDES...`: [required]
+
+**Options**:
+
+* `--name TEXT`
+* `--inclusive-filter / --no-inclusive-filter`: [default: inclusive-filter]
+* `--help`: Show this message and exit.
+
+### `graphium finetune fingerprint`
+
+Endpoint for getting fingerprints from a pretrained model.
+
+The pretrained model should be a `.ckpt` path or a pre-specified, named model within Graphium.
+The fingerprint layer specification should be of the format `module:layer`.
+If specified as a list, the fingerprints from all the specified layers will be concatenated.
+See the docs of the `graphium.finetuning.fingerprinting.Fingerprinter` class for more info.
+
+**Usage**:
+
+```console
+$ graphium finetune fingerprint [OPTIONS] FINGERPRINT_LAYER_SPEC... PRETRAINED_MODEL SAVE_DESTINATION
+```
+
+**Arguments**:
+
+* `FINGERPRINT_LAYER_SPEC...`: [required]
+* `PRETRAINED_MODEL`: [required]
+* `SAVE_DESTINATION`: [required]
+
+**Options**:
+
+* `--output-type TEXT`: Either numpy (.npy) or torch (.pt) output  [default: torch]
+* `-o, --override TEXT`: Hydra overrides
+* `--help`: Show this message and exit.
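The `graphium finetune fingerprint` docs state that each layer specification has the format `module:layer` and that several may be passed. A minimal sketch of validating such strings is shown below. It is an assumption-laden illustration, not the code used by `graphium.finetuning.fingerprinting.Fingerprinter`: the function names are hypothetical, the example values `module_a:1` and `module_b:2` are made up, and the layer part is kept as a string because the docs here do not say whether it is an index or a name.

```python
from typing import List, Tuple


def parse_layer_spec(spec: str) -> Tuple[str, str]:
    """Split a single 'module:layer' string into (module, layer)."""
    parts = spec.split(":")
    # Require exactly one colon with non-empty text on both sides.
    if len(parts) != 2 or not parts[0] or not parts[1]:
        raise ValueError(f"expected 'module:layer', got {spec!r}")
    return parts[0], parts[1]


def parse_layer_specs(specs: List[str]) -> List[Tuple[str, str]]:
    """Parse several specs, preserving the order given on the command line."""
    return [parse_layer_spec(spec) for spec in specs]


if __name__ == "__main__":
    # Hypothetical module names, used only for illustration.
    print(parse_layer_specs(["module_a:1", "module_b:2"]))
```

Preserving order matters if, as the docs say, fingerprints from all specified layers are concatenated.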
diff --git a/docs/cli/reference.md b/docs/cli/reference.md
@@ -0,0 +1,11 @@
+# CLI Reference
+
+Installing the Graphium library makes two CLI tools available.
+
+- [`graphium-train`](./graphium-train.md) is the hydra endpoint, specifically meant for training, fine-tuning and testing. Since this uses `@hydra.main`, it has access to all advanced hydra functionality, such as tab completion, multirun, working directory management and logging management.
+- [`graphium`](./graphium.md) is the more general CLI endpoint, organized with various subcommands.
+
+Ideally, we would've integrated both in a single CLI endpoint, but the hydra CLI cannot be a subcommand of another CLI, nor does it support easily adding subcommands, which is why we provide two separate CLI tools with different purposes.
+
+!!! note "Interactive, embedded CLI docs with `--help`"
+    In addition to these pages, you can also use `graphium --help` and `graphium-train --help` to interactively navigate the documentation of these tools directly in the CLI.
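To make the "organized with various subcommands" description concrete, here is a small sketch of the `graphium data` command group as a nested parser. The project itself builds this CLI with Typer; the sketch deliberately swaps in the standard library's `argparse` so it runs without extra dependencies, and it only mirrors the command and argument names listed in `docs/cli/graphium.md` (no real behavior is attached, and `build_parser` and the output path `./out` are hypothetical).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a nested parser shaped like `graphium data {download,list,prepare}`."""
    parser = argparse.ArgumentParser(prog="graphium")
    groups = parser.add_subparsers(dest="group", required=True)

    # First level: the `data` command group.
    data = groups.add_parser("data", help="Graphium datasets.")
    commands = data.add_subparsers(dest="command", required=True)

    # Second level: one sub-parser per documented command.
    download = commands.add_parser("download", help="Download a Graphium dataset.")
    download.add_argument("name")
    download.add_argument("output")

    commands.add_parser("list", help="List available Graphium datasets.")

    prepare = commands.add_parser("prepare", help="Prepare a Graphium dataset.")
    # One or more Hydra-style override strings, passed through untouched.
    prepare.add_argument("overrides", nargs="+")

    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["data", "download", "toymix", "./out"])
    print(args.group, args.command, args.name, args.output)
```

The two-level structure is what a Hydra entrypoint cannot slot into, which is the reason given above for shipping `graphium-train` separately.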
diff --git a/graphium/cli/__main__.py b/graphium/cli/__main__.py
@@ -1,4 +1,4 @@
-from .main import app
+from graphium.cli.main import app
 
 if __name__ == "__main__":
     app()