Merge pull request #441 from cwognum/main
WIP: Various improvements to the CLI endpoints
DomInvivo authored Aug 29, 2023
2 parents 0fdd458 + 3c9e327 commit 3aa098b
Showing 18 changed files with 671 additions and 107 deletions.
8 changes: 7 additions & 1 deletion .github/workflows/doc.yml
@@ -31,7 +31,9 @@ jobs:
cache-downloads: true

- name: Install library
run: python -m pip install --no-deps .
run: |
python -m pip install --no-deps .
pip install typer-cli
- name: Configure git
run: |
@@ -40,6 +42,10 @@ jobs:
- name: Deploy the doc
run: |
echo "Auto-generating typer docs"
typer graphium.cli.__main__ utils docs --name graphium --output docs/cli/graphium.md
echo "Get the gh-pages branch"
git fetch origin gh-pages
50 changes: 2 additions & 48 deletions README.md
@@ -65,55 +65,9 @@ The above step needs to be done once. After that, enable the SDK and the environ
source enable_ipu.sh .graphium_ipu
```

## Training a model
## The Graphium CLI

To learn how to train a model, we invite you to look at the documentation, or the jupyter notebooks available [here](https://github.com/datamol-io/graphium/tree/master/docs/tutorials/model_training).

If you are not familiar with [PyTorch](https://pytorch.org/docs) or [PyTorch-Lightning](https://pytorch-lightning.readthedocs.io/en/latest/), we highly recommend going through their tutorial first.

## Running an experiment
We have set up Graphium with `hydra` for managing config files. To run an experiment, go to the `expts/` folder. For example, to benchmark a GCN on the ToyMix dataset, run:
```bash
graphium-train dataset=toymix model=gcn
```
To change parameters specific to this experiment like switching from `fp16` to `fp32` precision, you can either override them directly in the CLI via
```bash
graphium-train dataset=toymix model=gcn trainer.trainer.precision=32
```
or change them permanently in the dedicated experiment config under `expts/hydra-configs/toymix_gcn.yaml`.
Integrating `hydra` also allows you to quickly switch between accelerators. E.g., running
```bash
graphium-train dataset=toymix model=gcn accelerator=gpu
```
automatically selects the correct configs to run the experiment on GPU.
Finally, you can also run a fine-tuning loop:
```bash
graphium-train +finetuning=admet
```

To use a config file you built from scratch you can run
```bash
graphium-train --config-path [PATH] --config-name [CONFIG]
```
Thanks to the modular nature of `hydra` you can reuse many of our config settings for your own experiments with Graphium.

## Preparing the data in advance
The data preparation including the featurization (e.g., of molecules from smiles to pyg-compatible format) is embedded in the pipeline and will be performed when executing `graphium-train [...]`.

However, when working with larger datasets, it is recommended to perform data preparation in advance using a machine with sufficient allocated memory (e.g., ~400GB in the case of `LargeMix`). Preparing data in advance is also beneficial when running lots of concurrent jobs with identical molecular featurization, so that resources aren't wasted and processes don't conflict reading/writing in the same directory.

The following command-line will prepare the data and cache it, then use it to train a model.
```bash
# First prepare the data and cache it in `path_to_cached_data`
graphium data prepare ++datamodule.args.processed_graph_data_path=[path_to_cached_data]

# Then train the model on the prepared data
graphium-train [...] datamodule.args.processed_graph_data_path=[path_to_cached_data]
```

**Note** that `datamodule.args.processed_graph_data_path` can also be specified in the configs at `expts/hydra-configs/`.

**Note** that, every time the configs of `datamodule.args.featurization` changes, you will need to run a new data preparation, which will automatically be saved in a separate directory that uses a hash unique to the configs.
Installing `graphium` makes two CLI tools available: `graphium` and `graphium-train`. These CLI tools make it easy to access advanced functionality, such as _training a model_, _extracting fingerprints from a pre-trained model_ or _precomputing the dataset_. For more information, visit [the documentation](https://graphium-docs.datamol.io/stable/cli/reference.html).

## License

8 changes: 8 additions & 0 deletions docs/api/graphium.finetuning.md
@@ -0,0 +1,8 @@
::: graphium.finetuning.fingerprinting.Fingerprinter
options:
filters: ["!^_"]
separate_signature: true
show_signature_annotations: true
line_length: 80
merge_init_into_class: true
members_order: source
54 changes: 54 additions & 0 deletions docs/cli/graphium-train.md
@@ -0,0 +1,54 @@
# `graphium-train`

To support advanced configuration, Graphium uses [`hydra`](https://hydra.cc/) to manage and write config files. A limitation of `hydra` is that it is designed to function as the main entrypoint for a CLI application and does not easily support subcommands. For that reason, we introduced the `graphium-train` command in addition to the [`graphium`](./graphium.md) command.

!!! info "Curious about the configs?"
If you would like to learn more about the configs, please visit the docs [here](https://github.com/datamol-io/graphium/tree/main/expts/hydra-configs).

This page documents `graphium-train`.

## Running an experiment
To run an experiment, see the `expts/hydra-configs` folder for all available configurations. For example, to benchmark a GCN on the ToyMix dataset, run:
```bash
graphium-train dataset=toymix model=gcn
```
To change parameters specific to this experiment, such as switching from `fp16` to `fp32` precision, you can either override them directly in the CLI via
```bash
graphium-train dataset=toymix model=gcn trainer.trainer.precision=32
```
or change them permanently in the dedicated experiment config under `expts/hydra-configs/toymix_gcn.yaml`.
Integrating `hydra` also allows you to quickly switch between accelerators. For example, running
```bash
graphium-train dataset=toymix model=gcn accelerator=gpu
```
automatically selects the correct configs to run the experiment on GPU.
Finally, you can also run a fine-tuning loop:
```bash
graphium-train +finetuning=admet
```

To use a config file you built from scratch, you can run
```bash
graphium-train --config-path [PATH] --config-name [CONFIG]
```
Thanks to the modular nature of `hydra` you can reuse many of our config settings for your own experiments with Graphium.
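
To make the override syntax above more concrete, here is a minimal sketch of how a dotted override such as `trainer.trainer.precision=32` maps onto a nested config. It is an illustration only, not Hydra's implementation: the `apply_override` helper and the `base` config fragment are hypothetical, and the sketch treats the `+` and `++` prefixes like a plain override, whereas Hydra distinguishes between overriding an existing key and adding a new one.

```python
from __future__ import annotations

from copy import deepcopy
from typing import Any


def apply_override(config: dict[str, Any], override: str) -> dict[str, Any]:
    """Return a copy of `config` with one `dotted.key=value` override applied."""
    key, _, raw_value = override.partition("=")
    # Simplification: drop Hydra's `+` / `++` prefixes instead of honouring them.
    key = key.lstrip("+")
    # Treat purely numeric values as integers; keep everything else as a string.
    value: Any = int(raw_value) if raw_value.isdigit() else raw_value

    updated = deepcopy(config)
    node = updated
    *parents, leaf = key.split(".")
    for part in parents:
        # Create intermediate sections on demand.
        node = node.setdefault(part, {})
    node[leaf] = value
    return updated


# Hypothetical config fragment; the real Graphium configs contain many more keys.
base = {"trainer": {"trainer": {"precision": 16}}}
overridden = apply_override(base, "trainer.trainer.precision=32")
print(overridden)
```

The original `base` dictionary is left untouched, which mirrors the idea that a CLI override affects a single run while the config file on disk keeps its value.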

### Preparing the data in advance
The data preparation, including the featurization (e.g., of molecules from SMILES to a PyG-compatible format), is embedded in the pipeline and will be performed when executing `graphium-train [...]`.

However, when working with larger datasets, it is recommended to perform data preparation in advance using a machine with sufficient allocated memory (e.g., ~400GB in the case of `LargeMix`). Preparing data in advance is also beneficial when running lots of concurrent jobs with identical molecular featurization, so that resources aren't wasted and processes don't conflict reading/writing in the same directory.

The following commands prepare the data and cache it, then use it to train a model.
```bash
# First prepare the data and cache it in `path_to_cached_data`
graphium data prepare ++datamodule.args.processed_graph_data_path=[path_to_cached_data]

# Then train the model on the prepared data
graphium-train [...] datamodule.args.processed_graph_data_path=[path_to_cached_data]
```

??? note "Config vs. Override"
As with any configuration, note that `datamodule.args.processed_graph_data_path` can also be specified in the configs at `expts/hydra-configs/`.

??? note "Featurization"
Every time the configs of `datamodule.args.featurization` change, you will need to run a new data preparation, which will automatically be saved in a separate directory that uses a hash unique to the configs.
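
The idea of a cache directory keyed by a hash of the featurization config can be sketched as follows. This is an assumption-laden illustration, not Graphium's actual hashing scheme: the `cache_dir_for` helper, the choice of SHA-256 over a JSON dump, and the config keys are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def cache_dir_for(base_dir: str, featurization: dict) -> Path:
    """Derive a cache directory whose name depends only on the config contents."""
    # Serialize with sorted keys so that equal configs always yield the same hash.
    payload = json.dumps(featurization, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:16]
    return Path(base_dir) / digest


# Hypothetical featurization settings, for illustration only.
config_a = {"atom_property_list_onehot": ["atomic-number"], "add_self_loop": False}
config_b = {**config_a, "add_self_loop": True}

dir_a = cache_dir_for("cache", config_a)
dir_b = cache_dir_for("cache", config_b)
print(dir_a != dir_b)  # a changed config maps to a different directory
```

Under such a scheme, concurrent jobs that share a featurization config resolve to the same directory, while any change to the config leads to a fresh directory instead of overwriting existing data.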
156 changes: 156 additions & 0 deletions docs/cli/graphium.md
@@ -0,0 +1,156 @@
# `graphium`

**Usage**:

```console
$ graphium [OPTIONS] COMMAND [ARGS]...
```

**Options**:

* `--help`: Show this message and exit.

**Commands**:

* `data`: Graphium datasets.
* `finetune`: Utility CLI for extra fine-tuning utilities.

## `graphium data`

Graphium datasets.

**Usage**:

```console
$ graphium data [OPTIONS] COMMAND [ARGS]...
```

**Options**:

* `--help`: Show this message and exit.

**Commands**:

* `download`: Download a Graphium dataset.
* `list`: List available Graphium datasets.
* `prepare`: Prepare a Graphium dataset.

### `graphium data download`

Download a Graphium dataset.

**Usage**:

```console
$ graphium data download [OPTIONS] NAME OUTPUT
```

**Arguments**:

* `NAME`: [required]
* `OUTPUT`: [required]

**Options**:

* `--progress / --no-progress`: [default: progress]
* `--help`: Show this message and exit.

### `graphium data list`

List available Graphium datasets.

**Usage**:

```console
$ graphium data list [OPTIONS]
```

**Options**:

* `--help`: Show this message and exit.

### `graphium data prepare`

Prepare a Graphium dataset.

**Usage**:

```console
$ graphium data prepare [OPTIONS] OVERRIDES...
```

**Arguments**:

* `OVERRIDES...`: [required]

**Options**:

* `--help`: Show this message and exit.

## `graphium finetune`

Utility CLI for extra fine-tuning utilities.

**Usage**:

```console
$ graphium finetune [OPTIONS] COMMAND [ARGS]...
```

**Options**:

* `--help`: Show this message and exit.

**Commands**:

* `admet`: Utility CLI to easily fine-tune a model on...
* `fingerprint`: Endpoint for getting fingerprints from a...

### `graphium finetune admet`

Utility CLI to easily fine-tune a model on (a subset of) the benchmarks in the TDC ADMET group.

A major limitation is that we cannot use all features of the Hydra CLI, such as multiruns.

**Usage**:

```console
$ graphium finetune admet [OPTIONS] OVERRIDES...
```

**Arguments**:

* `OVERRIDES...`: [required]

**Options**:

* `--name TEXT`
* `--inclusive-filter / --no-inclusive-filter`: [default: inclusive-filter]
* `--help`: Show this message and exit.

### `graphium finetune fingerprint`

Endpoint for getting fingerprints from a pretrained model.

The pretrained model should be a `.ckpt` path or pre-specified, named model within Graphium.
The fingerprint layer specification should be of the format `module:layer`.
If specified as a list, the fingerprints from all the specified layers will be concatenated.
See the docs of the `graphium.finetuning.fingerprinting.Fingerprinter` class for more info.
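
As a rough illustration of the `module:layer` format and of what concatenating fingerprints from several layers means, consider the sketch below. It does not use the `Fingerprinter` class: `LayerSpec`, `parse_layer_spec`, and `concatenate` are hypothetical helpers, the module names are made up, and the layer part is kept as a plain string because its exact type is not stated here.

```python
from __future__ import annotations

from typing import NamedTuple


class LayerSpec(NamedTuple):
    module: str
    layer: str


def parse_layer_spec(spec: str) -> LayerSpec:
    """Split a `module:layer` string into its two parts."""
    module, sep, layer = spec.partition(":")
    if not sep or not module or not layer:
        raise ValueError(f"Expected 'module:layer', got {spec!r}")
    return LayerSpec(module, layer)


def concatenate(per_layer: list[list[float]]) -> list[float]:
    """Join the feature vectors of one molecule end to end."""
    return [value for features in per_layer for value in features]


# Hypothetical module names and activations, for illustration only.
specs = [parse_layer_spec(s) for s in ["gnn:1", "graph_output_nn:0"]]
fingerprint = concatenate([[0.1, 0.2], [0.3]])
print(specs, fingerprint)
```

With two specs, the resulting fingerprint has as many entries as both layers' outputs combined.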

**Usage**:

```console
$ graphium finetune fingerprint [OPTIONS] FINGERPRINT_LAYER_SPEC... PRETRAINED_MODEL SAVE_DESTINATION
```

**Arguments**:

* `FINGERPRINT_LAYER_SPEC...`: [required]
* `PRETRAINED_MODEL`: [required]
* `SAVE_DESTINATION`: [required]

**Options**:

* `--output-type TEXT`: Either numpy (.npy) or torch (.pt) output [default: torch]
* `-o, --override TEXT`: Hydra overrides
* `--help`: Show this message and exit.
11 changes: 11 additions & 0 deletions docs/cli/reference.md
@@ -0,0 +1,11 @@
# CLI Reference

Installing the Graphium library makes two CLI tools available:

- [`graphium-train`](./graphium-train.md) is the hydra endpoint, specifically meant for training, finetuning and testing. Since this uses `@hydra.main`, it has access to all advanced hydra functionality such as tab completion, multirun, working directory management, and logging management.
- [`graphium`](./graphium.md) is the more general CLI endpoint, organized into various subcommands.

Ideally, we would've integrated both in a single CLI endpoint, but the hydra CLI cannot be a subcommand of another CLI, nor does it easily support adding subcommands, which is why we provide two separate CLI tools with different purposes.

!!! note "Interactive, embedded CLI docs with `--help`"
In addition to these pages, you can also use `graphium --help` and `graphium-train --help` to interactively navigate the documentation of these tools directly in the CLI.
2 changes: 1 addition & 1 deletion graphium/cli/__main__.py
@@ -1,4 +1,4 @@
from .main import app
from graphium.cli.main import app

if __name__ == "__main__":
app()