This repository contains the code used to generate the results reported in the LAGOM article.
LAGOM (Language-model Assisted Generation Of Metabolites) is a Transformer-based Chemformer model fine-tuned to predict likely metabolic transformations of drug candidates. As part of this work, we assembled a rigorously curated and standardised collection of publicly available datasets (DrugBank and MetXBioDB) specifically for metabolite prediction. To enhance model generalisation and predictive accuracy, we incorporated a range of data augmentation strategies. In addition, we explored the impact of different pre-training approaches, including general chemical pre-training using the Virtual Analogs dataset and metabolite-specific pre-training with the MetaTrans dataset.
- Installation
- Data
- Running Chemformer
- Augmentation and Annotation
- Splitting Data for an Ensemble Model
- Evaluation
- Visualisations
- Contributors
- License
Here are the instructions on how to set up this project.
First, set up Chemformer by following the instructions given in the repository aizynthmodels. This includes cloning the repository using Git and creating the conda environment called aizynthmodels. The release version used in this project was aizynthmodels 1.0.0 and we used Python 3.10.16.
Then clone this repository using Git:
```
git clone https://github.com/tsofiac/LAGOM.git
cd LAGOM
```

Download this additional package:
```
conda activate aizynthmodels
pip install chembl_structure_pipeline==1.2.2
```

The pre-trained Chemformer model used in this project is available via a link provided in the README file in the chemformer repository. We used the public model `models/pre-trained/combined/step=1000000.ckpt` with updated checkpoints.
Before proceeding with data extraction and curation, create a folder named dataset and, within it, the following subfolders:
- `raw_data`: Place the raw and unmodified data files here.
- `extracted_data`: This is where data files will be located after running the load scripts specified later.
- `curated_data`: Location for data files obtained after the curation process.
- `fine_tuning`: Location for data files that will be used for fine-tuning. These are also created after the curation process.
- `removed_data`: For the data that is removed during the curation process.
- `VA`: For processed files of the Virtual Analogs (VA) dataset. This dataset gets a separate folder, as these files are too large to be committed to GitHub.
- `VA_removed`: Subfolder within the `VA` folder for removed data associated with the VA dataset.
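The folder layout above can be created in one go. A minimal sketch, run from the repository root:

```python
from pathlib import Path

# Subfolders of dataset/ described above; VA_removed lives inside VA
subfolders = [
    "raw_data", "extracted_data", "curated_data", "fine_tuning",
    "removed_data", "VA", "VA/VA_removed",
]
for sub in subfolders:
    Path("dataset", sub).mkdir(parents=True, exist_ok=True)
```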
To extract the data needed for this project, download the raw files from the webpages listed below. The versions stated in the list have been used for this project. Datasets from DrugBank and MetXBioDB are used for fine-tuning, the GLORYx test dataset is used for external evaluation, and the Virtual Analogs (VA) and MetaTrans datasets are used for additional pre-training. All datasets are publicly available, but a license was required to obtain data from DrugBank. For this project, a free academic license was used.
- DrugBank (Version 5.1.13)
- drugbank_drug_structures.sdf
- drugbank_metabolite_structures.sdf
- drugbank_external_structures.csv
- drugbank_full_database.xml
- MetXBioDB (Version NORMAN-SLE-S73.0.1.7)
- metxbiodb.csv
- GLORYx Test Dataset
- gloryx_test_dataset.json
- Virtual Analogs Dataset (Version v1)
- 1297204_Virtual_Analogs.dat
- ChEMBL35
- chembl_35_chemreps.txt
- MetaTrans Dataset
- source_train.txt
- target_train.txt
- source_valid.txt
- target_valid.txt
To extract the data, run the following files in this repository for each corresponding data source. The extracted data files will be saved in dataset/extracted_data.
- `preprocessing/load_drugbank.py` for the DrugBank dataset
- `preprocessing/load_metxbiodb.py` for the MetXBioDB dataset
- `preprocessing/load_gloryx.py` for the GLORYx test dataset
- `optimisation/load_metatrans.py` for the MetaTrans dataset
- Due to the large size of the VA dataset, it is loaded in parts:
  - Open `preprocessing/load_VA_parts.py` and uncomment a `start_row` and corresponding `end_row`
  - Run the shell script `preprocessing/submit_load_mmp.sh`
  - Repeat this for all start- and end-rows.
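The part-wise loading works by reading only one slice of rows at a time. As an illustration (not the repository's actual implementation), a half-open row slice `[start_row, end_row)` can be read lazily with `itertools.islice`, so the whole file never needs to fit in memory:

```python
import itertools

def load_rows(path, start_row, end_row):
    """Read rows start_row..end_row-1 of a large text file, one slice at a time."""
    with open(path) as fh:
        return list(itertools.islice(fh, start_row, end_row))

# Tiny demo file standing in for the multi-million-row VA .dat file
with open("demo.dat", "w") as fh:
    fh.writelines(f"row{i}\n" for i in range(10))

part = load_rows("demo.dat", 3, 6)  # rows 3, 4 and 5
```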
To preprocess the data, use the file `preprocessing/preprocessing_data.py`.
To combine and preprocess the DrugBank and MetXBioDB datasets, set the `name` parameter to `'LAGOM'`. LAGOM is the name of the combined fine-tuning dataset. This will generate the file for fine-tuning (`LAGOM_finetune.csv`), along with the other files, including an evaluation file and augmentation files.
(Note that the GLORYx test dataset is not preprocessed.)
For complete curation and filtering of the VA dataset, the following steps should be conducted.
The initial filtering is done separately for each split of the extracted dataset.
- Go to the file `preprocessing/preprocessing_data.py`
- Set `name = 'VA_filter_part'` and uncomment a `start_row` and corresponding `end_row` (an example is provided below):

  ```python
  if __name__ == "__main__":
      ''' rows for VA '''
      start_row = 0
      end_row = 1101304
      # start_row = 1101304
      # end_row = 2202607
      ...
      name = 'VA_filter_part'
  ```

- Run the file `preprocessing/submit_preprocess.sh`
- Repeat the steps above for all splits
- Run the file `optimisation/combine_VA.py` to combine the different VA files into one
- Go back to the file `preprocessing/preprocessing_data.py`
- Set `name = 'VA_last_filtering'`
- Run the shell script `preprocessing/submit_load_mmp.sh`
This outputs the file `VA/VA_finetune.csv`, which can be used for additional pre-training of the base Chemformer model.
For filtering of the MetaTrans dataset, proceed with the following steps.
- Open the file `preprocessing/preprocessing_data.py`
- Set `name = 'metatrans'`
- Run the script (no shell script needed)
This outputs the file `finetune/metatrans_finetune.csv`, which can be used for additional pre-training of the ChemVA model.
Chemformer model training involves three main steps: pre-training, fine-tuning, and inference scoring. Below, each step is described using an example. Note that the datasets used in these steps need to be tab-separated, with the columns `reactants` and `products`, corresponding to the datasets in `dataset/finetune`. For more detailed information about Chemformer, see the chemformer repository.
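For reference, a dataset file in the expected shape can be written with the standard library; the SMILES pair and file name below are purely illustrative:

```python
import csv

# Hypothetical parent-drug / metabolite SMILES pair, for illustration only
rows = [
    ("CC(=O)Nc1ccc(O)cc1", "O=C(O)c1ccccc1"),
]

with open("example_finetune.csv", "w", newline="") as fh:
    writer = csv.writer(fh, delimiter="\t")
    writer.writerow(["reactants", "products"])  # required column names
    writer.writerows(rows)
```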
Pre-training allows the model to learn chemical transformations from a large dataset. The following scripts are used:
- `pretrain_bart.yaml`: Hyperparameter settings for pre-training
- `pretrain.sh`: SBATCH script for submitting the job
- `bart_vocab_downstream.json`: Vocabulary file
- `run_pretrain.sh`: Script to execute pre-training
In order to additionally pre-train a model:
1. Edit `run_pretrain.sh` as needed, specifying:
   - Dataset file (e.g., `VA_finetune.csv` or `metatrans_finetune.csv`)
   - An already pre-trained model (ckpt file), or `null` if starting from scratch
   - Batch size (e.g., `128`)
   - Randomisation probability (e.g., `0.5`; can be set between 0 and 1)
   - Masking probability (e.g., `0.1`; can be set between 0 and 1)
2. Run the script:

   ```
   ./run_pretrain.sh
   ```
Fine-tuning adapts a pre-trained model to a specific task, such as metabolite prediction. The required scripts are analogous to pre-training:
- `fine_tune_bart.yaml`: Hyperparameters for fine-tuning
- `fine_tune.sh`: SBATCH script
- `bart_vocab_downstream.json`: Vocabulary file
- `run_fine_tuning.sh`: Script to run fine-tuning
In order to fine-tune the model for the task of metabolite prediction:
1. Edit `run_fine_tuning.sh`:
   - Fine-tuning dataset file (e.g., `LAGOM_finetune.csv`)
   - A pre-trained model (ckpt file)
   - Batch size (e.g., `64`)
2. Run the script:

   ```
   ./run_fine_tuning.sh
   ```
To evaluate a trained model, the following scripts are used:
- `inference_score.yaml`: Hyperparameters for inference scoring
- `inference_score.sh`: SBATCH script
- `bart_vocab_downstream.json`: Vocabulary file
- `run_inference_scoring.sh`: Script for evaluation
To evaluate the model, `run_inference_scoring.sh` is adjusted accordingly:

1. Edit `run_inference_scoring.sh`:
   - Evaluation dataset (e.g., `LAGOM_evaluation_finetune.csv` or `gloryx_finetune.csv`)
   - A pre-trained model (ckpt file)
   - Batch size (e.g., `38`)
   - Beam search parameter for the number of predictions per input (e.g., `20`)
2. Run the script:

   ```
   ./run_inference_scoring.sh
   ```
To augment the training dataset with parent-parent or parent-grandchild reactions, or to annotate the dataset with either Csp3 fraction or LogP, proceed with the steps below.
- Open the file `optimisation/augmentation_annotation.py`
- Set the relevant annotation or augmentation techniques to `True`. The example below shows how to augment the data with parent-parent reactions.
- Both annotation techniques can be applied together, and both augmentation techniques (parent-parent and parent-grandchild reactions) can also be used together. However, note that augmentation and annotation are conducted separately.

```python
if __name__ == "__main__":
    logp_annotations = False
    csp3_annotations = False
    augment_parent_grandchild = False
    augment_parent_parent = True
```

- Run the script

Note: Specific data files are required for the augmentation techniques (namely `curated_data/augmented_parent_grandchild.csv` and `curated_data/augmented_parent_parent.csv`). Please ensure these files are available before proceeding with the augmentation process. They are generated when running `preprocessing/preprocessing_data.py` with the `name` parameter set to `'LAGOM'`.
For inference scoring of a model that has been fine-tuned on annotated data, a corresponding annotated test set should be used. By default, if you select an annotation technique in the script, an annotated version of the LAGOM test set will also be created, e.g. `dataset/finetune/logp_evaluation_finetune.csv`.
The data can be split at random (stratified split), or split based on the Butina clustering algorithm (parent or child similarity).
To split based on similarity (a prerequisite for child-/parent-based splitting), the following preparation steps should be conducted.

- Open the file `optimisation/tb_cluster.py`
- Set `name = 'child'` or `'parent'`, depending on whether you want to cluster based on the metabolites or the drugs, respectively
- Run the script
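Butina clustering greedily picks the molecule with the most neighbours above a similarity cutoff as a cluster centroid, assigns those neighbours to its cluster, and repeats on the remainder. A self-contained sketch of the idea (the actual script works from molecular fingerprints; here a precomputed similarity matrix stands in):

```python
def butina_cluster(sim, cutoff=0.6):
    """Greedy Butina clustering over a symmetric similarity matrix.
    The unassigned molecule with the most unassigned neighbours above
    the cutoff becomes a centroid; it and those neighbours form a cluster."""
    n = len(sim)
    neighbours = {
        i: {j for j in range(n) if j != i and sim[i][j] >= cutoff}
        for i in range(n)
    }
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        centroid = max(unassigned, key=lambda i: len(neighbours[i] & unassigned))
        members = {centroid} | (neighbours[centroid] & unassigned)
        clusters.append(sorted(members))
        unassigned -= members
    return clusters

# Toy similarity matrix: molecules 0-2 are similar, molecule 3 is a singleton
sim = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.85, 0.2],
    [0.8, 0.85, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]
clusters = butina_cluster(sim, cutoff=0.6)  # [[0, 1, 2], [3]]
```

Splitting by cluster (rather than by row) keeps similar parents or children out of both the train and test sets at once.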
For all different splitting methods:

- Go to the file `optimisation/split_for_ensemble.py`
- Set `split_type` to the splitting method of choice, i.e., `'stratified'`, `'parents'` or `'children'`
The separate files with the different splits, e.g. `dataset/finetune/strat_split1_finetune.csv`, can then be used for individual fine-tuning.
To read and evaluate the predictions obtained from the inference scoring, the file evaluation/prediction_reader.py can be used. This is where scores such as recall and precision are obtained.
All main choices are specified in these lines:
```python
if __name__ == "__main__":
    benchmark = False    # True if GLORYx, False if Evaluation
    status = 'score'     # 'score' 'combine' 'new'
    name = 'chemVA-Met_base_05'
    specification = 0    # 0 (all) 1 (only_child) 2 (more than 1) 3 (more than 2)
    fingerprint = None   # 1 (similarity = 1) 0.8 (similarity >= 0.8) None (exact SMILES string)
```

Note: By default, the file `results/evaluation/scores/predictions0.json` is saved when running inference scoring. Thus, if no adjustments are made, `evaluation/prediction_reader.py` should be used before running a new inference scoring.
If predictions have been made on the LAGOM test set (during inference scoring), set `benchmark` to `False`. If they have been made on the GLORYx test set, set it to `True`.
When evaluating an inference scoring for the first time, set `status = 'new'` and choose a suitable `name`, which will be used when generating the new files, i.e. `f"evaluation/scores/predictions_{name}.csv"` and `f"evaluation/scores/result_{name}.csv"`. Unless you are interested in specific predictions, these files do not have to be opened; all scores are printed. For printing the scores again at a later point, set `status = 'score'` and use the same `name` as previously.
`specification` should be set to `0` to evaluate predictions for all drugs. However, if interested in how the model performed on, for example, only drugs with two or more children, `specification` can be set to `2`.
Set `fingerprint` to `None` to calculate scores based on exact SMILES matches. To instead count predictions as correct when their Morgan fingerprint Tanimoto similarity is 0.8 or higher, set `fingerprint` to `0.8`.
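Tanimoto similarity is the ratio of shared fingerprint bits to total fingerprint bits. A minimal illustration on sets of on-bit indices (in the actual evaluation these would come from Morgan fingerprints computed with RDKit; the bit sets below are hypothetical):

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity between fingerprints given as sets of on-bit indices."""
    union = len(fp1 | fp2)
    return len(fp1 & fp2) / union if union else 1.0

# Hypothetical on-bit sets for a predicted and a true metabolite
predicted = {1, 4, 7, 9}
true_metabolite = {1, 4, 7, 12}
similarity = tanimoto(predicted, true_metabolite)  # 3 shared / 5 total = 0.6
is_correct = similarity >= 0.8  # not counted correct under the 0.8 threshold
```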
To evaluate an ensemble model, each part of the ensemble has to be evaluated separately (`status = 'new'`). After running these evaluations, specify the result files in `ensemble_list` as shown below with an example. Also note the setting `samples_per_model`, which determines how many predictions per drug from each model are combined into the ensemble. However, keep in mind that if you use four models with five predictions per drug each, the total number of predictions per drug may be less than 20 (= 4 x 5), as duplicate predictions are removed. Before running the code to combine the models, set `status` to `'combine'`.
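The combination step amounts to taking the top `samples_per_model` predictions from each model and dropping duplicates. A sketch for a single drug, with hypothetical model outputs:

```python
def combine_ensemble(per_model_preds, samples_per_model=5):
    """Merge the top-k predictions per model, removing duplicate SMILES
    while keeping first-seen order."""
    combined, seen = [], set()
    for preds in per_model_preds:
        for smiles in preds[:samples_per_model]:
            if smiles not in seen:
                seen.add(smiles)
                combined.append(smiles)
    return combined

# Two hypothetical models, top 3 predictions each for one drug
model_a = ["C", "CC", "CCC"]
model_b = ["CC", "C=O", "CCC"]
combined = combine_ensemble([model_a, model_b], samples_per_model=3)
# 4 unique predictions remain, fewer than 2 x 3 = 6, because duplicates are removed
```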
```python
ensemble_list = [
    "evaluation/scores/ensemble/result_ChemVA-Met_Ssplit1.csv",
    "evaluation/scores/ensemble/result_ChemVA-Met_Ssplit2.csv",
    "evaluation/scores/ensemble/result_ChemVA-Met_Ssplit3.csv",
    "evaluation/scores/ensemble/result_ChemVA-Met_Ssplit4.csv",
]
samples_per_model = 5
```

The folder `for_article` contains scripts for generating figures and tables for the article.
- `for_article/data_analysis_plots.ipynb`: Data analysis of the datasets and training progress
- `for_article/boxplots_article.ipynb`: Results from evaluation
- Sofia Larsson: @tsofiac
- Miranda Carlsson: @mirandacarlsson
- Richard Beckmann: @corteswain
- Filip Miljković: @filipm90
- Rocío Mercado Oropeza: @rociomer
The software is licensed under the Apache 2.0 license (see LICENSE file), and is free and provided as-is.