Commit d6146e2 — add Related work, Models, Obtaining CIs, and Implementation sections

GeorgeBatch committed Sep 16, 2020 (1 parent: 919a6eb)
Showing 1 changed file with 35 additions and 1 deletion: README.md

In this work, I explore ways of quantifying the confidence of machine learning models used in drug discovery. To do this, I start by exploring methods to predict physicochemical properties of drugs and drug-like molecules that are crucial to drug discovery. I first attempt to reproduce and improve upon a subset of results concerning a drug's solubility in water, taken from a popular benchmark set called "MoleculeNet". Using XGBoost, which in the era of Deep Neural Networks is already classified as a "conventional" machine learning method, I show that I am able to achieve state-of-the-art results. After that, I explore Gaussian Processes and the Infinitesimal Jackknife for Random Forests, together with their associated uncertainty estimates. Finally, I attempt to understand whether the confidence of a model's prediction can be used to answer a similar but more general question: "*How do we know when to trust our models?*" The answer depends on the model: we can trust Gaussian Processes when they are confident, but the confidence estimates from Random Forests do not give us any such assurance.

## Related work

This work is mostly based on four papers:
- "MoleculeNet: A Benchmark for Molecular Machine Learning" by [Wu *et al.*](https://pubs.rsc.org/en/content/articlelanding/2018/SC/C7SC02664A#!divAbstract);
- "Learning From the Ligand: Using Ligand-Based Features to Improve Binding Affinity Prediction" by [Boyles *et al.*](https://academic.oup.com/bioinformatics/article-abstract/36/3/758/5554651?redirectedFrom=fulltext);
- "The Photoswitch Dataset: A Molecular Machine Learning Benchmark for the Advancement of Synthetic Chemistry" by [Thawani *et al.*](https://chemrxiv.org/articles/preprint/The_Photoswitch_Dataset_A_Molecular_Machine_Learning_Benchmark_for_the_Advancement_of_Synthetic_Chemistry/12609899); and
- "Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife" by [Wager *et al.*](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4286302/).

## Aims

In this dissertation I aim to achieve three primary goals:

1. Reproduce a subset of solubility-related prediction results from the MoleculeNet benchmarking paper;
2. Improve upon the reproduced results; and
3. Use uncertainty estimation methods with the best-performing models to get single prediction uncertainty estimates to evaluate and compare these methods.

## Data

I used the [MoleculeNet dataset](http://moleculenet.ai/datasets-1) which accompanies the [MoleculeNet benchmarking paper](https://pubs.rsc.org/en/content/articlelanding/2018/sc/c7sc02664a#!divAbstract), and in particular, I focused on the Physical Chemistry datasets: [ESOL](https://pubs.acs.org/doi/10.1021/ci034243x), [FreeSolv](https://link.springer.com/article/10.1007/s10822-014-9747-x), and [Lipophilicity](https://onlinelibrary.wiley.com/doi/abs/10.1002/cem.2718). The MoleculeNet datasets are widely used to validate machine learning models used to estimate a particular property directly from small molecules including drug-like compounds.

## Models

I use the following four models for the regression task of physicochemical property prediction:

- [Kernel Ridge Regression](https://www.ics.uci.edu/~welling/classnotes/papers_class/Kernel-Ridge.pdf)
- [eXtreme Gradient Boosting (XGBoost)](https://dl.acm.org/doi/10.1145/2939672.2939785)
- [Random Forests](https://link.springer.com/article/10.1023/A:1010933404324)
- [Gaussian Processes](http://www.gaussianprocess.org/gpml/)
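As a rough illustration of how these four model families plug into one regression workflow, here is a minimal scikit-learn sketch on synthetic data standing in for featurised molecules. It is not the repository's actual pipeline; in particular, `GradientBoostingRegressor` is used as a stand-in for XGBoost so the sketch needs no extra dependency.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split

# Toy regression data standing in for featurised molecules.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "KRR": KernelRidge(kernel="rbf", alpha=1.0),
    "GBM": GradientBoostingRegressor(random_state=0),  # stand-in for XGBoost
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
    "GP": GaussianProcessRegressor(alpha=1e-2),
}
# R^2 on the held-out split for each model family.
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```

In the real experiments, `X` would be molecular descriptors or fingerprints and the hyperparameters would be tuned per dataset.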

## Obtaining Confidence Intervals

I obtain per-prediction confidence intervals with:

- Gaussian Processes ([notes, chapter 7, section 7.2](https://github.com/ywteh/advml2020/blob/master/notes.pdf))
- Bias-corrected Infinitesimal Jackknife estimate for Random Forests ([paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4286302/))
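For the Gaussian Process case, a per-prediction interval falls straight out of the predictive mean and standard deviation. A minimal sketch with scikit-learn's `GaussianProcessRegressor` (the dissertation itself uses GPflow, and the Random Forest intervals come from the Infinitesimal Jackknife instead):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Noisy 1-D toy data.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=60)

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X, y)

# Predictive mean and standard deviation at new points.
X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
mean, std = gp.predict(X_new, return_std=True)

# Approximate 95% confidence interval for each prediction.
lower, upper = mean - 1.96 * std, mean + 1.96 * std
```

The width of `upper - lower` is what lets us ask whether confident predictions are also accurate ones.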

## Implementation

All the data preparation, experiments, and visualisations were done in Python.

To convert molecules from their [SMILES](https://pubs.acs.org/doi/abs/10.1021/ci00057a005) string representations to either Molecular Descriptors or [Extended-Connectivity Fingerprints](https://pubs.acs.org/doi/10.1021/ci100050t), I used the open-source cheminformatics software, [RDKit](https://www.rdkit.org/).
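A minimal RDKit sketch of that featurisation step, using aspirin as an example molecule (the exact fingerprint radius, bit length, and descriptor set used in the experiments may differ):

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors

smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # aspirin
mol = Chem.MolFromSmiles(smiles)

# 2048-bit Morgan fingerprint with radius 2 (equivalent to ECFP4).
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
bits = list(fp)

# Two examples of molecular descriptors computed by RDKit.
mol_weight = Descriptors.MolWt(mol)
log_p = Descriptors.MolLogP(mol)
```

Each molecule thus becomes a fixed-length numeric vector that any of the regression models above can consume.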

[Wu *et al.*](https://pubs.rsc.org/en/content/articlelanding/2018/SC/C7SC02664A#!divAbstract) suggest using their Python library, [DeepChem](https://www.deepchem.io/), to reproduce the results. I decided not to use it, since its API only exposes high-level functionality, while I wanted more control over the implementation. To keep the results comparable, I instead used the tools on which the DeepChem library is built.

For most of the machine learning pipeline, I used Scikit-Learn ([article](https://www.jmlr.org/papers/v12/pedregosa11a.html), [GitHub](https://github.com/scikit-learn/scikit-learn)) for preprocessing, splitting, modelling, prediction, and validation. To obtain the confidence intervals for Random Forests, I used the forestci ([article](https://joss.theoj.org/papers/10.21105/joss.00124), [GitHub](https://github.com/scikit-learn-contrib/forest-confidence-interval)) extension for Scikit-Learn. The implementation of a custom Tanimoto (Jaccard) kernel for Gaussian Process Regression and all the following GP experiments were performed with GPflow ([article](http://jmlr.org/papers/v18/16-537.html), [GitHub](https://github.com/GPflow/GPflow)).
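On binary fingerprints, the Tanimoto (Jaccard) kernel is k(x, x') = ⟨x, x'⟩ / (‖x‖² + ‖x'‖² − ⟨x, x'⟩). The following numpy sketch shows the idea behind such a custom kernel; it is purely illustrative and independent of the actual GPflow implementation used in the experiments.

```python
import numpy as np

def tanimoto_kernel(A, B):
    """Tanimoto similarity between the rows of two binary fingerprint matrices."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    cross = A @ B.T                         # <x, x'> for every pair of rows
    norm_a = (A * A).sum(axis=1)[:, None]   # ||x||^2
    norm_b = (B * B).sum(axis=1)[None, :]   # ||x'||^2
    return cross / (norm_a + norm_b - cross)

# Three toy 4-bit fingerprints.
fps = np.array([[1, 1, 0, 1],
                [1, 0, 0, 1],
                [0, 0, 1, 0]])
K = tanimoto_kernel(fps, fps)  # 3x3 Gram matrix with ones on the diagonal
```

In GPflow, the same formula is wrapped in a `Kernel` subclass so that the GP regression machinery can use it directly.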

# Set-up

The resulting files are saved in the `~/data/` directory:
- `freesolv_original_IdSmilesLabels.csv`
- `lipophilicity_original_IdSmilesLabels.csv`

**Note:** the original file for the ESOL dataset also contained extra features, which we also save here.

### Compute and Store Features

