Graphsite-classifier is a deep graph neural network for classifying ligand-binding sites on proteins. It is implemented with PyTorch and PyTorch Geometric. During training, the binding sites are transformed on the fly into graphs that contain both spatial and chemical features. A customized graph neural network (GNN) classifier is then trained on the graph representations of the binding pockets. The following figure illustrates the application pipeline:
For more details, please refer to our paper. If you find this repo useful in your work, please cite our paper :)
GraphSite: Ligand Binding Site Classification with Deep Graph Learning
Wentao Shi, Manali Singha, Limeng Pu, Gopal Srivastava, Jagannathan Ramanujam, and Michal Brylinski
Biomolecules 12, no. 8 (2022): 1053
The dataset used in the experiment can be accessed via this OSF repo. The dataset consists of 21,125 binding pockets which are grouped into 14 classes. The details of the classes are described here. There are three files needed for training:
- clusters.yaml: contains the initial clustering information of the binding sites. Multiple clusters will be merged into one class before training.
- pocket-dataset.tar.gz: contains all binding site data in this project.
- pops-dataset.tar.gz: contains the contact surface area information used as a node feature.
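A minimal sketch of unpacking the two archives with the Python standard library. It assumes the archives were downloaded into the current directory under the names listed above; the destination directory `data` and the helper name `extract_archive` are hypothetical.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return its top-level names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    # Keep only the first path component of every member.
    return sorted({name.split("/")[0] for name in names})


if __name__ == "__main__":
    # Archive names are taken from the list above; "data" is an assumed target.
    for archive in ("pocket-dataset.tar.gz", "pops-dataset.tar.gz"):
        if Path(archive).exists():
            print(archive, "->", extract_archive(archive, "data"))
        else:
            print(archive, "not found; download it first")
```

The returned top-level names are the directories that the training configuration later needs to point at.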
If you want to generate your own data, the procedures and scripts to create the .mol2, .pops, and .profile files can be seen here.
The training and inference Python scripts have several dependencies:
- PyTorch
- PyTorch Geometric
- Numpy
- PyYAML
- BioPandas
- Pandas
- Scikit-learn
- Matplotlib
- SciPy
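Before running the scripts, it can help to confirm that every dependency is importable. The sketch below uses only the standard library and maps each package to its usual import name; the helper name `missing_dependencies` is hypothetical, and no particular versions are assumed.

```python
from importlib.util import find_spec

# Usual import name for each dependency in the list above.
REQUIRED_MODULES = {
    "PyTorch": "torch",
    "PyTorch Geometric": "torch_geometric",
    "NumPy": "numpy",
    "PyYAML": "yaml",
    "BioPandas": "biopandas",
    "Pandas": "pandas",
    "Scikit-learn": "sklearn",
    "Matplotlib": "matplotlib",
    "SciPy": "scipy",
}


def missing_dependencies(modules=REQUIRED_MODULES):
    """Return the display names of packages whose module cannot be found."""
    return [name for name, module in modules.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_dependencies()
    print("missing:", ", ".join(missing) if missing else "none")
```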
The entire graph neural network implementation is in ./gnn. The training configuration is in ./gnn/train_classifier.yaml. To use the default architecture and hyperparameters for training, which we recommend, the user only has to make the following modifications:
- set cluster_file_dir to the path of the clusters.yaml you downloaded.
- set pocket_dir to the path of the uncompressed dataset.tar.gz you downloaded.
- set pop_dir to the path of the uncompressed pops.tar.gz you downloaded.
- set trained_model_dir to the directory where you want the trained model to be saved.
- set loss_dir and confusion_matrix_dir to the directory where you want to save other training results.
If you want to experiment with the model, feel free to tune the hyperparameters and try other models.
After the training configurations are set, simply run:
cd ./gnn
python train_classifier.py
The inference script requires 3 input arguments:
- unseen_data_dir: directory of unseen data. For each pocket, there should be 3 associated files: .mol2, .pops, and .profile. For example, a pocket on protein 6af2A needs the following 3 files: 6af2A.pops, 6af2A.profile, and 6ag5A00.mol2.
- unseen_data_classes: a YAML file containing 14 lists which represent the classes of data. If there is no data in a class, it should correspond to an empty list. See unseen-pocket-lists.yaml as an example.
- trained_model: the path to the trained model.
After the inference data are prepared, run the following script to test the model:
python inference.py -unseen_data_dir ../unseen-data/unseen_pdb/ -unseen_data_classes ../unseen-data/unseen-pocket-list_new.yaml -trained_model ../trained_models/trained_classifier_model_63.pt
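If inference fails on missing inputs, a quick check of the unseen data directory can narrow down the cause. The sketch below assumes a hypothetical layout in which each pocket has its own subdirectory under unseen_data_dir; adjust it if your data are organized differently. It only verifies that the three required file types are present, not that their contents are valid.

```python
from pathlib import Path

# File types required for every pocket, as described above.
REQUIRED_EXTENSIONS = (".mol2", ".pops", ".profile")


def missing_pocket_files(pocket_dir):
    """Return the required extensions that have no matching file in pocket_dir."""
    found = {path.suffix for path in Path(pocket_dir).iterdir() if path.is_file()}
    return [ext for ext in REQUIRED_EXTENSIONS if ext not in found]


def report_missing(unseen_data_dir):
    """Map each incomplete pocket subdirectory name to its missing extensions."""
    report = {}
    for pocket_dir in sorted(Path(unseen_data_dir).iterdir()):
        if pocket_dir.is_dir():
            missing = missing_pocket_files(pocket_dir)
            if missing:
                report[pocket_dir.name] = missing
    return report
```

An empty dictionary from `report_missing("../unseen-data/unseen_pdb/")` means every pocket subdirectory contains all three file types.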