deepRAM is an end-to-end deep learning toolkit for predicting protein binding sites and motifs. It lets users run experiments with many state-of-the-art deep learning methods, and it addresses the challenge of choosing model parameters with a fully automatic model selection strategy. This avoids hand-tuning and the bias it can introduce into experiments, keeping the toolkit user friendly without sacrificing flexibility. Although it was designed with ChIP-seq and CLIP-seq data in mind, it can be used for any binary classification problem on DNA/RNA sequences.
deepRAM lets users build a deep learning model by selecting its components: the input sequence representation (one-hot or k-mer embedding), whether to use a CNN and with how many layers, and whether to use an RNN and, if so, its type and number of layers. For CNNs, the user can also choose dilated convolution.
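To make the two input representations concrete, here is a minimal sketch in plain Python. It is not deepRAM's own code: the function names `one_hot` and `kmer_tokens` are hypothetical, and mapping unknown symbols such as N to an all-zero row is an assumption. The `k` and `stride` parameters mirror the `--kmer_len` and `--stride` options and their defaults.

```python
def one_hot(seq, alphabet="ACGT"):
    """Return one row per nucleotide; symbols outside the alphabet become all zeros.

    The all-zero handling of unknown symbols is an assumption for illustration.
    """
    index = {base: i for i, base in enumerate(alphabet)}
    rows = []
    for base in seq.upper():
        row = [0] * len(alphabet)
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


def kmer_tokens(seq, k=3, stride=1):
    """Split a sequence into k-mers taken every `stride` positions.

    These k-mers are the "words" that an embedding model such as word2vec sees.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


print(one_hot("ACGT"))
print(kmer_tokens("ACGTAC", k=3, stride=1))
```

With one-hot encoding every position is an independent 4-dimensional vector, whereas the k-mer route first turns the sequence into overlapping tokens whose vectors are learned. For RNA data the alphabet argument would be `"ACGU"`.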
We recommend using the Anaconda 3 platform.
- Python 3.6
- PyTorch 1.0 library (Deep learning library)
- sklearn (Machine learning library)
- gensim (library used to train word2vec algorithm)
- numpy
usage: deepRAM.py [-h] [--train_data TRAIN_DATA] [--test_data TEST_DATA]
[--data_type DATA_TYPE] [--train TRAIN]
[--predict_only PREDICT_ONLY]
[--evaluate_performance EVALUATE_PERFORMANCE]
[--models_dir MODELS_DIR] [--model_path MODEL_PATH]
[--motif MOTIF] [--motif_dir MOTIF_DIR]
[--tomtom_dir TOMTOM_DIR] [--out_file OUT_FILE]
[--Embedding EMBEDDING] [--Conv CONV] [--RNN RNN]
[--RNN_type RNN_TYPE] [--kmer_len KMER_LEN]
[--stride STRIDE] [--word2vec_train WORD2VEC_TRAIN]
[--word2vec_model WORD2VEC_MODEL]
[--conv_layers CONV_LAYERS] [--dilation DILATION]
[--RNN_layers RNN_LAYERS]
sequence specificities prediction using deep learning approach
optional arguments:
-h, --help show this help message and exit
--train_data TRAIN_DATA
path for training data with format: sequence label
--test_data TEST_DATA
path for test data containing test sequences with or
without label
--data_type DATA_TYPE
type of data: DNA or RNA. default: DNA
--train TRAIN use this option for automatic calibration, training
model using train_data and predict labels for
test_data. default: True
--predict_only PREDICT_ONLY
use this option to load pretrained model (found in
model_path) and use it to predict test sequences
(train will be set to False). default: False
--evaluate_performance EVALUATE_PERFORMANCE
use this option to calculate AUC on test_data. If
True, test_data should be format: sequence label.
default: False
--models_dir MODELS_DIR
The directory to save the trained models for future
prediction including best hyperparameters and
embedding model. default: models/
--model_path MODEL_PATH
If train is set to True, This path will be used to
save your best model. If train is set to False, this
path should have the model that you want to use for
prediction. default: BestModel.pkl
--motif MOTIF use this option to generate motif logos. default:
False
--motif_dir MOTIF_DIR
directory to save motifs logos. default: motifs
--tomtom_dir TOMTOM_DIR
directory of TOMTOM, i.e:meme-5.0.3/src/tomtom
--out_file OUT_FILE The output file used to store the prediction
probability of testing data
--Embedding EMBEDDING
Use embedding layer: True or False. default: False
--Conv CONV Use conv layer: True or False. default: True
--RNN RNN Use RNN layer: True or False. default: False
--RNN_type RNN_TYPE RNN type: LSTM or GRU or BiLSTM or BiGRU. default:
BiLSTM
--kmer_len KMER_LEN length of kmer used for embedding layer, default= 3
--stride STRIDE stride used for embedding layer, default= 1
--word2vec_train WORD2VEC_TRAIN
set it to False if you have already trained word2vec
model. If you set it to False, you need to specify the
path for word2vec model in word2vec_model argument.
default: True
--word2vec_model WORD2VEC_MODEL
If word2vec_train is set to True, This path will be
used to save your word2vec model. If word2vec_train is
set to False, this path should have the word2vec model
that you want to use for embedding layer. default:
word2vec
--conv_layers CONV_LAYERS
number of convolutional modules. default= 1
--dilation DILATION the spacing between kernel elements for convolutional
modules (except the first convolutional module).
default= 1
--RNN_layers RNN_LAYERS
number of RNN layers. default= 1
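The effect of `--conv_layers` and `--dilation` on how much of the sequence a single output unit can see can be estimated with the standard receptive-field formula for stacked convolutions. The sketch below is an illustration only: the helper name `receptive_field` and the kernel size are hypothetical, and it assumes stride-1 convolutions with no pooling between modules, which may not match deepRAM's actual architecture. Following the help text, the first module always uses dilation 1.

```python
def receptive_field(kernel_size, conv_layers, dilation):
    """Number of input positions seen by one output unit.

    Assumes stride-1 convolutions and no pooling (an assumption, not
    deepRAM's confirmed architecture). The first layer uses dilation 1;
    every later layer uses `dilation`.
    """
    field = 1
    for layer in range(conv_layers):
        d = 1 if layer == 0 else dilation
        # Each layer widens the field by (kernel_size - 1) * dilation.
        field += (kernel_size - 1) * d
    return field


for layers, dilation in [(1, 1), (3, 1), (3, 2)]:
    print(layers, dilation, receptive_field(3, layers, dilation))
```

Under these assumptions, raising the dilation widens the context covered by deeper modules without adding parameters, which is the usual motivation for dilated convolution.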
You need to install WebLogo and TOMTOM (part of the MEME Suite) to match identified motifs with known motifs of transcription factors and RBPs. See their documentation for installation and usage.
- Download deepRAM
git clone https://github.com/MedChaabane/deepRAM.git
cd deepRAM
- Install required packages
pip3 install -r Prerequisites
- Install deepRAM
python setup.py install
- ChIP-seq datasets can be downloaded from: http://tools.genes.toronto.edu/deepbind/nbtcode
- CLIP-seq datasets can be downloaded from: https://github.com/xypan1232/iDeepS/tree/master/datasets/clip
We provide two preprocessing scripts that convert these datasets into the deepRAM input data format (sequence label; see Example input data):
- preprocess_1.py is for data in a DeepBind ENCODE ChIP-seq-like format, and
- preprocess_2.py is for data in an iONMF CLIP-seq-like format.
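Before training, it can be useful to check that a preprocessed file really contains `sequence label` pairs. The following is a minimal sketch, not part of deepRAM: the function name `read_sequence_label` is hypothetical, and it assumes a gzip-compressed text file with whitespace-separated fields and labels of 0 or 1. Lines that do not fit this pattern, such as a possible header, are skipped.

```python
import gzip
import os
import tempfile


def read_sequence_label(path):
    """Yield (sequence, label) pairs from a gzip-compressed text file.

    Assumes two whitespace-separated fields per line with a 0/1 label;
    anything else (blank lines, a header) is skipped.
    """
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 2 or fields[1] not in {"0", "1"}:
                continue
            yield fields[0].upper(), int(fields[1])


# Demo on a throwaway file with made-up sequences.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write("sequence\tlabel\nACGUACGU\t1\nGGCAUUCA\t0\n")

pairs = list(read_sequence_label(demo_path))
print(pairs)
```

Counting the positive and negative labels returned by such a reader is a quick way to confirm that the conversion produced a usable binary classification set.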
python preprocess_2.py --CLIP_data datasets/CLIP-seq/1_PARCLIP_AGO1234_hg19/30000/training_sample_0/sequences.fa.gz --output CLIP_train.gz
python preprocess_2.py --CLIP_data datasets/CLIP-seq/1_PARCLIP_AGO1234_hg19/30000/test_sample_0/sequences.fa.gz --output CLIP_test.gz
python deepRAM.py --train_data CLIP_train.gz --test_data CLIP_test.gz --data_type RNA --train True --evaluate_performance True --model_path DeepBind.pkl --out_file prediction.txt --Embedding False --Conv True --RNN False --conv_layers 1
python deepRAM.py --test_data CLIP_test.gz --data_type RNA --predict_only True --model_path DeepBind.pkl --motif True --motif_dir motifs --tomtom_dir meme-5.0.3/src/tomtom --out_file prediction.txt --Embedding False --Conv True --RNN False --conv_layers 1
Make sure to specify the TOMTOM directory with the --tomtom_dir argument.
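The AUC reported with `--evaluate_performance True` can be read as the probability that a randomly chosen bound sequence receives a higher predicted score than a randomly chosen unbound one. The sketch below computes it directly from that definition in plain Python. It is not deepRAM's implementation, the function name `auc` is hypothetical, and the labels and scores are made-up values. In practice `sklearn.metrics.roc_auc_score` computes the same quantity far more efficiently.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count half).

    Quadratic in the number of examples; meant to show the definition,
    not to be used on large test sets.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and prediction probabilities for illustration.
print(auc([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6]))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.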