Salma Sohrabi edited this page Feb 13, 2021 · 14 revisions

BMF User Guide

Summary

Knowing the basis of protein-RNA recognition is essential to understanding regulatory processes in the cell. Many RNA-binding proteins (RBPs) form complexes or have multiple domains that allow binding to the RNA molecule in a multivalent manner. Through cooperative binding, these proteins can reach higher specificity and affinity than single RNA-binding domains. However, current approaches to de novo RNA motif discovery do not take the modularity of binding events into account. Here we present Bipartite Motif Finder (BMF), an RNA motif finder that is based on a thermodynamic model of an RBP with two binding sites acting cooperatively in targeting an RNA molecule. We show that bipartite binding is a common strategy among RBPs to achieve higher levels of sequence specificity. We furthermore illustrate that the spatial geometry between the two binding sites can be learnt from bound RNA sequences and that this information enhances the model's accuracy in predicting new binding sites. These bipartite motifs are consistent with previously known motifs and binding behaviors. Our results demonstrate the importance of multivalent binding for RNA-binding proteins and highlight the value of bipartite motif models in representing the multivalency of protein-RNA interactions.

BMF is also available as a webserver:

Installation

Requirements

  • python>=3.6
  • numpy
  • cython

Installing requirements with Conda:

Create a new conda environment with python, numpy, and cython:

conda create -n bmf python=3.6 numpy cython
conda activate bmf

Installing requirements on Ubuntu without Conda:

sudo apt-get update
sudo apt-get install python3.6 python3-pip
pip3 install numpy cython

Installing requirements on MacOS with brew:

brew install python3
pip3 install numpy cython

BMF installation

  1. Optional: a faster build of BMF is available for processors that support the AVX2 extension. You can check whether AVX2 is supported by executing cat /proc/cpuinfo | grep avx2 on Linux or sysctl -a | grep machdep.cpu.leaf7_features | grep AVX2 on MacOS. If your processor supports AVX2, set the following environment variable before installing so that the faster version of BMF is compiled:
export USE_AVX=1
  2. Install BMF with pip:
pip install https://github.com/soedinglab/bipartite_motif_finder/releases/download/v1.0.0a/bmf_tool-1.0.0.tar.gz
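The manual AVX2 check in step 1 can also be scripted. The following is a minimal sketch, not part of BMF: the function names supports_avx2 and read_cpu_features are hypothetical, and the script only inspects the same sources as the commands above (/proc/cpuinfo on Linux, the machdep.cpu.leaf7_features sysctl key on MacOS).

```python
import platform
import subprocess


def supports_avx2(cpu_feature_text):
    """Return True if 'avx2' appears as a whole word in CPU feature text."""
    return "avx2" in cpu_feature_text.lower().split()


def read_cpu_features():
    """Collect CPU feature text from the same sources as the manual checks."""
    try:
        if platform.system() == "Linux":
            with open("/proc/cpuinfo") as handle:
                return handle.read()
        if platform.system() == "Darwin":
            result = subprocess.run(
                ["sysctl", "-n", "machdep.cpu.leaf7_features"],
                stdout=subprocess.PIPE,
                stderr=subprocess.DEVNULL,
                universal_newlines=True,
            )
            return result.stdout
    except OSError:
        # Missing file or missing sysctl binary: treat as "unknown".
        pass
    return ""


if __name__ == "__main__":
    if supports_avx2(read_cpu_features()):
        print("AVX2 detected: run 'export USE_AVX=1' before pip install")
    else:
        print("AVX2 not detected: install BMF without USE_AVX")
```

The script only reports the result. Setting USE_AVX still has to happen in the shell that runs pip install, because a child Python process cannot change its parent shell's environment.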

See BMF help page:

bmf --help

BMF guide

BMF has three main functionalities: (1) learning de novo bipartite motifs from enriched and background sequence sets, (2) plotting the motif and predicting if the RNA-binding protein has a bipartite motif or not, and (3) using the trained BMF model to predict binding to new sequences.

In the following sections, we describe how to use BMF to perform each of these functionalities.

Motif discovery

You can call the command-line tool bmf to perform de novo motif discovery. Here is a list of parameters that you can pass to bmf for training:

positional arguments:

  sequences             path to positive sequences enriched with the
                        motif.

compulsory arguments:

  --BGsequences BGSEQUENCES
                        path to background sequences.

optional arguments:

  --input_type {fasta,fastq,seq}
                        format of input sequences. Can be "fasta", "fastq",
                        or "seq". [Default:"fasta"]

  --motif_length MOTIF_LENGTH
                        the length of each core in the bipartite motif.
                        [Default:3]

  --no_tries NO_TRIES   the number of times the program is run with random
                        initializations.
                        [Default:5]

  --output_prefix OUTPUT_PREFIX
                        output file prefix. You can specify a directory e.g. 
                        "--output_prefix output_dir/my_prefix"
                        [Default:"bipartite"]

  --var_thr VAR_THR     variability threshold condition to stop ADAM
                        [Default:0.03]

  --batch_size BATCH_SIZE
                        the number of sequences processed in each batch of stochastic gradient descent.
                        [Default:512]

  --max_iterations MAX_ITERATIONS
                        max number of iterations before stopping ADAM.
                        [Default:1000]

  --no_cores NO_CORES   the number of CPU cores used.
                        [Default:4]

Run BMF with multiple random initializations

You can run BMF with n random parameter initializations by specifying --no_tries n. Even though BMF is robust to parameter initialization, running several initializations makes it more likely that the best likelihood model is found. In our manuscript we ran BMF with --no_tries 5. The BMF workflow is designed so that, when multiple initializations are performed, the best likelihood solution is used to generate the sequence logo and to predict binding.

Output file name

You can specify the output file name with --output_prefix path-to-file/file-name. BMF will generate the following outputs for each round of parameter initialization i (i is between 1 and n for n initializations):

  • Plots of parameter changes over iterations and training set ROC curve: path-to-file/file-name_cs{motif_length}_{i}.pdf & .png
  • Model parameters: path-to-file/file-name_cs{motif_length}_{i}.txt
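
To check that a training run produced every file listed above, the naming pattern can be expanded programmatically. This is a small sketch under the naming scheme described in this section; the helper expected_outputs is hypothetical and not part of BMF.

```python
def expected_outputs(output_prefix, motif_length, no_tries):
    """List the files BMF is documented to write for each initialization."""
    files = []
    for i in range(1, no_tries + 1):  # i runs from 1 to n, as described above
        stem = f"{output_prefix}_cs{motif_length}_{i}"
        files.extend([stem + ".pdf", stem + ".png", stem + ".txt"])
    return files


if __name__ == "__main__":
    for path in expected_outputs("output_dir/my_prefix", 3, 2):
        print(path)
```

Combined with os.path.exists, the returned list gives a quick way to confirm that all initializations finished before calling bmf_logo.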

Input file formats

You can run BMF with traditional "fasta" and "fastq" file formats. Additionally you can provide just the sequences in the following format which we refer to as "seq":

AGGCTCGGTTACGTGCAGGGCCTGATGTTCTTGATCTGTT
CTTCCAAGGAAGCTTTGACTCACAGAAATGGTAAAGTCCA
TCCCTTCGCTAAGTAGGGACGCCTCGGGCGAGACAATAGC
GAGGTGGGCTCGCGTACCTCACTTACACCATGCGCCTCAT
...

Note: The input sequences should be of equal lengths, and can only consist of the characters: A, C, G, T, U, and N.
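
The two constraints in the note can be checked before training, and the same step can turn a FASTA file into the "seq" format. The sketch below is illustrative only: read_fasta_records and check_sequences are hypothetical helpers, not BMF functions, and rejecting lowercase letters is a conservative assumption, since the note lists only uppercase characters.

```python
ALLOWED = set("ACGTUN")  # characters listed in the note above


def read_fasta_records(fasta_text):
    """Return the sequences from FASTA text, joining wrapped lines."""
    sequences, current = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header closes the previous record, if there was one.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:
        sequences.append("".join(current))
    return sequences


def check_sequences(sequences):
    """Raise ValueError if lengths differ or a disallowed character occurs."""
    lengths = {len(seq) for seq in sequences}
    if len(lengths) > 1:
        raise ValueError(f"sequences differ in length: {sorted(lengths)}")
    for number, seq in enumerate(sequences, start=1):
        bad = set(seq) - ALLOWED
        if bad:
            raise ValueError(f"sequence {number} contains {sorted(bad)}")
    return sequences


if __name__ == "__main__":
    example = ">read1\nAGGCTCGG\n>read2\nCTTC\nCAAG\n"
    # One sequence per line is the "seq" format described above.
    print("\n".join(check_sequences(read_fasta_records(example))))
```

Writing the printed lines to a file yields input that can be passed to bmf with --input_type seq.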

Generating motif logo

You can generate BMF logo plots from the parameter files produced by bmf in the previous step. To do so, call bmf_logo with the following parameters:

positional arguments:
  parameter_prefix      path-to-bmf-param-file that specifies model parameters or
                        when multiple parameters exist, their common root.

optional arguments:
  --motif_length MOTIF_LENGTH
                        the length of each core in the bipartite motif
                        [Default:3]

Please note that parameter_prefix corresponds to output_prefix in the previous step. When multiple initializations were used, bmf_logo reads all of the parameter files and selects the best likelihood solution to generate the motif logo.

The BMF logo plot is stored at {parameter_prefix}_seqLogo.pdf & .png.

Predict binding to new sequences

You can use the trained BMF model parameters to predict binding scores for new sequences. To do so you should run bmf with --predict. Here is a list of parameters that you can pass to bmf for predicting:

positional arguments:

  sequences             path to test sequences.

compulsory arguments:

  --predict

  --model_parameters MODEL_PARAMETERS
                        path to .txt file that specifies model parameters, 
                        or the output_prefix used when training bmf.

optional arguments:

  --input_type {fasta,fastq,seq}
                        format of input sequences. Can be "fasta", "fastq",
                        or "seq". 
                        [Default:"fasta"]

  --motif_length MOTIF_LENGTH
                        the length of each core in the bipartite motif.
                        [Default:3]

The binding score for each sequence is saved in the file {model_parameters}.predictions. Note: these values correspond to the summation of statistical weights over all possible configurations. Higher values correspond to a higher binding probability. Based on our thermodynamic model, these values can be converted to binding probabilities with the following formula:

Example workflow

You can find the fasta files needed to run this example in the data directory. Here we run BMF with one random parameter initialization. You can change --no_tries to increase the number of BMF runs with new initial parameter values. In that case, the best likelihood solution is used to plot the BMF logo and to predict binding to new sequences.

Motif discovery

You can use bmf in training mode for de novo motif discovery. By default, BMF runs over a maximum of 1000 iterations.

bmf positives_AAA_CCC.fasta --BGsequences negatives_AAA_CCC.fasta --input_type fasta --output_prefix AAA_CCC --motif_length 3 --no_tries 1

Getting sequence logo

You can use bmf_logo to plot the best likelihood motif model generated by BMF. Specify the output_prefix from the previous step to allow bmf_logo to find all associated parameter files. Here we use AAA_CCC to specify the outputs from the previous run:

bmf_logo AAA_CCC --motif_length 3

Predicting binding to new sequences

You can use the trained BMF model parameters to predict binding scores for new sequences. To specify --model_parameters, use the output_prefix from the first step (here AAA_CCC).

bmf test_sequences.fasta --predict --input_type fasta --model_parameters AAA_CCC --output_prefix predict_test_sequences
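
The three steps of this workflow share the same prefix and motif length, so it can help to generate the commands from one place. The sketch below only assembles the command lines shown in this example, using flags documented on this page; workflow_commands is a hypothetical helper and does not call BMF itself.

```python
import shlex


def workflow_commands(positives, negatives, test_sequences, prefix,
                      predict_prefix, motif_length=3, no_tries=1,
                      input_type="fasta"):
    """Build argument lists for training, logo plotting, and prediction."""
    train = ["bmf", positives, "--BGsequences", negatives,
             "--input_type", input_type, "--output_prefix", prefix,
             "--motif_length", str(motif_length),
             "--no_tries", str(no_tries)]
    logo = ["bmf_logo", prefix, "--motif_length", str(motif_length)]
    predict = ["bmf", test_sequences, "--predict",
               "--input_type", input_type,
               "--model_parameters", prefix,
               "--output_prefix", predict_prefix]
    return [train, logo, predict]


if __name__ == "__main__":
    commands = workflow_commands(
        "positives_AAA_CCC.fasta", "negatives_AAA_CCC.fasta",
        "test_sequences.fasta", "AAA_CCC", "predict_test_sequences")
    for command in commands:
        # Quote each argument so the line can be pasted into a shell.
        print(" ".join(shlex.quote(arg) for arg in command))
```

Each returned list can be passed to subprocess.run(command, check=True) once BMF is installed, which stops the workflow at the first failing step.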

License terms

The software is made available under the terms of the GNU General Public License v3.