Skip to content

Latest commit

 

History

History
122 lines (91 loc) · 4.52 KB

README.md

File metadata and controls

122 lines (91 loc) · 4.52 KB

SVM-BPfinder - A tool for mammalian BP prediction - Python 3 & M series version

Originally written by Andre Corvelo (2010) Python 3 version and svm_light code compilation for M series provided by https://github.com/santataRU

This tool is free for purpose of academic, non-commercial research. The software must not be further distributed without prior permission of the authors.

CONTACTS:

acorvelo[at]gmail.com
eduardo.eyras[at]anu.edu.au

REFERENCE:

A. Corvelo, M. Hallegger, C.W.J. Smith, E. Eyras. (2010). Genome-wide Association between Branch Point Properties and Alternative Splicing

IMPORTANT:

SVM-BPfinder uses SVMlight:

http://osmot.cs.cornell.edu/svm_light/current/

A compiled version of 'svm_classify' for the M series (thanks https://github.com/santataRU ) is available at 'SCRIPTS/svm_classify' in this repository

Make sure you have permission to execute these three scripts:

  1. 'svm_bpfinder.py'
  2. 'SCRIPTS/svm_getfeat.py'
  3. 'SCRIPTS/svm_classify'

To know more about SVMlight, please visit http://svmlight.joachims.org/


USAGE:

svm_bpfinder.py [-h] -i FNAME -s {Hsap,Ptro,Mmul,Mmus,Rnor,Cfam,Btau}
                [-l MAXSLEN] [-d MINDIST3SS]

ARGUMENTS:

  -h, --help            show this help message and exit

  -i FNAME, --input FNAME
                        Input file name. (Multi)FASTA format only.

  -s {Hsap,Ptro,Mmul,Mmus,Rnor,Cfam,Btau}, --species {Hsap,Ptro,Mmul,Mmus,Rnor,Cfam,Btau}
                        Species under study. Case sensitive.
                        Options: Hsap (Homo sapiens) Ptro (Pan troglodytes)
                        Mmul (Macaca mulatta) Mmus (Mus musculus) Rnor (Rattus
                        norvegicus) Cfam (Canis familiaris) Btau (Bos taurus).

  -l MAXSLEN, --max-len MAXSLEN
                        Number of bases at the 3' end of the input sequences that
                        will be scanned. SVM-BPfinder assumes they correspond
                        to the 3' end of introns. For sequences of length
                        smaller than this value, the entire sequence will be
                        scanned. Default value is 100.

  -d MINDIST3SS, --min-dist MINDIST3SS
                        Distance in nucleotides allowed between the branch
                        point A and the 3' splice-site. Default value is 15.

OUTPUT:

Results are printed to STDOUT, tab delimited, one line per BP candidate. Header included.

Output fields:

seq_id  - Sequence Identifier
agez    - AG dinucleotide Exclusion Zone length
ss_dist - Distance to 3' splice-site
bp_seq  - BP sequence (9-mer; from -5 to +3 relative to the BP adenine)
bp_scr  - BP sequence score using a variable order Markov model
y_cont  - Pyrimidine content between the BP adenine and the 3' splice-site
ppt_off - Polypyrimidine tract offset relative to the BP adenine
ppt_len - Polypyrimidine tract length
ppt_scr - Polypyrimidine tract score
svm_scr - Final BP score using the SVM classifier

EXAMPLE:

The command

./svm_bpfinder.py -i test_A1CF_intron.fa -s Hsap -l 100 -d 10

scans the 3'-most 100nts of every sequence contained in the FASTA file named 'test_A1CF_intron.fa', using a human-specific model ('Hsap'), and predicts BPs as close as 10nt from the 3' splice-site.

The output should be

seq_id    agez      ss_dist   bp_seq    bp_scr    y_cont    ppt_off   ppt_len   ppt_scr   svm_scr
A1CF;chr10:52623841-52645340:-     63   89   atatgattc  -1.18388285163      0.761904761905      1         11        25        0.35390249
A1CF;chr10:52623841-52645340:-     63   76   ctgtaatac  0.931808096058      0.760563380282      13        57        99        1.1116549
A1CF;chr10:52623841-52645340:-     63   70   tactaaccg  3.97557240521       0.784615384615      7         57        99        2.6909905
A1CF;chr10:52623841-52645340:-     63   31   ttttgaccc  1.77449397266       0.807692307692      1         24        44        1.7040359
A1CF;chr10:52623841-52645340:-     63   16   ctctcaccc  2.75935904557       0.727272727273      1         9         16        1.8028404

NOTE:

SVM_BPfinder works by calling two scripts/programs:

  1. 'SCRIPTS/svm_getfeat.py' - collects features
  2. 'SCRIPTS/svm_classify' - scores candidate BPs

This creates two files in the working directory, which are removed once the final results are displayed. In case the process gets interrupted, these two files might be left on your system.