ORFanage: Ultra-efficient and sensitive method to search for ORFs in spliced genomes guided by reference annotation to maximize protein similarity within genes.

License: GPLv3

ORFanage finds the best-matching ORF for each transcript in a query GTF file, based on evidence from one or more reference annotations. It identifies cases where a known ORF fits the query transcript either unchanged or with modifications introduced by additional exons, alternative start and end sites, and so on. ORFanage also quantifies the changes to the reference annotation that this splice variation introduces.

Varabyou, A., Erdogdu, B., Salzberg, S. L., & Pertea, M. (2023). Investigating Open Reading Frames in Known and Novel Transcripts using ORFanage. bioRxiv, 2023-03.

Much more comprehensive documentation for ORFanage is available on ReadTheDocs. Please check it out for example workflows, some interesting results, and more.

By far the easiest way to install ORFanage is by using BioConda.

$ conda install -c conda-forge -c bioconda orfanage

If you want to build it from source, we recommend cloning the git repository as shown below.

$ git clone https://github.com/alevar/ORFanage.git --recursive
$ cd ORFanage
$ cmake -DCMAKE_BUILD_TYPE=Release -G "Unix Makefiles" .
$ make -j4

For a fully static build, add -DORFANAGE_STATIC_BUILD=1 to the list of arguments in the cmake command.

By default, make install will likely require administrative privileges. To specify a custom installation path, add -DCMAKE_INSTALL_PREFIX=<custom/installation/path> to the list of arguments in the cmake command.

If you are using a very old version of Git (< 1.6.5), the --recursive flag does not exist. In this case, you need to fetch the submodules separately after cloning (git submodule update --init --recursive).
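
The build options described above can be combined in a small helper. This is a minimal sketch and not part of ORFanage: the function name build_orfanage and its three positional arguments are made up for illustration, and only the git, cmake, and make invocations are taken from the instructions above.

```shell
# Sketch: configure, build, and install ORFanage from a source checkout.
# Arguments (all hypothetical, introduced for this example):
#   $1  path to the cloned ORFanage repository
#   $2  optional installation prefix (empty: keep CMake's default)
#   $3  set to 1 to request a fully static build
# The body runs in a subshell so the caller's working directory is unchanged.
build_orfanage() (
    src_dir=$1
    prefix=${2:-}
    static=${3:-0}

    cd "$src_dir" || exit 1

    # Harmless after a --recursive clone; needed if submodules were not fetched.
    git submodule update --init --recursive || exit 1

    # Collect the cmake arguments in the positional parameters.
    set -- -DCMAKE_BUILD_TYPE=Release -G "Unix Makefiles"
    if [ -n "$prefix" ]; then
        set -- "$@" "-DCMAKE_INSTALL_PREFIX=$prefix"
    fi
    if [ "$static" = 1 ]; then
        set -- "$@" -DORFANAGE_STATIC_BUILD=1
    fi

    cmake "$@" . || exit 1
    make -j4 || exit 1
    make install
)
```

For example, build_orfanage ./ORFanage "$HOME/orfanage" 1 would configure a static build and install it under $HOME/orfanage, which avoids the need for administrative privileges.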

Requirements
Operating System: GNU/Linux
Architecture: Intel/AMD platforms that support POPCNT
Compiler: GCC ≥ 4.9 or Clang ≥ 3.8
Build system: CMake ≥ 3.2
Language support: C++14
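
On Linux, one quick way to check the architecture requirement is to look for the popcnt flag in /proc/cpuinfo. The helper below is a sketch; the function name has_popcnt and its optional file argument are assumptions introduced here for illustration.

```shell
# Succeed (exit status 0) if the CPU flags in a cpuinfo file include POPCNT.
# $1 is an optional path; it defaults to /proc/cpuinfo and exists mainly so
# the check can be tried against a saved copy of the file.
has_popcnt() {
    # -w matches "popcnt" only as a whole word, so a flag such as
    # avx512_vpopcntdq does not count as a match.
    grep -qw popcnt "${1:-/proc/cpuinfo}"
}
```

Running has_popcnt && echo "POPCNT supported" reports the result for the current machine. The compiler and build system versions can be checked with gcc --version (or clang --version) and cmake --version.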

Usage: orfanage [OPTIONS] <templates>...

Arguments:

<templates> One or more GFF/GTF files with coding exons to be used as templates.

Options:

--query STRING Path to a GTF query file with transcripts to which CDSs are to be ported
--output STRING Basename for all output files generated by this software
--reference STRING Path to the reference genome file in FASTA format. This parameter is required when any of the following options is used: --cleanq, --cleant, --rescue, --pi.
--cleanq If enabled - will ensure all transcripts in the output file have valid start and end codons. This option requires the use of the --reference parameter
--cleant If enabled - will ensure all ORFs in the reference annotations start with a valid start codon and end with the first available stop codon. This option requires the use of --reference parameter
--rescue If enabled - will attempt rescuing the broken ORFs in the reference annotations. This option requires the use of --reference parameter
--lpd INT Percent difference by length between the original and reference transcripts. If -1 (default) is set - the check will not be performed.
--ilpd INT Percent difference by length of bases in frame of the reference transcript. If -1 (default) is set - the check will not be performed.
--mlpd INT Percent difference by length of bases that are in both query and reference. If -1 (default) is set - the check will not be performed.
--minlen INT Minimum length of an open reading frame to consider for the analysis
--mode STRING Which CDS to report: ALL, LONGEST, LONGEST_MATCH, BEST. Default: LONGEST_MATCH
--stats STRING Output a separate file with stats for each query/template pair
--threads INT Number of threads to run in parallel
--use_id If enabled, only transcripts with the same gene ID from the query file will be used to form a bundle. In this mode the same template transcript may be used in several bundles if it overlaps transcripts with different gene_ids.
--non_aug If enabled, non-AUG start codons in reference transcripts will not be discarded and will be considered in overlapping query transcripts on equal grounds with the AUG start codon.
--keep_cds If enabled, any CDS already present in the query will be kept unmodified.
--pi INT Percent identity between the query and template sequences. This option requires --reference parameter to be set. If enabled - will run alignment between passing pairs.
--gapo INT Gap-open penalty
--gape INT Gap-extension penalty

Help options:

--help Prints this help message.
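
Several of the options above are documented as requiring --reference. A wrapper script could verify this before starting a long run. The sketch below is hypothetical: the function names are not part of ORFanage, it only inspects the argument list, and ORFanage may well perform an equivalent check itself.

```shell
# Succeed if any argument is an option documented as needing the genome.
needs_reference() {
    for arg in "$@"; do
        case $arg in
            --cleanq|--cleant|--rescue|--pi) return 0 ;;
        esac
    done
    return 1
}

# Succeed if --reference appears among the arguments.
has_reference() {
    for arg in "$@"; do
        if [ "$arg" = "--reference" ]; then
            return 0
        fi
    done
    return 1
}

# Fail with a message when a genome-dependent option lacks --reference.
check_orfanage_args() {
    if needs_reference "$@" && ! has_reference "$@"; then
        echo "error: --cleanq, --cleant, --rescue and --pi require --reference" >&2
        return 1
    fi
    return 0
}
```

In a wrapper script, check_orfanage_args "$@" && orfanage "$@" would forward the arguments only when the check passes.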

Sample datasets are provided in the "example" directory to test and get familiar with ORFanage. The included examples can be run with the following base commands:

  1. orfanage --reference <path/to/grch38.fa> --output example/output.gtf --query example/query.gtf <--additional arguments> --stats example/stats.tsv example/template.gtf
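
After a run, a quick sanity check is to count how many transcripts in the output carry a CDS. The sketch below assumes the output is a standard nine-column, tab-separated GTF whose CDS records carry a transcript_id attribute; the function name count_cds_transcripts is made up for illustration.

```shell
# Print the number of distinct transcripts with at least one CDS record
# in the GTF file given as $1.
count_cds_transcripts() {
    awk -F '\t' '
        # Skip comment lines; keep CDS records that name their transcript.
        $0 !~ /^#/ && $3 == "CDS" && match($9, /transcript_id "[^"]+"/) {
            seen[substr($9, RSTART, RLENGTH)] = 1
        }
        END {
            n = 0
            for (k in seen) n++
            print n
        }
    ' "$1"
}
```

With the example command above, count_cds_transcripts example/output.gtf gives the number of query transcripts that were assigned an ORF.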
