ORFanage: Ultra-efficient and sensitive method to search for ORFs in spliced genomes guided by reference annotation to maximize protein similarity within genes.

Introduction
Publications
Documentation
Installation
- BioConda
- Building from source
Getting started
Data

Introduction

ORFanage finds the best-matching ORF for each transcript in a query GTF file, based on evidence from one or more reference annotations. The method identifies cases where a known ORF fits the query transcript either unchanged or with modifications introduced by additional exons, alternative start and end sites, etc. ORFanage also quantifies any changes to the reference annotation that are introduced by the splice variation.

Publications

Varabyou, A., Erdogdu, B., Salzberg, S. L., & Pertea, M. (2023). Investigating Open Reading Frames in Known and Novel Transcripts using ORFanage. bioRxiv, 2023-03.

Documentation

Much more comprehensive documentation for ORFanage is provided on ReadTheDocs. Please check it out for example workflows, some interesting results and more.

Installation

BioConda

By far the easiest way to install ORFanage is by using BioConda.

$ conda install -c conda-forge -c bioconda orfanage

Building from source

If you want to build it from source, we recommend cloning the git repository as shown below.

$ git clone https://github.com/alevar/ORFanage.git --recursive
$ cd ORFanage
$ cmake -DCMAKE_BUILD_TYPE=Release -G "Unix Makefiles" .
$ make -j4

For a fully static build -DORFANAGE_STATIC_BUILD=1 needs to be added to the list of arguments in the cmake command.

By default, make install will likely require administrative privileges. To specify a custom installation path, add -DCMAKE_INSTALL_PREFIX=<custom/installation/path> to the list of arguments in the cmake command.

If you are using a very old version of Git (< 1.6.5) the flag --recursive does not exist. In this case you need to clone the submodule separately (git submodule update --init --recursive).

Requirements

Operating System	GNU/Linux
Architecture	Intel/AMD platforms that support POPCNT
Compiler	GCC ≥ 4.9, Clang ≥ 3.8
Build system	CMake ≥ 3.2
Language support	C++14

Getting started

Usage: orfanage [OPTIONS] <templates>...

Arguments:

<templates> One or more GFF/GTF files with coding exons to be used as templates.

Options:

--query STRING Path to a GTF query file with transcripts to which CDSs are to be ported

--output STRING Basename for all output files generated by this software

--reference STRING Path to the reference genome file in FASTA format. This parameter is required when any of the following options is used: --cleanq, --cleant, --rescue, --pi.

--cleanq If enabled, ensures that all transcripts in the output file have valid start and stop codons. This option requires the --reference parameter.

--cleant If enabled, ensures that all ORFs in the reference annotations start with a valid start codon and end with the first available stop codon. This option requires the --reference parameter.

--rescue If enabled, attempts to rescue broken ORFs in the reference annotations. This option requires the --reference parameter.

--lpd INT Percent difference by length between the original and reference transcripts. If -1 (default) is set - the check will not be performed.

--ilpd INT Percent difference by length of bases in frame of the reference transcript. If -1 (default) is set - the check will not be performed.

--mlpd INT Percent difference by length of bases that are in both query and reference. If -1 (default) is set - the check will not be performed.

--minlen INT Minimum length of an open reading frame to consider for the analysis

--mode STRING Which CDS to report: ALL, LONGEST, LONGEST_MATCH, BEST. Default: LONGEST_MATCH

--stats STRING Output a separate file with stats for each query/template pair

--threads INT Number of threads to run in parallel

--use_id If enabled, only transcripts with the same gene ID from the query file will be used to form a bundle. In this mode the same template transcript may be used in several bundles if it overlaps transcripts with different gene_ids.

--non_aug If enabled, non-AUG start codons in reference transcripts will not be discarded and will be considered in overlapping query transcripts on equal grounds with the AUG start codon.

--keep_cds If enabled, any CDS already present in the query will be kept unmodified.

--pi INT Percent identity between the query and template sequences. This option requires --reference parameter to be set. If enabled - will run alignment between passing pairs.

--gapo INT Gap-open penalty

--gape INT Gap-extension penalty

Help options:

--help Prints this help message.
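
To make concrete what "a valid start codon" and "the first available stop codon" mean in the --cleanq and --cleant descriptions above, here is a minimal Python sketch of such a check on an already-spliced CDS sequence. It is only an illustration under stated assumptions: the function name is_clean_orf is hypothetical, it assumes the standard genetic code with an ATG start, and it does not reproduce ORFanage's own implementation.

```python
# Illustrative sketch only; NOT ORFanage's implementation.
# Assumption: standard genetic code, ATG start, input is the spliced CDS (stop codon included).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_clean_orf(cds: str) -> bool:
    """Return True if `cds` starts with ATG, ends with a stop codon,
    has a length divisible by 3, and contains no earlier in-frame stop."""
    seq = cds.upper()
    # Need at least a start and a stop codon, and a whole number of codons.
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] != "ATG":
        return False
    if codons[-1] not in STOP_CODONS:
        return False
    # The terminal stop must be the first in-frame stop codon.
    return not any(codon in STOP_CODONS for codon in codons[:-1])


print(is_clean_orf("ATGGCCTAA"))
print(is_clean_orf("ATGTAAGCCTGA"))
```

The first call passes all four checks; the second fails because an in-frame stop codon occurs before the end of the sequence.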

Data

Sample datasets are provided in the "example" directory to test and get familiar with ORFanage. The included example can be run with the following base command:

orfanage --reference <path/to/grch38.fa> --output example/output.gtf --query example/query.gtf <--additional arguments> --stats example/stats.tsv example/template.gtf
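
For scripted runs, the base command above can be assembled programmatically. The following Python sketch is one possible wrapper, not part of ORFanage: the helper names build_orfanage_cmd and run_orfanage are hypothetical, and only flags listed under "Getting started" are used. It assumes the orfanage executable is on PATH when run_orfanage is called.

```python
import shutil
import subprocess

# Values accepted by --mode, as listed in the help text.
VALID_MODES = {"ALL", "LONGEST", "LONGEST_MATCH", "BEST"}


def build_orfanage_cmd(query, output, templates, reference=None,
                       stats=None, mode=None, threads=None):
    """Assemble an orfanage argument list (hypothetical helper)."""
    if not templates:
        raise ValueError("at least one template GFF/GTF file is required")
    if mode is not None and mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")
    cmd = ["orfanage", "--query", query, "--output", output]
    if reference is not None:
        cmd += ["--reference", reference]
    if stats is not None:
        cmd += ["--stats", stats]
    if mode is not None:
        cmd += ["--mode", mode]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    # Template annotations are positional arguments and go last.
    cmd += list(templates)
    return cmd


def run_orfanage(cmd):
    """Run the assembled command; fails early if orfanage is not installed."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("orfanage executable not found on PATH")
    subprocess.run(cmd, check=True)


cmd = build_orfanage_cmd(
    query="example/query.gtf",
    output="example/output.gtf",
    templates=["example/template.gtf"],
    stats="example/stats.tsv",
)
print(" ".join(cmd))
```

Calling run_orfanage(cmd) would then execute the command. Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.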
