nex2tbl is an R tool that helps with the submission of protein-coding DNA sequences to GenBank. Such sequences are commonly submitted through the BankIt portal, which asks for a Feature Table File (*.tbl file) when the user uploads multiple records. Preparing the tbl file manually can be laborious, especially if the sequences include multiple introns or start at different codon positions. nex2tbl takes aligned sequences and creates a minimal tbl file with 2 feature keys (gene and CDS) and 5 qualifiers (gene, product, codon_start, transl_table, and partial, written as < and >) that together are enough for GenBank to correctly translate the DNA into amino acids.
- Make sure that the ape and plyr packages are installed in your R environment.
- Load the script.
source("https://raw.githubusercontent.com/Mycology-Microbiology-Center/nex2tbl/main/nex2tbl.R")
- Specify input and output file names, as well as user-defined variables. Example:
nex2tbl(
INPUT_NEX = "exons-introns_CODON_START-2_RPB1.nex",
OUTPUT_TBL = "exons-introns_CODON_START-2_RPB1.tbl",
GENE = "rpb1",
PRODUCT = "RNA polymerase II largest subunit",
CODON_START = 2,
TRANSL_TABLE = 1,
FULL_GENE = FALSE
)
- Execute the script, and the resulting tbl file will appear in your working directory.
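
The package prerequisite from the first step can be checked before sourcing the script. The following is a minimal sketch using only base R; missing_packages is a hypothetical helper introduced here for illustration and is not part of nex2tbl.

```r
# Hypothetical helper (not part of nex2tbl):
# return the names of the packages that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("ape", "plyr"))
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```

The install call is left commented out so that nothing is downloaded unless you opt in.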
The input for the tool is an alignment of the sequences of one gene to be submitted, in the Nexus format (*.nex). Intron positions should be specified at the end of the file as column spans in a single charset called intron, like this:
BEGIN SETS;
charset intron = 202-256 394-451;
END;
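
To see how the intron charset partitions the alignment, the sketch below parses such a line and derives the exon spans as the remaining columns. It is an illustration in base R, not the parser used by nex2tbl: parse_intron_charset and exon_spans are hypothetical helpers, the alignment length of 600 is made up, and the code assumes that spans are written as start-end, sorted, and non-overlapping.

```r
# Hypothetical helper (not part of nex2tbl):
# parse an intron charset line into start/end alignment columns.
parse_intron_charset <- function(line) {
  # Drop the trailing ";" and everything up to and including "="
  spans <- sub("^.*=\\s*", "", sub(";\\s*$", "", trimws(line)))
  parts <- strsplit(strsplit(spans, "\\s+")[[1]], "-", fixed = TRUE)
  data.frame(
    start = as.integer(vapply(parts, `[`, character(1), 1)),
    end   = as.integer(vapply(parts, `[`, character(1), 2))
  )
}

# Exons are the alignment columns not covered by any intron span.
exon_spans <- function(introns, alignment_length) {
  starts <- c(1L, introns$end + 1L)
  ends   <- c(introns$start - 1L, as.integer(alignment_length))
  keep   <- starts <= ends  # drops empty spans, e.g. an intron at column 1
  data.frame(start = starts[keep], end = ends[keep])
}

introns <- parse_intron_charset("charset intron = 202-256 394-451;")
exons   <- exon_spans(introns, alignment_length = 600)  # 600 is an assumed length
print(exons)  # exon spans: 1-201, 257-393, 452-600
```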
In addition, the user must specify the following variables:
- GENE - gene name, e.g., "rpb1".
- PRODUCT - name of the produced protein, e.g., "RNA polymerase II largest subunit".
- CODON_START - the offset at which the first complete codon of the coding region is found in the alignment. It is specified relative to the first column of the first exon (which is not necessarily at the beginning of the alignment!) and can only take the values 1, 2, or 3. In the example below, the first complete codon (TCC, in green) starts in the 3rd column of the first exon, therefore CODON_START is 3. To define this variable, the user must know the coding frame of the alignment beforehand.
- TRANSL_TABLE - the genetic code table to use; the default is 1, the universal genetic code table.
- FULL_GENE - can be FALSE or TRUE, depending on whether the sequence covers the whole coding region of a protein. Usually this is not the case, and then the locations of the first and last regions (assumed to be incomplete) are marked with < and > before the numbers. If TRUE, GenBank expects CODON_START to be 1.
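
The arithmetic behind CODON_START can be sketched for a sequence that begins part-way into the first exon. This is an illustration only and not necessarily how nex2tbl computes the value: first_codon_offset is a hypothetical helper, and it assumes that the sequence starts inside the first exon and that there are no frame-breaking gaps before its first complete codon.

```r
# Hypothetical helper (not part of nex2tbl).
# first_col:   column of the first exon (1-based) where the sequence's first base sits
# codon_start: alignment-level CODON_START (1, 2, or 3)
# Returns the codon_start value (1, 2, or 3) for that individual sequence.
first_codon_offset <- function(first_col, codon_start) {
  stopifnot(codon_start %in% 1:3, first_col >= 1)
  # Complete codons begin at columns codon_start, codon_start + 3, ...
  ((codon_start - first_col) %% 3) + 1
}

first_codon_offset(first_col = 1, codon_start = 3)  # 3: two leading bases are skipped
first_codon_offset(first_col = 3, codon_start = 3)  # 1: sequence begins on a codon boundary
first_codon_offset(first_col = 4, codon_start = 3)  # 3: next complete codon begins at column 6
```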
>Features seq4
<1 >2119 gene
gene rpb2
<1 74 CDS
128 1087
1144 >2119
product RNA polymerase II second largest subunit
codon_start 3
transl_table 1
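
The record above can be reproduced from its exon coordinates with a few lines of base R, which also shows where the < and > markers come from when FULL_GENE is FALSE. This is a sketch, not the code used by nex2tbl: make_tbl_record is a hypothetical helper, and the tab layout assumes GenBank's tab-delimited feature table format (locations in the first two columns, the feature key in the third, qualifiers after three leading tabs).

```r
# Hypothetical helper (not part of nex2tbl): build feature table lines for one record.
# exons: data frame with start/end columns in sequence (ungapped) coordinates.
make_tbl_record <- function(seq_id, exons, gene, product,
                            codon_start = 1, transl_table = 1, full_gene = FALSE) {
  n <- nrow(exons)
  starts <- as.character(as.integer(exons$start))
  ends   <- as.character(as.integer(exons$end))
  if (!full_gene) {
    # Mark the first and last regions as incomplete
    starts[1] <- paste0("<", starts[1])
    ends[n]   <- paste0(">", ends[n])
  }
  qualifier <- function(key, value) paste0("\t\t\t", key, "\t", value)
  c(
    paste0(">Features ", seq_id),
    paste(starts[1], ends[n], "gene", sep = "\t"),
    qualifier("gene", gene),
    paste(starts[1], ends[1], "CDS", sep = "\t"),
    if (n > 1) paste(starts[-1], ends[-1], sep = "\t"),
    qualifier("product", product),
    qualifier("codon_start", codon_start),
    qualifier("transl_table", transl_table)
  )
}

exons <- data.frame(start = c(1, 128, 1144), end = c(74, 1087, 2119))
rec <- make_tbl_record("seq4", exons, gene = "rpb2",
                       product = "RNA polymerase II second largest subunit",
                       codon_start = 3)
writeLines(rec)
```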
- In exons, the length of each gap must be a multiple of three (e.g. ---), or else the reading frame will be broken and the output will be wrong.
- Intron-only sequences are not supported: if they are present in the alignment, warnings will be shown and such sequences will be absent from the tbl.
- If charsets are not specified, the whole alignment will be treated as a single exon.
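
The gap rule in the first note can be checked before running the tool. The sketch below is a hypothetical base R check, not part of nex2tbl: it takes the exon columns of one aligned sequence as a single string and assumes that leading and trailing gaps (a sequence shorter than the alignment) should be ignored, so that only internal gaps are tested.

```r
# Hypothetical helper (not part of nex2tbl): lengths of internal gap runs
# in the exon part of one aligned sequence.
internal_gap_lengths <- function(exon_seq) {
  # Assumption: terminal gaps only mean the sequence is shorter, so strip them
  trimmed <- sub("^-+", "", sub("-+$", "", exon_seq))
  runs <- regmatches(trimmed, gregexpr("-+", trimmed))[[1]]
  nchar(runs)
}

# TRUE when every internal gap run has a length that is a multiple of three.
frame_safe <- function(exon_seq) {
  all(internal_gap_lengths(exon_seq) %% 3 == 0)
}

frame_safe("ATG---GCC")   # TRUE: one gap of length 3
frame_safe("ATG--GCC")    # FALSE: a gap of length 2 breaks the reading frame
frame_safe("--ATGGCC-")   # TRUE: only terminal gaps, which are ignored here
```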
- Code: Vladimir Mikryukov
- Idea: Anton Savchenko and Iryna Yatsiuk