Add new taxa to HaMStR
To add a new taxon to HaMStR-oneSeq, you need to follow its naming schema ([Species acronym]@[NCBI ID]@[Proteome version]) and place the necessary files in the correct folders:
- genome_dir (Contains sub-directories for proteome fasta files for each species)
- blast_dir (Contains sub-directories for BLAST databases made with makeblastdb out of your proteomes)
- weight_dir (Contains feature annotation files for each proteome)
We simplify this process by providing two functions, addTaxon1s and addTaxa1s.
Note: before using, please read the More section.
To add a single taxon, use the addTaxon1s function:
addTaxon1s -f your_genome.fa -i tax_id -c [-o /output/directory] [-n abbr_tax_name]
If the abbreviated taxon name is not given with the option -n abbr_tax_name, it is suggested automatically from the NCBI taxon name of the corresponding ID (e.g. the abbreviated taxon name for Homo sapiens is HOMSA). If the given ID does not exist in the NCBI taxonomy database, the abbreviated taxon name will be UNK+taxid (e.g. UNK12345678).
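The naming rule above can be sketched in Python. This is only an illustration: suggest_abbr_name is a hypothetical helper, not part of HaMStR, and the abbreviation scheme (first three letters of the genus plus first two letters of the species epithet) is an assumption inferred from the single HOMSA example. The real tool may derive names differently.

```python
def suggest_abbr_name(tax_id, ncbi_name=None):
    """Hypothetical helper: suggest an abbreviated taxon name.

    Assumption: 3 letters of the genus + 2 letters of the species epithet,
    upper-cased. This reproduces HOMSA for "Homo sapiens".
    """
    if not ncbi_name:
        # ID not found in the NCBI taxonomy: fall back to UNK + taxid
        return f"UNK{tax_id}"
    parts = ncbi_name.split()
    if len(parts) < 2:
        # Assumption: single-word names are cut to their first 5 letters
        return parts[0][:5].upper()
    return (parts[0][:3] + parts[1][:2]).upper()


print(suggest_abbr_name(9606, "Homo sapiens"))
print(suggest_abbr_name(12345678))
```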
The script adds a new folder named abbr_tax_name@tax_id@1, with the corresponding content, to genome_dir and blast_dir, and an annotation file abbr_tax_name@tax_id@1.json to weight_dir. These 3 folders will be saved in /output/directory. If no output directory is specified, the new taxon is added to the directory that holds the pre-calculated data.
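As a rough sketch of the resulting layout, the following Python snippet builds the locations described above for one taxon. The function taxon_paths and its dictionary keys are hypothetical names chosen for illustration. Only the folder names, the name@id@version schema, and the .json annotation file come from the description.

```python
from pathlib import Path


def taxon_paths(out_dir, abbr_name, tax_id, version=1):
    """Hypothetical helper: where the data of one taxon is expected to live."""
    taxon = f"{abbr_name}@{tax_id}@{version}"
    root = Path(out_dir)
    return {
        "genome": root / "genome_dir" / taxon,                  # proteome FASTA folder
        "blast": root / "blast_dir" / taxon,                    # BLAST database folder
        "annotation": root / "weight_dir" / f"{taxon}.json",    # feature annotation file
    }


for kind, path in taxon_paths("/output/directory", "HOMSA", 9606).items():
    print(kind, path)
```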
The header of each new FASTA sequence, i.e. the sequence ID, will be the first word of the original FASTA header. Everything after the first whitespace will be removed. If the first word is duplicated between different sequences, an increasing index will be added to make sure that the sequence IDs of the new FASTA file are unique.
For example, a FASTA file before processing:
>EXR66326.1 biofilm-associated domain protein, partial [Acinetobacter baumannii 339786]
MTGEGPVAIHAEAVDAQGNVDVADADVTLTIDTTPQDLITAITVPEDLNGDGILNAAELGTDGSFNAQVALGPDAVDGTV
>EXR66351.1 hypothetical protein J700_4015, partial [Acinetobacter baumannii 339786]
NRRLLITTQPTATDSNYKTPIYINAPNGELYFANQDETSVSSVVFKRVIGATAANAPYVASDSWTKKIRKWNTYNHEVSK
...
and after processing (this is what your new sequence data will look like):
>EXR66326.1
MTGEGPVAIHAEAVDAQGNVDVADADVTLTIDTTPQDLITAITVPEDLNGDGILNAAELGTDGSFNAQVALGPDAVDGTV
>EXR66351.1
NRRLLITTQPTATDSNYKTPIYINAPNGELYFANQDETSVSSVVFKRVIGATAANAPYVASDSWTKKIRKWNTYNHEVSK
...
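The header clean-up shown in this example can be approximated with a few lines of Python. This is a minimal sketch, not the code used by addTaxon1s: clean_headers is a hypothetical function, and the "_1", "_2" suffix format for duplicated IDs is an assumption, since the text only states that an increasing index is added.

```python
def clean_headers(lines):
    """Keep only the first word of each FASTA header and make IDs unique."""
    used = set()    # IDs already written
    counts = {}     # next index to try per base ID
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            words = line[1:].split()
            base = words[0] if words else ""
            new_id = base
            # Assumed suffix format: base_1, base_2, ...
            while new_id in used:
                counts[base] = counts.get(base, 0) + 1
                new_id = f"{base}_{counts[base]}"
            used.add(new_id)
            out.append(">" + new_id)
        else:
            out.append(line)    # sequence lines are left untouched
    return out


example = [
    ">EXR66326.1 biofilm-associated domain protein, partial",
    "MTGEGPVAIHAEAVDAQGNVDVADADVTLTIDTTPQDLITAITVPEDLNGDGILNAAELG",
    ">EXR66326.1 duplicated ID for illustration",
    "NRRLLITTQPTATDSNYKTPIYINAPNGELYFANQDETSVSSVVFKRVIGATAANAPYVA",
]
print("\n".join(clean_headers(example)))
```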
In most cases, you will need to add more than one taxon to HaMStR. For this purpose, use the addTaxa1s function:
addTaxa1s -i /path/to/taxa/fasta -m mapping_file -c [-o /output/directory]
/path/to/taxa/fasta is a folder that contains the FASTA files of all new taxa. mapping_file is a tab-delimited text file that assigns a taxonomy ID to each FASTA file:
#filename tax_id abbr_tax_name version
filename.faa 9606
filename1.fa 12345678
filename2.fasta 4932 my_fungi
...
The header line (starting with #) is mandatory. The values of the last 2 columns (abbreviated taxon name and genome version) are optional. To specify a genome version, you must also define the abbreviated taxon name, so that the genome version is always in the 4th column of the mapping file.
If the abbreviated taxon name is not given, it is suggested automatically from the NCBI taxon name of the corresponding ID (e.g. the abbreviated taxon name for Homo sapiens is HOMSA). If the given ID does not exist in the NCBI taxonomy database, the abbreviated taxon name will be UNK+taxid (e.g. UNK12345678).
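To make the mapping file rules concrete, here is a small Python sketch that reads such a file and fills in the optional columns. It is an illustration under assumptions, not the parser used by the tool: read_mapping and the dictionary keys are hypothetical, and the default version "1" is assumed from the @1 suffix used when a single taxon is added.

```python
def read_mapping(lines):
    """Parse tab-delimited mapping lines into a list of dictionaries."""
    rows = [ln.rstrip("\n").split("\t") for ln in lines if ln.strip()]
    if not rows or not rows[0][0].startswith("#"):
        raise ValueError("mapping file must start with a '#' header line")
    entries = []
    for cols in rows[1:]:
        # Pad to 4 columns so the optional name and version may be absent
        cols = ([c.strip() for c in cols] + [""] * 4)[:4]
        filename, tax_id, name, version = cols
        if not filename or not tax_id:
            raise ValueError(f"filename and tax_id are required: {cols}")
        entries.append({
            "filename": filename,
            "tax_id": tax_id,
            "abbr_name": name or None,     # None: name still has to be suggested
            "version": version or "1",     # assumed default version
        })
    return entries


sample = [
    "#filename\ttax_id\tabbr_tax_name\tversion\n",
    "filename.faa\t9606\n",
    "filename2.fasta\t4932\tmy_fungi\n",
]
for entry in read_mapping(sample):
    print(entry)
```

Reading from a real file would be `read_mapping(open("mapping_file"))`, since the function only needs an iterable of lines.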
The script checks whether the combination abbr_tax_name@tax_id@version already exists in /output/directory/genome_dir. If it does, the script prints an error message like the one below, and the conflict must be resolved before continuing:
These taxa are probably already present in /Users/vinh/bionf/HaMStR/genome_dir:
filename.faa HUMAN@9606@3
filename1.fa UNK12345678@12345678@1
Please remove them from the mapping file or use different Name/ID/Version!
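A check of this kind can be sketched as follows, assuming that a taxon counts as present when a folder with its full name exists in genome_dir. The function find_existing is a hypothetical name, and the real script may test for presence differently.

```python
from pathlib import Path


def find_existing(genome_dir, taxa):
    """Return the (filename, taxon) pairs whose folder already exists.

    taxa is a list of (filename, "name@tax_id@version") tuples.
    """
    genome_dir = Path(genome_dir)
    return [(fn, taxon) for fn, taxon in taxa if (genome_dir / taxon).exists()]


clashes = find_existing(
    "/output/directory/genome_dir",
    [("filename.faa", "HUMAN@9606@3"), ("filename1.fa", "UNK12345678@12345678@1")],
)
for filename, taxon in clashes:
    print(f"{filename}\t{taxon}")
```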
These functions require makeblastdb to create the BLAST databases for the input gene sets. Please install it if it is missing.
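One quick way to verify that makeblastdb is reachable before adding taxa is to look it up on the PATH. The snippet below is a minimal sketch using only the Python standard library, and tool_available is a hypothetical helper name.

```python
import shutil


def tool_available(name):
    """Return True if an executable called `name` is found on the PATH."""
    return shutil.which(name) is not None


if tool_available("makeblastdb"):
    print("makeblastdb found")
else:
    print("makeblastdb is missing, please install NCBI BLAST+")
```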
For more information about the 2 Python functions, please read their help menus:
addTaxon1s -h
or
addTaxa1s -h