Add new taxa to HaMStR
To add a new taxon to HaMStR, you need to follow its naming schema ([Species acronym]@[NCBI ID]@[Proteome version]) and place the necessary files in the correct folders:
- genome_dir (Contains sub-directories for proteome fasta files for each species)
- blast_dir (Contains sub-directories for BLAST databases made with makeblastdb out of your proteomes)
- weight_dir (Contains feature annotation files for each proteome)
We simplify this process by providing 2 Python scripts: bin/addTaxonHamstr.py and bin/addTaxaHamstr.py.
Note: before using, please read the More section.
To add a single taxon, you can use the bin/addTaxonHamstr.py script:
python3 bin/addTaxonHamstr.py -f your_genome.fa -n abbr_tax_name -I tax_id -o /path/to/your/HaMStR -c
It will add a new folder named abbr_tax_name@tax_id@1 and the corresponding content to genome_dir and blast_dir, as well as an annotation file abbr_tax_name@tax_id@1.json to weight_dir.
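As a rough illustration of the naming schema and folder layout described above, the following sketch builds the taxon name and the three target locations. The function name taxon_paths and the default version of 1 are assumptions made for this example; it is not taken from the HaMStR scripts.

```python
from pathlib import Path


def taxon_paths(hamstr_root, abbr_tax_name, tax_id, version=1):
    """Build the taxon name and the locations described in the text.

    Hypothetical helper for illustration only; the real scripts may
    organise their paths differently.
    """
    # Naming schema: [Species acronym]@[NCBI ID]@[Proteome version]
    name = f"{abbr_tax_name}@{tax_id}@{version}"
    root = Path(hamstr_root)
    return {
        "name": name,
        "genome": root / "genome_dir" / name,
        "blast": root / "blast_dir" / name,
        "annotation": root / "weight_dir" / f"{name}.json",
    }


paths = taxon_paths("/path/to/your/HaMStR", "HOMSA", 9606)
print(paths["name"])  # HOMSA@9606@1
```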
The header of each sequence in the new FASTA file, i.e. the sequence ID, will be the first word of the original FASTA header. Everything after the first whitespace will be removed. If the same first word occurs in more than one sequence, an increasing index will be added so that all sequence IDs in the new FASTA file are unique.
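The header handling described above can be sketched as follows. This is only an approximation: the underscore separator, the starting index, and the function name unique_sequence_ids are assumptions, and the script's actual suffix format may differ.

```python
def unique_sequence_ids(headers):
    """Reduce FASTA headers to their first word and make the IDs unique.

    Illustrative sketch; the "_<index>" suffix is an assumed format.
    """
    used = set()
    counters = {}
    result = []
    for header in headers:
        # Drop the leading ">" and keep only the first word.
        words = header.lstrip(">").split()
        base = words[0] if words else "seq"  # "seq" is a hypothetical fallback
        candidate = base
        # Append an increasing index until the ID is not in use yet.
        while candidate in used:
            counters[base] = counters.get(base, 0) + 1
            candidate = f"{base}_{counters[base]}"
        used.add(candidate)
        result.append(candidate)
    return result


print(unique_sequence_ids([">sp|P1 first protein", ">sp|P1 second protein", ">sp|P2"]))
```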
In most cases, you will need to add more than one taxon to HaMStR. For this purpose, the bin/addTaxaHamstr.py script can be used:
python3 bin/addTaxaHamstr.py -i /path/to/taxa/fasta -m mapping_file -o /path/to/your/HaMStR -c
/path/to/taxa/fasta is a folder where the FASTA files of all new taxa can be found. mapping_file is a tab-delimited text file in which you provide the taxonomy IDs that belong to the FASTA files:
#filename tax_id abbr_tax_name version
filename1 12345678
filename2 9606
filename3 4932 my_fungi
...
The first line (starting with #) is optional. The last 2 columns (abbr. taxon name and genome version) are also optional. If you want to specify a new version for a genome, you also need to define the abbr. taxon name, so that the genome version is always in the 4th column of the mapping file.
If the abbr. taxon name is not given, it will be automatically suggested from the NCBI taxon name of the corresponding ID (e.g. the abbr. taxon name for Homo sapiens will be HOMSA). If the given ID does not exist in the NCBI taxonomy database, the abbr. taxon name will be UNK+taxid (e.g. UNK12345678).
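To make the column rules concrete, here is a minimal sketch of how such a mapping file could be read. It is not the parser used by bin/addTaxaHamstr.py: the function name parse_mapping, the default version "1", and the choice to skip every line starting with # are assumptions. A missing abbr. taxon name is left as None here, since deriving it from NCBI requires a taxonomy lookup.

```python
def parse_mapping(text, default_version="1"):
    """Parse the tab-delimited mapping file content into a list of dicts.

    Illustrative sketch only; columns are filename, tax_id,
    abbr_tax_name (optional) and version (optional).
    """
    entries = []
    for line in text.splitlines():
        # Skip blank lines and the optional header line starting with "#".
        if not line.strip() or line.startswith("#"):
            continue
        cols = [c.strip() for c in line.split("\t")]
        if len(cols) < 2:
            raise ValueError(f"Expected at least filename and tax_id: {line!r}")
        # Pad the optional columns so unpacking always works.
        cols += [""] * (4 - len(cols))
        filename, tax_id, name, version = cols[:4]
        entries.append({
            "filename": filename,
            "tax_id": tax_id,
            "name": name or None,  # None: name would be suggested from NCBI
            "version": version or default_version,
        })
    return entries


example = "#filename\ttax_id\tabbr_tax_name\tversion\nfilename1\t12345678\nfilename3\t4932\tmy_fungi\n"
for entry in parse_mapping(example):
    print(entry)
```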
The script will check whether the combination abbr_tax_name@tax_id@version already exists in /path/to/your/HaMStR/genome_dir. If it does, the script will give an error message, and the conflict needs to be resolved before continuing.
These taxa are probably already present in /Users/vinh/bionf/HaMStR/genome_dir:
filename.fa HUMAN@9606@3
filename1.fa UNK12345678@12345678@1
Please remove them from the mapping file or use different Name/ID/Version!
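If you want to check for such conflicts before running the script, a small sketch like the one below may help. It assumes that a taxon counts as present when a folder with its name exists in genome_dir; the function name find_existing_taxa is hypothetical and the script's own check may work differently.

```python
from pathlib import Path


def find_existing_taxa(hamstr_root, taxon_names):
    """Return the taxon names that already have a folder in genome_dir.

    Illustrative sketch; assumes presence is indicated by a
    sub-directory named abbr_tax_name@tax_id@version.
    """
    genome_dir = Path(hamstr_root) / "genome_dir"
    return [name for name in taxon_names if (genome_dir / name).is_dir()]


conflicts = find_existing_taxa(
    "/path/to/your/HaMStR",
    ["HUMAN@9606@3", "UNK12345678@12345678@1"],
)
if conflicts:
    print("Already present:", ", ".join(conflicts))
```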
These Python dependencies need to be installed (using python3 -m pip install library_name):
- biopython
- ete3
These tools are also required:
- makeblastdb
- greedyFAS v ≥ 1.1.1
For more information about the 2 Python scripts, please read their help menus:
python3 bin/addTaxonHamstr.py -h
or
python3 bin/addTaxaHamstr.py -h