Add new taxa to HaMStR

To add a new taxon to HaMStR, you need to follow its naming schema ([Species acronym]@[NCBI ID]@[Proteome version]) and place the necessary files in the correct folders:

genome_dir (Contains sub-directories for proteome fasta files for each species)
blast_dir (Contains sub-directories for BLAST databases made with makeblastdb out of your proteomes)
weight_dir (Contains feature annotation files for each proteome)
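The naming schema and folder layout above can be sketched in a few lines of Python. This is only an illustration, not code taken from HaMStR: the helper names taxon_name and expected_paths are hypothetical, and the layout assumes one sub-directory per taxon in genome_dir and blast_dir and one JSON annotation file per taxon in weight_dir.

```python
from pathlib import Path


def taxon_name(abbr, tax_id, version=1):
    """Build a name following [Species acronym]@[NCBI ID]@[Proteome version]."""
    return f"{abbr}@{tax_id}@{version}"


def expected_paths(hamstr_root, abbr, tax_id, version=1):
    """Return where the files of one taxon are expected (assumed layout)."""
    name = taxon_name(abbr, tax_id, version)
    root = Path(hamstr_root)
    return {
        "genome_dir": root / "genome_dir" / name,         # proteome FASTA folder
        "blast_dir": root / "blast_dir" / name,           # BLAST database folder
        "weight_dir": root / "weight_dir" / f"{name}.json",  # annotation file
    }


if __name__ == "__main__":
    for key, path in expected_paths("/path/to/your/HaMStR", "HUMAN", 9606, 3).items():
        print(key, path)
```

Printing the expected paths before copying files by hand is a quick way to confirm that a taxon name is spelled consistently across the three folders.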

To simplify this process, we provide two Python scripts: bin/addTaxonHamstr.py (for a single taxon) and bin/addTaxaHamstr.py (for a list of taxa).

Note: please read the More section before using these scripts.

Adding a single taxon

For this, you can use the bin/addTaxonHamstr.py script:

python3 bin/addTaxonHamstr.py -f your_genome.fa -n abbr_tax_name -I tax_id -o /path/to/your/HaMStR -c

This adds a new folder named abbr_tax_name@tax_id@1, with the corresponding content, to both genome_dir and blast_dir, and an annotation file abbr_tax_name@tax_id@1.json to weight_dir.

The header of each sequence in the new FASTA file, i.e. the sequence ID, is the first word of the original FASTA header. Everything after the first whitespace is removed. If the same first word occurs in more than one sequence, an increasing index is added so that all sequence IDs in the new FASTA file are unique.
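The header handling described above can be approximated as follows. This is a minimal sketch rather than the script's actual code: the function name unique_ids is hypothetical, and the "_N" suffix format is an assumption, since the exact form of the added index is not specified here.

```python
def unique_ids(headers):
    """Reduce FASTA headers to their first word and make duplicates unique.

    Assumption: duplicates receive an "_<index>" suffix; the real script
    may format the index differently. Headers are assumed to be non-empty.
    """
    used = set()      # IDs already emitted
    counters = {}     # next index to try for each base ID
    result = []
    for header in headers:
        base = header.lstrip(">").split()[0]  # keep text before first whitespace
        new_id = base
        while new_id in used:                 # increase the index until unique
            counters[base] = counters.get(base, 0) + 1
            new_id = f"{base}_{counters[base]}"
        used.add(new_id)
        result.append(new_id)
    return result


if __name__ == "__main__":
    print(unique_ids([">sp|P1 first protein", ">sp|P1 second protein", ">x y"]))
```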

Adding a list of taxa

In most cases, you will need to add more than one taxon to HaMStR. For this purpose, use the bin/addTaxaHamstr.py script:

python3 bin/addTaxaHamstr.py -i /path/to/taxa/fasta -m mapping_file -o /path/to/your/HaMStR -c

/path/to/taxa/fasta is a folder that contains the FASTA files of all new taxa. mapping_file is a tab-delimited text file in which you provide the taxonomy ID that belongs to each FASTA file:

#filename	tax_id	abbr_tax_name	version
filename1	12345678
filename2	9606
filename3	4932	my_fungi
...

The first line (starting with #) is optional. The last two columns (abbr. taxon name and genome version) are also optional. If you want to specify a version for a genome, you must also define the abbr. taxon name, so that the genome version is always in the 4th column of the mapping file.
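To make the column rules concrete, here is a small sketch of how such a mapping file could be read. It is not the parser used by bin/addTaxaHamstr.py: the function name parse_mapping is hypothetical, and the default version of 1 for rows without a 4th column is an assumption based on the @1 suffix shown elsewhere on this page.

```python
def parse_mapping(lines):
    """Parse tab-delimited mapping lines into a list of records.

    Columns: filename, tax_id, abbr_tax_name (optional), version (optional).
    Assumption: a missing version defaults to "1".
    """
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the optional header line
        fields = line.split("\t")
        fields += [""] * (4 - len(fields))  # pad missing optional columns
        filename, tax_id, abbr, version = (f.strip() for f in fields[:4])
        records.append({
            "filename": filename,
            "tax_id": tax_id,
            "abbr": abbr or None,       # None means "suggest a name later"
            "version": version or "1",
        })
    return records


if __name__ == "__main__":
    example = [
        "#filename\ttax_id\tabbr_tax_name\tversion",
        "filename1\t12345678",
        "filename3\t4932\tmy_fungi",
    ]
    for record in parse_mapping(example):
        print(record)
```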

If the abbr. taxon name is not given, it is suggested automatically from the NCBI taxon name of the corresponding ID (e.g. the abbr. taxon name for Homo sapiens will be HOMSA). If the given ID does not exist in the NCBI taxonomy database, the abbr. taxon name will be UNK+taxid (e.g. UNK12345678).
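The fallback behaviour can be illustrated with the sketch below. The abbreviation rule used here (first three letters of the genus plus first two letters of the species epithet, upper-cased) is only inferred from the single HOMSA example and may differ from what the script really does. The function name suggest_abbr and the names dictionary, which stands in for an NCBI taxonomy lookup, are hypothetical.

```python
def suggest_abbr(tax_id, names):
    """Suggest an abbreviated taxon name for a taxonomy ID.

    `names` maps taxonomy IDs to scientific names and stands in for the
    NCBI taxonomy database (assumption for this sketch).
    """
    words = (names.get(tax_id) or "").split()
    if not words:
        return f"UNK{tax_id}"  # ID not found: UNK + taxid
    if len(words) >= 2:
        # Assumed rule: 3 letters of genus + 2 letters of species epithet
        return (words[0][:3] + words[1][:2]).upper()
    return words[0][:5].upper()  # assumed handling of single-word names


if __name__ == "__main__":
    lookup = {9606: "Homo sapiens"}
    print(suggest_abbr(9606, lookup))
    print(suggest_abbr(12345678, lookup))
```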

The script checks whether the combination abbr_tax_name@tax_id@version already exists in /path/to/your/HaMStR/genome_dir. If it does, the script gives an error message like the one below, and the conflict needs to be resolved before continuing:

These taxa are probably already present in /Users/vinh/bionf/HaMStR/genome_dir:
	filename.fa	HUMAN@9606@3
	filename1.fa	UNK12345678@12345678@1
Please remove them from the mapping file or use different Name/ID/Version!
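If you want to find such conflicts before running the script, a check along the following lines could be used. This is a rough sketch, not the script's own logic: the function name find_existing is hypothetical, and it assumes that a taxon counts as present when a folder with its full name exists in genome_dir.

```python
from pathlib import Path


def find_existing(hamstr_root, taxa):
    """Return the (filename, taxon_name) pairs already present in genome_dir.

    `taxa` is a list of (filename, "abbr@tax_id@version") tuples.
    """
    genome_dir = Path(hamstr_root) / "genome_dir"
    return [
        (filename, name)
        for filename, name in taxa
        if (genome_dir / name).exists()
    ]


if __name__ == "__main__":
    planned = [
        ("filename.fa", "HUMAN@9606@3"),
        ("filename1.fa", "UNK12345678@12345678@1"),
    ]
    for filename, name in find_existing("/path/to/your/HaMStR", planned):
        print(f"{filename}\t{name}")
```

Any pair this prints would have to be removed from the mapping file or given a different name, ID, or version.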