Add new taxa to HaMStR

To add a new taxon to HaMStR-oneSeq, you need to follow its naming schema ([Species acronym]@[NCBI ID]@[Proteome version]) and place the necessary files in the correct folders:

genome_dir (Contains sub-directories for proteome fasta files for each species)
blast_dir (Contains sub-directories for BLAST databases made with makeblastdb out of your proteomes)
weight_dir (Contains feature annotation files for each proteome)
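
As an illustration of the naming schema above, here is a minimal Python sketch that splits and rebuilds a taxon name. The helper names (TaxonName, parse_taxon_name, format_taxon_name) are hypothetical and not part of HaMStR. The sketch assumes that the NCBI ID is numeric and treats the version as free text.

```python
from typing import NamedTuple


class TaxonName(NamedTuple):
    """Hypothetical container for the three parts of a taxon name."""
    acronym: str
    ncbi_id: int
    version: str  # kept as text; assumption: versions need not be numeric


def parse_taxon_name(name: str) -> TaxonName:
    """Split '[Species acronym]@[NCBI ID]@[Proteome version]' into its parts."""
    parts = name.split("@")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected acronym@ncbi_id@version, got {name!r}")
    acronym, ncbi_id, version = parts
    if not ncbi_id.isdigit():
        raise ValueError(f"NCBI ID must be numeric, got {ncbi_id!r}")
    return TaxonName(acronym, int(ncbi_id), version)


def format_taxon_name(taxon: TaxonName) -> str:
    """Rebuild the folder name from its three parts."""
    return f"{taxon.acronym}@{taxon.ncbi_id}@{taxon.version}"
```

For example, parse_taxon_name("HUMAN@9606@3") yields the acronym HUMAN, the ID 9606 and the version 3.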

To simplify this process, we provide two functions: addTaxon1s and addTaxa1s.

Note: before using, please read the More section.

Adding a single taxon

For this, you can use the addTaxon1s function:

addTaxon1s -f your_genome.fa -i tax_id -c [-o /output/directory] [-n abbr_tax_name]

If the abbr. taxon name is not given with the option -n abbr_tax_name, it will be suggested automatically from the NCBI taxon name of the corresponding ID (e.g. the abbr. taxon name for Homo sapiens will be HOMSA). If the given ID does not exist in the NCBI taxonomy database, the abbr. taxon name will be UNK+taxid (e.g. UNK12345678).
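
The fallback logic can be sketched in Python as follows. This is only an illustration: the function suggest_abbr is hypothetical, and the abbreviation rule (first three letters of the genus plus first two letters of the species) is an assumption inferred from the single HOMSA example, not the documented behaviour of the script. The NCBI lookup itself is left out; the caller passes the taxon name, or None if the ID was not found.

```python
from typing import Optional


def suggest_abbr(tax_id: int, ncbi_name: Optional[str]) -> str:
    """Suggest an abbreviated taxon name (hypothetical helper).

    ncbi_name is the NCBI taxon name for tax_id, or None when the ID
    is not present in the NCBI taxonomy database.
    """
    if not ncbi_name or not ncbi_name.split():
        # Unknown ID: fall back to UNK + taxid, as described above.
        return f"UNK{tax_id}"
    words = ncbi_name.split()
    if len(words) == 1:
        # Assumption: single-word names are cut to five letters.
        return words[0][:5].upper()
    # Assumption: three letters of the genus + two of the species.
    return (words[0][:3] + words[1][:2]).upper()
```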

The script will add a new folder named abbr_tax_name@tax_id@1, together with the corresponding content, to genome_dir and blast_dir, as well as an annotation file abbr_tax_name@tax_id@1.json to weight_dir. These 3 folders will be saved in /output/directory. If no output directory is specified, the new taxon will be added to the directory that holds the pre-calculated data.

The header of each new FASTA sequence, i.e. the sequence ID, will be the first word of the original FASTA header. Everything after the first whitespace will be removed. If the same first word occurs in more than one sequence, an increasing index will be added to ensure that the sequence IDs in the new FASTA file are unique.

For example, a FASTA file before processing:

>EXR66326.1 biofilm-associated domain protein, partial [Acinetobacter baumannii 339786]
MTGEGPVAIHAEAVDAQGNVDVADADVTLTIDTTPQDLITAITVPEDLNGDGILNAAELGTDGSFNAQVALGPDAVDGTV
>EXR66351.1 hypothetical protein J700_4015, partial [Acinetobacter baumannii 339786]
NRRLLITTQPTATDSNYKTPIYINAPNGELYFANQDETSVSSVVFKRVIGATAANAPYVASDSWTKKIRKWNTYNHEVSK
...

and after (this is what your new sequence data will look like):

>EXR66326.1
MTGEGPVAIHAEAVDAQGNVDVADADVTLTIDTTPQDLITAITVPEDLNGDGILNAAELGTDGSFNAQVALGPDAVDGTV
>EXR66351.1
NRRLLITTQPTATDSNYKTPIYINAPNGELYFANQDETSVSSVVFKRVIGATAANAPYVASDSWTKKIRKWNTYNHEVSK
...
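
The header clean-up shown above can be sketched in Python. This is not the script's actual code: clean_fasta_headers is a hypothetical helper, and the "_1", "_2" suffix format for duplicated IDs is an assumption, since the text only states that an increasing index is added.

```python
def clean_fasta_headers(lines):
    """Keep only the first word of each FASTA header and make IDs unique.

    lines is an iterable of text lines; a list of cleaned lines is returned.
    """
    used = set()   # IDs already written to the output
    counts = {}    # per-ID counter for the next index to try
    cleaned = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith(">"):
            cleaned.append(line)  # sequence lines are kept unchanged
            continue
        fields = line[1:].split()
        base = fields[0] if fields else ""
        new_id = base
        # Assumed suffix format: append _1, _2, ... until the ID is unused.
        while new_id in used:
            counts[base] = counts.get(base, 0) + 1
            new_id = f"{base}_{counts[base]}"
        used.add(new_id)
        cleaned.append(">" + new_id)
    return cleaned
```

Applied to the example above, the header ">EXR66326.1 biofilm-associated domain protein, partial [Acinetobacter baumannii 339786]" becomes ">EXR66326.1".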

Adding a list of taxa

In most cases, you will need to add more than one taxon to HaMStR. For this purpose, the addTaxaHamstr function can be used:

addTaxaHamstr -i /path/to/taxa/fasta -m mapping_file -c [-o /output/directory]

/path/to/taxa/fasta is a folder that contains the FASTA files of all new taxa. mapping_file is a tab-delimited text file in which you provide the taxonomy ID that belongs to each FASTA file:

#filename	tax_id	abbr_tax_name	version
filename.faa	9606
filename1.fa	12345678
filename2.fasta	4932	my_fungi
...

The header line (starting with #) is mandatory. The values in the last 2 columns (abbr. taxon name and genome version) are optional. However, if you want to specify a genome version, you must also provide the abbr. taxon name, so that the genome version is always in the 4th column of the mapping file.
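
To make the column rules concrete, here is a minimal Python sketch of a reader for such a mapping file. The function read_mapping and its dictionary keys are hypothetical. The default version "1" is an assumption based on the @1 suffix used elsewhere on this page.

```python
import csv


def read_mapping(handle, default_version="1"):
    """Parse a tab-delimited mapping file into a list of dictionaries."""
    entries = []
    header_seen = False
    for row in csv.reader(handle, delimiter="\t"):
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # skip empty lines
        if cells[0].startswith("#"):
            header_seen = True  # the mandatory header line
            continue
        if not header_seen:
            raise ValueError("mapping file must start with a '#' header line")
        if len(cells) < 2 or not cells[0] or not cells[1]:
            raise ValueError(f"filename and tax_id are required: {row!r}")
        cells += [""] * (4 - len(cells))  # pad the optional columns
        entries.append({
            "filename": cells[0],
            "tax_id": cells[1],
            "abbr": cells[2] or None,  # None: to be suggested from NCBI
            "version": cells[3] or default_version,  # assumed default
        })
    return entries
```

A file would be read with something like read_mapping(open("mapping_file", newline="")).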

If the abbr. taxon name is not given, it will be suggested automatically from the NCBI taxon name of the corresponding ID (e.g. the abbr. taxon name for Homo sapiens will be HOMSA). If the given ID does not exist in the NCBI taxonomy database, the abbr. taxon name will be UNK+taxid (e.g. UNK12345678).

The script checks whether the combination abbr_tax_name@tax_id@version already exists in /output/directory/genome_dir. If it does, the script reports an error, which must be resolved before continuing:

These taxa are probably already present in /Users/vinh/bionf/HaMStR/genome_dir:
	filename.faa	HUMAN@9606@3
	filename1.fa	UNK12345678@12345678@1
Please remove them from the mapping file or use different Name/ID/Version!
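
A check of this kind can be sketched in Python as follows. The function find_existing_taxa is hypothetical and only shows the idea of looking for an existing abbr_tax_name@tax_id@version folder in genome_dir. It is not the script's own implementation.

```python
from pathlib import Path


def find_existing_taxa(genome_dir, taxon_names):
    """Return the taxon names that already have a folder in genome_dir."""
    root = Path(genome_dir)
    return [name for name in taxon_names if (root / name).is_dir()]
```

If the returned list is not empty, the corresponding entries would have to be removed from the mapping file or given a different name, ID or version.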