Dr. Shan Zhang (link)1 and Dr. Weizhi Song (link)2
1 Department of Pharmacology and Pharmacy, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong
2 Department of Ocean Science, Hong Kong University of Science and Technology, Hong Kong
Email: shanbio@hku.hk, ocessongwz@ust.hk
-
The input genomes to MetaCHIP2 must be in GenBank format. If your genomes are currently in FASTA format, you'll need to perform an initial annotation step before feeding them to MetaCHIP2. You can use the `MetaCHIP2 prokka` module to batch generate the .gbk files for your input genomes. This pre-annotation strategy could:
- bypass the need for repeated genome annotation when exploring MetaCHIP2 parameters, thereby reducing computational time,
- minimize the introduction of variations from the annotation process itself, and thus
- ensure better comparability of predictions between independent MetaCHIP2 runs (on the same set of input genomes).
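Before the annotation step it can be useful to check contig ID lengths, since this README recommends keeping contig IDs shorter than 18 characters to prevent potential Prokka errors. The sketch below is a minimal example and not part of MetaCHIP2: the function names and the `fna` file extension are assumptions chosen for illustration.

```python
from pathlib import Path

MAX_ID_LEN = 17  # "shorter than 18 characters", per the Prokka note in this README


def long_contig_ids(fasta_path, max_len=MAX_ID_LEN):
    """Return the contig IDs in a FASTA file that are longer than max_len."""
    too_long = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # The contig ID is the first whitespace-delimited token of the header.
                fields = line[1:].split()
                contig_id = fields[0] if fields else ""
                if len(contig_id) > max_len:
                    too_long.append(contig_id)
    return too_long


def check_genome_folder(folder, extension="fna"):
    """Map each FASTA file name in folder to its list of over-long contig IDs."""
    report = {}
    for fasta in sorted(Path(folder).glob(f"*.{extension}")):
        bad = long_contig_ids(fasta)
        if bad:
            report[fasta.name] = bad
    return report
```

An empty dictionary from `check_genome_folder` means no contig ID exceeded the limit in any file with the given extension.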
-
Users now need to provide a species tree for the input genomes. Again, this avoids repeated tree inference, which in turn leads to more consistent and comparable predictions between separate MetaCHIP2 runs on the same set of input genomes. You can use the `MetaCHIP2 tree` module to infer the species tree. This module wraps GTDB-Tk's `identify`, `align`, and `infer` functionalities.
-
The inferred species tree must be rooted, as required by Ranger-DTL (one of MetaCHIP2's dependencies). If you use the `MetaCHIP2 tree` module for tree inference, the tree will be automatically rooted according to the GTDB taxonomy. If you obtain the species tree by other means, please make sure that it is properly rooted.
-
The `PI` and `BP` modules in MetaCHIP have now been merged into a single module called `detect` in MetaCHIP2.
-
You can now use `mmseqs linclust` (by specifying '-m' to the `MetaCHIP2 detect` module) to speed up the time-consuming all-vs-all blastn step in MetaCHIP2.
-
The output files are now organized in a more intuitive way, making them easier to understand.
-
A more detailed interpretation of the donor gene/genome, and of the often-observed low similarities between the donor and recipient genes, is now provided. Please see the "Output files" section below for details.
-
A changelog is here.
-
Python libraries: Biopython, NumPy, SciPy, Matplotlib and ETE3.
-
Third-party software: GTDB-Tk, BLAST+, MMseqs2, MAFFT, Ranger-DTL 2.0 and FastTree.
-
As MetaCHIP2 requires GTDB-Tk, we'll create a Conda environment with GTDB-Tk pre-installed. You'll need to set up the database files for GTDB-Tk as described in its manual.
conda create -n metachip2env -c conda-forge -c bioconda gtdbtk=2.7.1
conda activate metachip2env
pip install MetaCHIP2
conda install -c bioconda blast
conda install -c bioconda mafft
conda install -c bioconda mmseqs2
conda install -c bioconda diamond
conda install -c conda-forge r-base
conda install -c conda-forge legacy-cgi
-
Upgrade with:
pip3 install --upgrade MetaCHIP2
-
The input files for MetaCHIP2 include a folder that holds the gbk files of all query genomes, as well as a text file that provides the taxonomic classification (example) or a customized grouping (example) of the input genomes. The file extension (e.g., gbk) of the input genomes should NOT be included in the taxonomy or grouping file.
-
GTDB-Tk is recommended for taxonomic classification of input genomes. Only the first two columns (user_genome and classification) are needed.
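As an illustration of the two points above, the following sketch copies the first two columns of a tab-delimited GTDB-Tk summary file and removes a trailing file extension from the genome names. It is a minimal example under assumptions: `write_taxon_file` and the file names are hypothetical, and every line (including any header line) is passed through unchanged apart from the extension. Please compare the result against the linked example file before using it.

```python
def write_taxon_file(summary_tsv, out_tsv, extension="gbk"):
    """Keep columns one and two of a tab-delimited file and drop a trailing
    ".<extension>" from the genome name in column one."""
    suffix = "." + extension
    with open(summary_tsv) as src, open(out_tsv, "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            genome = fields[0]
            if genome.endswith(suffix):
                genome = genome[: -len(suffix)]
            dst.write(f"{genome}\t{fields[1]}\n")


# Hypothetical file names, for illustration only:
# write_taxon_file("gtdbtk_summary.tsv", "taxon.tsv")
```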
-
Input files for MetaCHIP2 must be in GenBank format. You can run `MetaCHIP2 prokka -h` to see how to batch generate the .gbk files for your input genomes. To prevent potential Prokka errors, please ensure that contig IDs remain shorter than 18 characters.
-
Users now need to provide a species tree for the input genomes. You can run `MetaCHIP2 tree -h` to see how to infer the species tree; this module wraps GTDB-Tk's `identify`, `align`, and `infer` functionalities. The inferred species tree must be rooted, as required by Ranger-DTL (one of MetaCHIP2's dependencies). If you use `MetaCHIP2 tree` for tree inference, the tree is automatically rooted according to the GTDB taxonomy. If all genomes on the species tree are from the same genus, `MetaCHIP2 tree` will root it at the midpoint. If you obtain the species tree by other means, please make sure that it is properly rooted.
-
Now you are ready to detect HGTs among your input genomes.
MetaCHIP2 detect -i gbk_dir -x gbk -c taxon.tsv -s rooted.tree -t 12 -f -o op_dir -r pcofg
-
You can use `mmseqs linclust` (by specifying '-m' to the `detect` module) to speed up the time-consuming all-vs-all blastn step.
MetaCHIP2 detect -i gbk_dir -x gbk -c taxon.tsv -s rooted.tree -t 12 -f -o op_dir -r pcofg -m
-
If you already have the all-vs-all blastn results on the same set of input genomes from a previous run, you can skip the blastn by providing the blastn results with '-b'.
MetaCHIP2 detect -i gbk_dir -x gbk -c taxon.tsv -s rooted.tree -t 12 -f -o op_dir -r p -b path/to/previous/run/blastn_op
-
The value of the '-r' argument can be any combination of d (domain), p (phylum), c (class), o (order), f (family), g (genus) and s (species):
MetaCHIP2 detect -i gbk_dir -x gbk -c taxon.tsv -s rooted.tree -t 12 -f -o op_dir -r pcofg
MetaCHIP2 detect -i gbk_dir -x gbk -c taxon.tsv -s rooted.tree -t 12 -f -o op_dir -r pco
MetaCHIP2 detect -i gbk_dir -x gbk -c taxon.tsv -s rooted.tree -t 12 -f -o op_dir -r ofg
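The `detect` commands in this section differ only in a few flags, so they can also be assembled programmatically. The sketch below is one possible way to do this, using only flags that appear in this README. `build_detect_cmd` and its parameter names are hypothetical, and the rank check only tests that each letter is one of d, p, c, o, f, g or s; whether MetaCHIP2 imposes further constraints on the rank string is not stated here.

```python
import shlex

VALID_RANKS = "dpcofgs"  # domain, phylum, class, order, family, genus, species


def build_detect_cmd(gbk_dir, taxon_file, tree, out_dir, ranks="pcofg",
                     threads=12, ext="gbk", use_mmseqs=False, blastn_dir=None):
    """Assemble a MetaCHIP2 detect command from flags shown in this README."""
    unknown = [r for r in ranks if r not in VALID_RANKS]
    if not ranks or unknown:
        raise ValueError(f"invalid rank string {ranks!r}; allowed letters: {VALID_RANKS}")
    # '-f' is included because every example command in this README uses it.
    cmd = ["MetaCHIP2", "detect", "-i", gbk_dir, "-x", ext, "-c", taxon_file,
           "-s", tree, "-t", str(threads), "-f", "-o", out_dir, "-r", ranks]
    if use_mmseqs:
        cmd.append("-m")  # use mmseqs linclust for the all-vs-all step
    if blastn_dir is not None:
        cmd += ["-b", blastn_dir]  # reuse blastn results from a previous run
    return cmd


print(shlex.join(build_detect_cmd("gbk_dir", "taxon.tsv", "rooted.tree", "op_dir")))
# prints: MetaCHIP2 detect -i gbk_dir -x gbk -c taxon.tsv -s rooted.tree -t 12 -f -o op_dir -r pcofg
```

The returned list can be passed to `subprocess.run(cmd, check=True)` inside the activated Conda environment.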
-
A tab-delimited text file (detected_HGTs.txt) containing all identified HGTs.
| Column | Description |
|---|---|
| Gene_1 [1] | The 1st gene involved in a HGT event |
| Gene_2 [1] | The 2nd gene involved in a HGT event |
| Identity [2] | Identity between Gene_1 and Gene_2 |
| Occurence(taxon_ranks) | Only for multiple-level detections. If you performed HGT detection at phylum, class and order levels, a number of "011" means current HGT was identified at class and order levels, but not phylum level. |
| End_match | End match or not (see examples below) |
| Full_length_match | Full length match or not (see examples below) |
| Direction [3] | The direction of gene flow. Number in parenthesis refers to the percentage of this direction being observed if this HGT was detected at multiple ranks and different directions were provided by Ranger-DTL. |

[1] The most accurate interpretation of the "donor gene" is "the gene from the donor group of your input genomes that exhibits the highest similarity to the recipient gene".
[2] A low similarity does not necessarily indicate it's an ancient gene transfer. Instead, it more likely reflects the absence of the exact donor organism (the organism that physically contributed the transferred gene) in the input genomes.
[3] Similar to the interpretation in [1], the donor genome is the genome within the donor group that contains the gene exhibiting the highest similarity to the recipient gene.
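The table described above can be read and filtered with a few lines of Python. The sketch below decodes the occurrence string (for example "011" for a run at phylum, class and order levels) and writes the HGTs detected at a chosen rank to a new file. It rests on assumptions that should be checked against your own output: that detected_HGTs.txt has a header row, and that the occurrence column name starts with "Occurence" (spelled as in the table). The function names are hypothetical, and the occurrence column is only expected for multiple-level detections.

```python
def decode_occurrence(flags, ranks):
    """Map each rank letter to True/False from a string such as "011"."""
    if len(flags) != len(ranks):
        raise ValueError("flag string and rank string differ in length")
    return {rank: flag == "1" for rank, flag in zip(ranks, flags)}


def subset_hgts(in_txt, out_txt, ranks, required_rank):
    """Copy the header and the rows detected at required_rank; return the row count."""
    kept = 0
    with open(in_txt) as src, open(out_txt, "w") as dst:
        header = src.readline()
        columns = header.rstrip("\n").split("\t")
        # Assumption: the occurrence column is named like "Occurence(pco)".
        occ_idx = next(i for i, c in enumerate(columns)
                       if c.lower().startswith("occurence"))
        dst.write(header)
        for line in src:
            if not line.strip():
                continue  # ignore blank lines
            fields = line.rstrip("\n").split("\t")
            if decode_occurrence(fields[occ_idx], ranks)[required_rank]:
                dst.write(line)
                kept += 1
    return kept


print(decode_occurrence("011", "pco"))
# prints: {'p': False, 'c': True, 'o': True}
```

The same pattern can be adapted to filter on other columns, such as Identity, when only a subset of the detected HGTs is of interest.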
-
Nucleotide and amino acid sequences of identified HGTs.
-
Flanking regions of identified HGTs. Genes encoded on the forward strand are displayed in light blue, and genes encoded on the reverse strand are displayed in light green. The names of genes predicted to be HGTs are highlighted in a large blue font, with the pairwise identity given in parentheses. Contig names are provided at the bottom left of the sequence tracks, and the numbers following the contig name refer to the distances between the gene subject to HGT and either the left or right end of the contig. Red bars show similarities of the matched regions between the contigs based on BLASTN results.

-
Gene flow between groups. Bands connect donors and recipients: the width of a band correlates with the number of HGTs, its colour corresponds to the donor, and its arrow points to the recipient.

If you want to visualize gene flow for a subset of the detected HGTs (e.g., HGTs belonging to a specific functional group), you can subset "detected_HGTs.txt" to keep only the HGTs of interest and run the `circos` module. The grouping file is in MetaCHIP2's output directory.
MetaCHIP2 circos -l detected_HGTs_subset.txt -g grouping.txt -o interested_HGT_circos_plot.pdf
-
Enrichment of COG functions in the detected HGTs (to produce a plot similar to Fig. 9 in the MetaCHIP paper)
MetaCHIP2 enrich -f -diamond -t 12 -faa faa_dir -o op_dir -db path/to/COG_db -hgt1 detected_HGTs.faa
MetaCHIP2 enrich -f -diamond -t 12 -faa faa_dir -o op_dir -db path/to/COG_db -hgt1 Setting1_HGTs.faa -hgt2 Setting2_HGTs.faa -label1 Setting1 -label2 Setting2