A pipeline for metagenomic assembly and analysis, that can also be used on Galaxy
An overview of the pipeline is given below:
The report describing MetaGalaxy and some benchmarks can be found Here
- Requirements
- Installation
- Example commands
- Full usage
- Method
- Interpeting output
- Acknowledgements
- License
- Linux
- Python 3.6
- Java
- Anaconda3
Install dependecies:
sudo apt-get install -y build-essential git qtbase5-dev libqt5svg5-dev libboost-all-dev curl libncurses5-dev zlib1g-dev pkg-config libfreetype6-dev libpng-dev cmake gcc g++ libbz2-dev liblzma-dev autoconf python-setuptools
Install MetaGalaxy:
git clone https://github.com/mdcjansen/MetaGalaxy
cd path/to/MetaGalaxy
conda env create -f environment.yml
chmod a+x bin/MetaGalaxy.py
Install the tools:
cd lib
chmod a+x install.sh
./install.sh
MetaGalaxy does not provide Guppy with its distribution. Guppy can be downloaded and install from the Oxford Nanopore technologies website
Install the databases:
cd data
chmod a+x install.sh
./install_database.py
It is possible to create your own databases for the tool as long as the folder structure is the same as the default folder structure of the tool. Building the databases may take several hours depending on the system configuration
Make MetaGalaxy executable from within the anaconda environment:
ln -s path/to/Metagalaxy/bin/MetaGalaxy.py path/to/anaconda3/envs/metagalaxy/bin/MetaGalaxy
Some tools and depencecies used by MetaGalaxy are installed in the conda environment. This environment must be activated, to ensure that the analysis runs smoothly.
conda activate metagalaxy
This will run MetaGalaxy with the default parameters
metagalaxy -i <input_file> -o <output_directory>
The first command will print all the barcoding kits available for demultiplexing. The second command will start the demultiplexing process.
metagalaxy --bc_avail
metagalaxy --demulitplex --bc_kit <barcoding_kit> -i <input_directory> -o <output_directory>
This will run MetaGalaxy as normal, but it will not clean up the additional files created during analysis
metagalaxy -i <input_file> -o <output_directory> -t <threads> --keep
usage: MetaGalaxy -i <inputfile> [options]
MetaGalaxy is designed to identify bacteria from metagenomic samples and
detect their AMR genes. It uses basecalled nanopore data in fastq format.
Ensure that the conda environment is activated before using this pipeline
[conda activate metagalaxy].
Arguments for Metagalaxy:
-h, --help show this help message and exit
-v, --version Prints program version and exits Metagalaxy
-i [input] Input .fastq file for analysis or file directory for
demultiplexing
-o [output] Output directory
-t [threads] Amount of threads [56]
-g [gsize] Esitmated genome size [100m]
--demultiplex [] MetaGalaxy will demultiplex the files from the specified
input directory and trim the barcodes
--bc_kit [] Specify the barcoding kit for demultiplexing. Only
required when used in conjunction with --demultiplex
--bc_avail [] Prints all the available barcoding kits for demultiplexing
and exits Metagalaxy
--keep [] Keep all files produced by MetaGalaxy
Thank you for using MetaGalaxy!
Metagalaxy will carry out the following steps during analysis:
- Metagalaxy will perform taxonomic classification on the raw input file * Taxonomic classification is performed by Kraken2 * Classification is performed with the use of a protein and nucleotide database
- Simultaneously, Metagalaxy will asses the quality of the reads and filter out the short reads * The quality of the reads is calculated by NanoPlot * The reads shorter than 1Kbp or below a quality score of 7 are filtered out by Filtlong
- After filtering on quality, plasmid reads are isolated an analysed * Plasmid reads are identified with ABRicate and the PlasmidFinder database. * These reads are then extracted from the main FASTQ file with a python script * ABRicate is used to screen the reads for AMR genes
- After the reads are filtered, Metagalaxy will assemble the reads and polish the assembly * Assembly is performed by Flye with the --meta option * The first four rounds of polishing are performed by Racon * The last polishing round is performed by Medaka
- After assembly, there is an assesment of the quality of the assembly and the assembly will also be visualised * The quality assessment is performed by Quast * The assembly graph is visualised by Bandage
- The assembly will be binned, the bins are then taxonomicially classified and the bins will be screened for antimicrobial resistance genes * The binning is performed by MetaBAT * The taxonomic classification is performed by CAT/BAT * The screening for antimicrobial resistance genes is done by ABRicate, which uses the ResFinder database for the screening
Metagalaxy will produce numerous output files for all the individual analysis. Here, all the outputs are explained. The files that are by default cleaned up, are not explained in great detail.
The first step will produce the folder "raw_read_quality" which contains four output files, two for each database used. These files are named: raw_taxonomy_genus_.txt or raw_taxonomy_species_.txt and have either the suffix ndb or pdb, which specifies if the nucleotide database or protein database has been used to generate the results. The files look like this:
Coverage in Percentage | Covered fragments | Assigned fragments | Rank code | NCBI ID | Scientific name |
---|---|---|---|---|---|
86.62 | 1120718 | 2668 | G | 561 | Escherichia |
3.85 | 49777 | 14 | G | 1350 | Enterococcus |
6.50 | 84085 | 0 | G | 9605 | Homo |
Where the percentage is the percentage of fragments covered by the clade at the specific taxon, and the covered fragments are the number of fragments covered by the clade. The assigned fragments are the number of fragments directly assigned to the taxon. The rank code specifies the taxonomic rank. The NCBI ID is the NCBI taxonomic ID number.
Aside from the text files, two Krona charts will be made as well. One chart displays the taxonomies found in the nucleotide database and the other chart displays the taxonomies found in the protein database. The chart shows the abundance of the predicted taxon in percentage
NanoPlot is used to assess the quality of the raw and filtered reads. The results of the analysis are stored in the "raw_read_quality" and "filtered_read_quality" directory.
This tool produces a html file which can be viewed in any modern browser (Edge, Chrome, Firefox). The html file contains a statistical summary on the reads which includes data on the mean and median read length and quality, N50, total bases and top 5 quality reads and read length. The summary can also be viewed in the text file included in the same directory.
The report also includes multiple histograms which display the read length before and after log transformation; and read length vs read quality plots.
The filtered reads are put in a new .fastq file which has been named filtered.fastq
After the reads have been filtered by Filtlong, the FASTQ file is screened for plasmid reads by ABRicate. A reads is considered to be from a plasmid when ABRicate finds a hit against the PlasmidFinder database with a minimum coverage and identity of 60%. These reads will then be extracted from the filtered.fastq file by a python script. After extraction these reads are screened again for AMR genes. The output of this AMR screening looks similar to the table shown (below)[####Binning]. There is one main difference between this screening and the AMR screening done later on the bins. When the plasmid reads are screened, ABRicate will a gene everytime it is found. This means that when a particular gene is covered ten times, the output file will show that particular gene ten times. When the screening is done on the bins. ABRicate will only report once on any found gene, unless there are duplicates present in the bin.
Flye produces multiple files including the final assembly in .fasta format and an assembly graph in .gfa format. The final assembly is used as input for mapping with minimap2, and polishing with Racon and Medaka. The final consensus that is produced by Medaka, is used by Quast.
Quast produces an html file that when viewed will show a detailed report on the quality of the assembly. Some of the reported results are: Genome fraction (%), NGA50, N50, number of contigs, largest contig size and total contig length.
The assembly graph is used by Bandage to create a .jpg image in which the assembled contigs are visualised.
The bins discovered by MetaBAT saved as FASTA format in the "meta_bins" folder. Individual bins are taxonomically classified and summarized by CAT/BAT. The summary is saved as "taxonomy_bins.txt" and is a tab-delimited file, which contains the name of the bin, whether or not it has been classified, the lineage, the lineage score and the taxonomy from the superkingdom downwards. An example is shown below:
bin | classification | reason | lineage | lineage scores | superkingdom | phylum | class | order | family | genus | species |
---|---|---|---|---|---|---|---|---|---|---|---|
bin.1.fa | classified | based on 5466/5482 ORFs | 1;131567;2;1224;1236;91347;543;561;562 | 1.00;0.93;0.81;0.25;0.21;0.20;0.20;0.10;0.08 | Bacteria: 0.81 | Proteobacteria: 0.25 | Gammaproteobacteria: 0.21 | Enterobacterales: 0.20 | Enterobacteriaceae: 0.20 | Escherichia: 0.10 | Escherichia coli: 0.08 |
bin.2.fa | classified | based on 4014/4056 ORFs | 1;131567;2;1783272;1239;91061;186826;81852;1350;1351 | 1.00;1.00;0.99;0.11;0.08;0.08;0.07;0.07;0.07;0.05 | Bacteria: 0.99 | Firmicutes: 0.08 | Bacilli: 0.08 | Lactobacillales: 0.07 | Enterococcaceae: 0.07 | Enterococcus: 0.07 | Enterococcus faecalis: 0.05 |
bin.lowDepth.fa | classified(1/4) | based on 407/453 ORFs | 1;131567;2;1783270;68336;976;200643;171549;171551;836;2645799;1739529 | 1.00;1.00;1.00;0.24;0.24;0.24;0.24;0.24;0.22;0.21;0.06;0.06 | Bacteria: 1.00 | Bacteroidetes: 0.24 | Bacteroidia: 0.24 | Bacteroidales: 0.24 | Porphyromonadaceae: 0.22 | Porphyromonas: 0.21 | Porphyromonas sp. HMSC077F02: 0.06 |
bin.lowDepth.fa | classified(2/4) | based on 407/453 ORFs | 1;131567;2;1783272;1239;91061;186826;186827;66831;137730 | 1.00;1.00;1.00;0.39;0.35;0.35;0.35;0.34;0.33;0.33 | Bacteria: 1.00 Firmicutes: 0.35 | Bacilli: 0.35 | Lactobacillales: 0.35 | Aerococcaceae: 0.34 | Facklamia: 0.33 | Facklamia ignava: 0.33 | |
bin.lowDepth.fa | classified(3/4) | based on 407/453 ORFs | 1;131567;2;1224;68525;29547;213849;72294;194;827 | 1.00;1.00;1.00;0.16;0.16;0.16;0.16;0.15;0.15;0.14 | Bacteria: 1.00 Proteobacteria: 0.16 | Epsilonproteobacteria: 0.16 | Campylobacterales: 0.16 | Campylobacteraceae: 0.15 | Campylobacter: 0.15 | Campylobacter ureolyticus: 0.14 | |
bin.lowDepth.fa | classified(4/4) | based on 407/453 ORFs | 1;131567;2;1783272;201174;1760;2037;2049;1069494 | 1.00;1.00;1.00;0.39;0.03;0.03;0.03;0.03;0.03 | Bacteria: 1.00 | Actinobacteria: 0.03 | Actinobacteria: 0.03 | Actinomycetales: 0.03 | Actinomycetaceae: 0.03 | Trueperella: 0.03 | not classified |
bin.tooShort.fa | classified | based on 2/4 ORFs | 1;131567;2759;33630;5794;422676;5819;1639119;5820;418103;36330;864142 | 1.00;1.00;0.52;0.52;0.52;0.52;0.52;0.52;0.52;0.52;0.52;0.52 | Eukaryota: 0.52 | Apicomplexa: 0.52 | Aconoidasida: 0.52 | Haemosporida: 0.52 | Plasmodiidae: 0.52 | Plasmodium: 0.52 | Plasmodium ovale: 0.52 |
Aside from the tab-delimited file, a Krona chart is created from the taxonomic assignment. The chart shows which organisms have been identified across all the bins. The abundance of the organisms is displayed in percentage
Antimicrobial resistance (AMR) genes are predicted by ABRicate and it produces two tab delimited files: "amr_results.tab" and "amr_summary.tab".
The first file looks like this:
FILE | SEQUENCE | START | END | STRAND | GENE | COVERAGE | COVERAGE_MAP | GAPS | %COVERAGE | %IDENTITY | DATABASE | ACCESSION | PRODUCT | RESISTANCE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
bin.1.fa | contig_2:1.0-4689146.0 | 6346 | 7578 | - | mdf(A)_1 | 1-1233/1233 | =============== | 0/0 | 100.00 | 97.97 | resfinder | Y08743 | mdf(A) | |
bin.1.fa | contig_2:1.0-4689146.0 | 549616 | 549971 | - | tet(34)_1 | 66-421/465 | .=======/=====. | 2/2 | 76.34 | 74.79 | resfinder | AB061440 | tet(34) | |
bin.1.fa | contig_2:1.0-4689146.0 | 1096532 | 1097632 | + | blaCFE-1_1 | 37-1146/1146 | ========/====== | 5/19 | 95.64 | 71.66 | resfinder | AB107899 | blaCFE-1 | |
bin.1.fa | contig_2:1.0-4689146.0 | 3007312 | 3007616 | - | oqxB_1 | 2660-2964/3153 | ......../...=== | 2/6 | 9.58 | 68.18 | resfinder | EU370913 | oqxB | Nalidixic_acid;Ciprofloxacin |
bin.2.fa | contig_18:1.0-2132224.0 | 101057 | 102558 | + | lsa(A)_1 | 1-1497/1497 | ========/====== | 5/5 | 100.00 | 98.94 | resfinder | AY225127 | lsa(A) | Lincomycin;Clindamycin;Dalfopristin;Pristinamycin_IIA;Virginiamycin_M |
bin.2.fa | contig_18:1.0-2132224.0 | 855358 | 855596 | + | tetB(60)_1 | 1444-1682/1740 | ......../...=== | 8/16 | 13.28 | 71.66 | resfinder | KX000273 | tetB(60) | Doxycycline;Tetracycline;Tigecycline |
With each of the columns displaying this information:
- The name of the bin
- The sequence within the bin that contains the amr gene
- The start coordinate of the identified gene
- The end coordinate of the identified gene
- On which strand the gene is located.
- Forward strand (+)
- Reverse strand (-)
- Gene name
- Proportion of the gene in the input sequence
- Visual representation
=
=aligned,.
=unaligned,/
=has gap - Openings/gaps in subject and query
- Gene coverage in percentage
- Nucleotide coverage in percentage
- Database used to obtain the gene sequence
- Genomic source of the sequence
- Gene product
- Putative antibiotic resistace phenotype
The summary file, summarises all found genes into a single file, which looks like this:
FILE | NUM_FOUND | blaCFE-1_1 | lsa(A)_1 | mdf(A)_1 | oqxB_1 | tet(34)_1 | tetB(60)_1 |
---|---|---|---|---|---|---|---|
bin.1.fa | 4 | 95.64 | . | 100.00 | 9.58 | 76.34 | . |
bin.2.fa | 2 | . | 100.00 | . | . | 13.28 |
Where the first column displays the name of the bin; the second column displays how many amr genes were found in a specific bin; and all following columns display the coverage of each amr gene found in a bin in percentage. When a bin does not contain a specific amr gene, a "." is placed in that cell.
Metagalaxy uses the following tools in the pipeline: