GraphBin2 is an extension of GraphBin which refines the binning results obtained from existing tools and, more importantly, is able to assign contigs to multiple bins. GraphBin2 uses the connectivity and coverage information from assembly graphs to adjust existing binning results on contigs and to infer contigs shared by multiple species.
Note: Due to recent requests from the community, we have added support for long-read assemblies produced from Flye. Please note that GraphBin2 has not been tested extensively on long-read assemblies. We originally developed GraphBin2 for short-read assemblies. Long-read assemblies might have sparsely connected graphs which can make the label propagation process less effective and may not result in improvements.
You will need the following Python packages installed.
You can download the latest release of GraphBin2 from Releases or clone the GraphBin2 repository to your machine.
git clone https://github.com/Vini2/GraphBin2.git
If you have downloaded a release, you will have to extract the files using the following command.
unzip [file_name].zip
Now go into the GraphBin2 folder using the following command.
cd GraphBin2/
We recommend that you use Conda to run GraphBin2. You can download Anaconda or Miniconda, both of which include Conda.
Once you have installed Conda, make sure you are in the GraphBin2 folder. Now run the following commands to create a Conda environment and activate it to run GraphBin2.
conda env create -f environment.yml
conda activate graphbin2
Now you are ready to run GraphBin2.
If you want to switch back to your normal environment, run the following command.
conda deactivate
Firstly, you will have to assemble your set of reads into contigs. For this purpose, you can use metaSPAdes or SGA.
SPAdes is an assembler based on the de Bruijn graph approach. metaSPAdes is the dedicated metagenomic assembler of SPAdes. Use metaSPAdes (SPAdes in metagenomics mode) software to assemble reads into contigs. A sample command is given below.
spades --meta -1 Reads_1.fastq -2 Reads_2.fastq -o /path/output_folder -t 16
SGA (String Graph Assembler) is an assembler based on the overlap-layout-consensus (more recently string graph) approach. Use SGA software to assemble reads into contigs. Sample commands are given below. You may change the parameters to suit your datasets.
sga preprocess -o reads.fastq --pe-mode 1 Reads_1.fastq Reads_2.fastq
sga index -a ropebwt -t 16 --no-reverse reads.fastq
sga correct -k 41 --learn -t 16 -o reads.k41.fastq reads.fastq
sga index -a ropebwt -t 16 reads.k41.fastq
sga filter -x 2 -t 16 reads.k41.fastq
sga fm-merge -m 45 -t 16 reads.k41.filter.pass.fa
sga index -t 16 reads.k41.filter.pass.merged.fa
sga overlap -m 55 -t 16 reads.k41.filter.pass.merged.fa
sga assemble -m 95 reads.k41.filter.pass.merged.asqg.gz
Next, you have to bin the resulting contigs using an existing contig-binning tool. We have used the following tools with their commands for the experiments.
perl MaxBin-2.2.5/run_MaxBin.pl -contig contigs.fasta -abund abundance.abund -thread 8 -out /path/output_folder
python scripts/gen_kmer.py /path/to/data/contig.fasta 1000 4
sh gen_cov.sh
python SolidBin.py --contig_file /path/to/contigs.fasta --composition_profiles /path/to/kmer_4.csv --coverage_profiles /path/to/cov_inputtableR.tsv --output /output/result.tsv --log /output/log.txt --use_sfs
You can see the usage options of GraphBin2 by typing ./graphbin2 -h on the command line. For example,
usage: graphbin2 [-h] --assembler ASSEMBLER --graph GRAPH --contigs CONTIGS
[--paths PATHS] [--abundance ABUNDANCE] --binned BINNED
--output OUTPUT [--prefix PREFIX] [--depth DEPTH]
[--threshold THRESHOLD] [--delimiter DELIMITER]
[--nthreads NTHREADS]
GraphBin2 Help. GraphBin2 is a tool which refines the binning results obtained
from existing tools and, more importantly, is able to assign contigs to
multiple bins. GraphBin2 uses the connectivity and coverage information from
assembly graphs to adjust existing binning results on contigs and to infer
contigs shared by multiple species.
optional arguments:
-h, --help show this help message and exit
--assembler ASSEMBLER
name of the assembler used (SPAdes, SGA or Flye)
--graph GRAPH path to the assembly graph file
--contigs CONTIGS path to the contigs file
--paths PATHS path to the contigs.paths file
--abundance ABUNDANCE
path to the abundance file
--binned BINNED path to the .csv file with the initial binning output
from an existing tool
--output OUTPUT path to the output folder
--prefix PREFIX prefix for the output file
--depth DEPTH maximum depth for the breadth-first-search. [default:
5]
--threshold THRESHOLD
threshold for determining inconsistent vertices.
[default: 1.5]
--delimiter DELIMITER
delimiter for input/output results. Supports a comma
(,), a semicolon (;), a tab ($'\t'), a space (" ") and
a pipe (|) [default: , (comma)]
--nthreads NTHREADS number of threads to use. [default: 8]
The SPAdes version of graphbin2.py takes in 4 files as inputs (required).
- Contigs file (in .fasta format)
- Assembly graph file (in .gfa format)
- Paths of contigs (in .paths format)
- Binning output from an existing tool (in .csv format)
The SGA version of graphbin2.py takes in 4 files as inputs (required).
- Contigs file (in .fasta format)
- Abundance file (tab separated file with contig ID and coverage in each line)
- Assembly graph file (in .asqg format)
- Binning output from an existing tool (in .csv format)
The Flye version of graphbin2.py takes in 4 files as inputs (required).
- Contigs file (in .fasta format)
- Abundance file (tab separated file with contig ID and coverage in each line)
- Assembly graph file (in .gfa format)
- Binning output from an existing tool (in .csv format)
Note: You can specify the delimiter for the initial binning result file and the final output file using the --delimiter parameter. Enter the following values for the different delimiters: , for a comma, ; for a semicolon, $'\t' for a tab, " " for a space and | for a pipe.
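If an initial binning result uses a different delimiter than the one you want to pass on, it can be rewritten beforehand. The following is a minimal sketch, not part of GraphBin2; the function name convert_delimiter and the in-memory example data are hypothetical.

```python
import csv
import io


def convert_delimiter(src, dst, in_delim="\t", out_delim=","):
    """Copy (contig_identifier, bin_number) rows from src to dst with a new delimiter."""
    reader = csv.reader(src, delimiter=in_delim)
    writer = csv.writer(dst, delimiter=out_delim, lineterminator="\n")
    for row in reader:
        if row:  # skip blank lines
            writer.writerow([field.strip() for field in row])


# Made-up example: a tab-separated binning result held in memory.
tab_separated = io.StringIO("contig-0\t1\ncontig-1\t2\n")
comma_separated = io.StringIO()
convert_delimiter(tab_separated, comma_separated)
print(comma_separated.getvalue(), end="")
```

With real files, open the input and output with open(path, newline="") and pass the file objects in place of the StringIO buffers.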
Note: The abundance file (e.g., abundance.abund) is a tab-separated file with the contig ID and the coverage of each contig in the assembly. metaSPAdes provides the coverage of each contig in the contig identifier of the final assembly. We can directly extract these values to create the abundance.abund file. However, no such information is provided for contigs produced by SGA. Hence, reads should be mapped back to the assembled contigs in order to determine the coverage of SGA contigs.
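For metaSPAdes contigs, extracting the coverage could look like the sketch below. It assumes identifiers of the form NODE_<n>_length_<len>_cov_<cov>, as in the example binned input in this document; the function name abundance_lines is hypothetical and the sequences are made up.

```python
import re

# Assumed identifier pattern, based on the metaSPAdes contig names shown in this README.
COV_PATTERN = re.compile(r"^NODE_\d+_length_\d+_cov_(\d+(?:\.\d+)?)")


def abundance_lines(fasta_lines):
    """Yield 'contig_id<TAB>coverage' for every FASTA header line."""
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, nothing to extract
        contig_id = line[1:].strip().split()[0]
        match = COV_PATTERN.match(contig_id)
        if match is None:
            raise ValueError(f"no coverage found in identifier: {contig_id}")
        yield f"{contig_id}\t{match.group(1)}"


# Made-up sequences under real-looking metaSPAdes headers.
example = [
    ">NODE_1_length_507141_cov_16.465306\n",
    "ACGT\n",
    ">NODE_2_length_487410_cov_94.354557\n",
    "TTGA\n",
]
rows = list(abundance_lines(example))
print("\n".join(rows))
```

To build the file itself, iterate over an opened contigs.fasta instead of the example list and write each yielded line to abundance.abund.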
Note: Make sure that the initial binning result consists of contigs belonging to only one bin. GraphBin2 is designed to handle initial contigs which belong to only one bin.
Note: The binning output file should have comma-separated values (contig_identifier, bin_number) for each contig. The contents of the binning output file should look similar to the examples given below. Contigs are named according to their original identifier and the numbering of bins starts from 1.
Example metaSPAdes binned input
NODE_1_length_507141_cov_16.465306,1
NODE_2_length_487410_cov_94.354557,1
NODE_3_length_483145_cov_59.410818,1
NODE_4_length_468490_cov_20.967912,2
NODE_5_length_459607_cov_59.128379,2
...
Example SGA binned input
contig-0,1
contig-1,2
contig-2,1
contig-3,1
contig-4,2
...
Example Flye binned input
edge_1,1
edge_2,2
edge_3,1
edge_4,1
edge_5,2
...
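Before running GraphBin2, it can be useful to check that an initial binning result follows the format described above: two fields per line, bin numbers starting from 1, and every contig assigned to only one bin. The following is a rough sketch of such a check; check_binning is a hypothetical helper and not part of GraphBin2.

```python
import csv
import io


def check_binning(handle, delimiter=","):
    """Return {contig: bin}; raise ValueError on malformed or multi-bin entries."""
    bins = {}
    for line_no, row in enumerate(csv.reader(handle, delimiter=delimiter), start=1):
        if not row:
            continue  # ignore blank lines
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 fields, got {len(row)}")
        contig, bin_id = row[0].strip(), int(row[1])
        if bin_id < 1:
            raise ValueError(f"line {line_no}: bin numbers should start from 1")
        if contig in bins and bins[contig] != bin_id:
            raise ValueError(f"line {line_no}: {contig} is assigned to more than one bin")
        bins[contig] = bin_id
    return bins


# Made-up example in the SGA naming style.
assignment = check_binning(io.StringIO("contig-0,1\ncontig-1,2\ncontig-2,1\n"))
print(assignment)
```

For a file on disk, pass open(path, newline="") instead of the StringIO buffer.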
You can use the prepResult.py script to format an initial binning result into the .csv format with contig identifiers and bin IDs. Further details can be found here.
Before using Flye assemblies for binning, please use the gfa2fasta.py script to get the edge sequences. Further details can be found here.
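To illustrate what getting the edge sequences involves, here is a minimal sketch that writes the segment (S) lines of a GFA file as FASTA records. It is not the gfa2fasta.py script and may differ from it; it only assumes tab-separated GFA segment lines of the form S, name, sequence. The function name and the example lines are made up.

```python
def gfa_segments_to_fasta(gfa_lines):
    """Yield a FASTA record for every segment ('S') line that carries a sequence."""
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        # Segment lines look like: S <name> <sequence> [optional tags...]
        # A '*' in the sequence column means the sequence is not stored in the file.
        if len(fields) >= 3 and fields[0] == "S" and fields[2] != "*":
            yield f">{fields[1]}\n{fields[2]}\n"


# Made-up GFA content with two edges and one link.
example_gfa = [
    "H\tVN:Z:1.0\n",
    "S\tedge_1\tACGTACGT\tdp:i:12\n",
    "S\tedge_2\tTTGACA\tdp:i:30\n",
    "L\tedge_1\t+\tedge_2\t-\t0M\n",
]
fasta = "".join(gfa_segments_to_fasta(example_gfa))
print(fasta, end="")
```

The record names (edge_1, edge_2, ...) then match the identifiers expected in the Flye binned input.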
python graphbin2.py --assembler spades --contigs /path/to/contigs.fasta --graph /path/to/graph_file.gfa --paths /path/to/paths_file.paths --binned /path/to/binning_result.csv --output /path/to/output_folder
python graphbin2.py --assembler sga --contigs /path/to/contigs.fa --abundance /path/to/abundance.tsv --graph /path/to/graph_file.asqg --binned /path/to/binning_result.csv --output /path/to/output_folder
python graphbin2.py --assembler flye --contigs /path/to/edges.fasta --abundance /path/to/abundance.tsv --graph /path/to/graph_file.gfa --binned /path/to/binning_result.csv --output /path/to/output_folder
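When GraphBin2 is called from a pipeline, the three sample commands above can be assembled programmatically. The sketch below only uses the options shown in those commands; build_command and the file names are hypothetical, and the script is assumed to be run from the GraphBin2 folder.

```python
import shlex

# Extra required input per assembler, taken from the sample commands above.
EXTRA_INPUT = {"spades": "--paths", "sga": "--abundance", "flye": "--abundance"}


def build_command(assembler, contigs, graph, binned, output, extra):
    """Return the argument list for one GraphBin2 run."""
    if assembler not in EXTRA_INPUT:
        raise ValueError(f"unsupported assembler: {assembler}")
    return [
        "python", "graphbin2.py",
        "--assembler", assembler,
        "--contigs", contigs,
        "--graph", graph,
        EXTRA_INPUT[assembler], extra,  # paths file for SPAdes, abundance file otherwise
        "--binned", binned,
        "--output", output,
    ]


cmd = build_command("sga", "contigs.fa", "graph.asqg", "bins.csv", "out/", "abundance.tsv")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.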
[1] Barnum, T.P., et al.: Genome-resolved metagenomics identifies genetic mobility, metabolic interactions, and unexpected diversity in perchlorate-reducing communities. The ISME Journal 12, 1568-1581 (2018)
[2] Mallawaarachchi, V., Wickramarachchi, A., Lin, Y.: GraphBin: Refined binning of metagenomic contigs using assembly graphs. Bioinformatics, btaa180 (2020)
[3] Nurk, S., et al.: metaSPAdes: a new versatile metagenomic assembler. Genome Research 27(5), 824-834 (2017)
[4] Simpson, J. T. and Durbin, R.: Efficient de novo assembly of large genomes using compressed data structures. Genome Research 22(3), 549–556 (2012)
[5] Wang, Z., et al.: SolidBin: improving metagenome binning with semi-supervised normalized cut. Bioinformatics 35(21), 4229–4238 (2019)
[6] Wu, Y.W., et al.: MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome 2(1), 26 (2014)
[7] Wu, Y.W., et al.: MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics 32(4), 605–607 (2016)
GraphBin2 has been accepted for publication at the 20th International Workshop on Algorithms in Bioinformatics (WABI 2020) and is published in Leibniz International Proceedings in Informatics (LIPIcs) DOI: 10.4230/LIPIcs.WABI.2020.8. If you use GraphBin2 in your work, please cite GraphBin2 as follows.
@InProceedings{mallawaarachchi_et_al:LIPIcs:2020:12797,
author = {Vijini G. Mallawaarachchi and Anuradha S. Wickramarachchi and Yu Lin},
title = {{GraphBin2: Refined and Overlapped Binning of Metagenomic Contigs Using Assembly Graphs}},
booktitle = {20th International Workshop on Algorithms in Bioinformatics (WABI 2020)},
pages = {8:1--8:21},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-161-0},
ISSN = {1868-8969},
year = {2020},
volume = {172},
editor = {Carl Kingsford and Nadia Pisanti},
publisher = {Schloss Dagstuhl--Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2020/12797},
URN = {urn:nbn:de:0030-drops-127974},
doi = {10.4230/LIPIcs.WABI.2020.8},
annote = {Keywords: Metagenomics binning, contigs, assembly graphs, overlapped binning}
}