GraphBin2: Refined and Overlapped Binning of Metagenomic Contigs Using Assembly Graphs

GraphBin2 is an extension of GraphBin which refines the binning results obtained from existing tools and, more importantly, is able to assign contigs to multiple bins. GraphBin2 uses the connectivity and coverage information from assembly graphs to adjust existing binning results on contigs and to infer contigs shared by multiple species.

Getting Started

Dependencies

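GraphBin2 requires several Python packages. They are listed in environment.yml and are installed when you create the Conda environment described under "Setting up the environment".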

Downloading GraphBin2

You can download the latest release of GraphBin2 from Releases or clone the GraphBin2 repository to your machine.

git clone https://github.com/Vini2/GraphBin2.git

If you have downloaded a release, you will have to extract the files using the following command.

unzip [file_name].zip

Now go into the GraphBin2 folder using the following command.

cd GraphBin2/

Setting up the environment

We recommend that you use Conda to run GraphBin2. You can download Anaconda or Miniconda, both of which include Conda.

Once you have installed Conda, make sure you are in the GraphBin2 folder. Now run the following commands to create a Conda environment and activate it to run GraphBin2.

conda env create -f environment.yml
conda activate graphbin2

Now you are ready to run GraphBin2.

If you want to switch back to your normal environment, run the following command.

conda deactivate

Preprocessing

Firstly, you will have to assemble your set of reads into contigs. For this purpose, you can use metaSPAdes or SGA.

metaSPAdes

SPAdes is an assembler based on the de Bruijn graph approach. metaSPAdes is the dedicated metagenomic assembler of SPAdes (SPAdes run in metagenomics mode). Use metaSPAdes to assemble reads into contigs. A sample command is given below.

spades --meta -1 Reads_1.fastq -2 Reads_2.fastq -o /path/output_folder -t 16

SGA

SGA (String Graph Assembler) is an assembler based on the overlap-layout-consensus (more recently, string graph) approach. Use SGA to assemble reads into contigs. Sample commands are given below. You may change the parameters to suit your datasets.

sga preprocess -o reads.fastq --pe-mode 1 Reads_1.fastq Reads_2.fastq
sga index -a ropebwt -t 16 --no-reverse reads.fastq
sga correct -k 41 --learn -t 16 -o reads.k41.fastq reads.fastq
sga index -a ropebwt -t 16 reads.k41.fastq
sga filter -x 2 -t 16 reads.k41.fastq
sga fm-merge -m 45 -t 16  reads.k41.filter.pass.fa
sga index -t 16 reads.k41.filter.pass.merged.fa
sga overlap -m 55 -t 16 reads.k41.filter.pass.merged.fa
sga assemble -m 95 reads.k41.filter.pass.merged.asqg.gz

Next, you have to bin the resulting contigs using an existing contig-binning tool. For our experiments we used MaxBin2 [7] and SolidBin [5] with the commands given below. The first line runs MaxBin2 and the remaining lines run SolidBin.

perl MaxBin-2.2.5/run_MaxBin.pl -contig contigs.fasta -abund abundance.abund -thread 8 -out /path/output_folder
python scripts/gen_kmer.py /path/to/data/contig.fasta 1000 4 
sh gen_cov.sh 
python SolidBin.py --contig_file /path/to/contigs.fasta --composition_profiles /path/to/kmer_4.csv --coverage_profiles /path/to/cov_inputtableR.tsv --output /output/result.tsv --log /output/log.txt --use_sfs

Using GraphBin2

You can see the usage options of GraphBin2 by typing ./graphbin2 -h on the command line. For example,

usage: graphbin2 [-h] --assembler ASSEMBLER --graph GRAPH --contigs CONTIGS                 
                 [--paths PATHS] [--abundance ABUNDANCE] --binned BINNED                 
                 --output OUTPUT [--prefix PREFIX] [--depth DEPTH]
                 [--threshold THRESHOLD] [--delimiter DELIMITER]
                 [--nthreads NTHREADS]

GraphBin2 Help. GraphBin2 is a tool which refines the binning results obtained
from existing tools and, more importantly, is able to assign contigs to
multiple bins. GraphBin2 uses the connectivity and coverage information from
assembly graphs to adjust existing binning results on contigs and to infer
contigs shared by multiple species.

optional arguments:
  -h, --help            show this help message and exit
  --assembler ASSEMBLER
                        name of the assembler used (SPAdes or SGA)
  --graph GRAPH         path to the assembly graph file
  --contigs CONTIGS     path to the contigs file
  --paths PATHS         path to the contigs.paths file
  --abundance ABUNDANCE
                        path to the abundance file
  --binned BINNED       path to the .csv file with the initial binning output
                        from an existing tool
  --output OUTPUT       path to the output folder
  --prefix PREFIX       prefix for the output file
  --depth DEPTH         maximum depth for the breadth-first-search. [default:
                        5]
  --threshold THRESHOLD
                        threshold for determining inconsistent vertices.
                        [default: 1.5]
  --delimiter DELIMITER
                        delimiter for input/output results. Supports a comma
                        (,), a semicolon (;), a tab ($'\t'), a space (" ") and
                        a pipe (|) [default: , (comma)]
  --nthreads NTHREADS   number of threads to use. [default: 8]
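
The --depth option bounds a breadth-first search over the assembly graph. The following is a minimal sketch of what a depth-bounded BFS looks like in general, not GraphBin2's actual implementation. The function name, the adjacency-dictionary representation and the labels dictionary are all assumptions made for illustration.

```python
from collections import deque


def labelled_within_depth(graph, labels, start, max_depth=5):
    """Return {contig: distance} for labelled contigs reachable from `start`
    in at most `max_depth` edges.

    `graph` is a hypothetical adjacency dict {contig: [neighbouring contigs]}
    and `labels` a hypothetical dict {contig: bin number}.
    """
    seen = {start}
    queue = deque([(start, 0)])
    found = {}
    while queue:
        node, depth = queue.popleft()
        if node != start and node in labels:
            found[node] = depth  # BFS guarantees this is the shortest distance
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return found
```

In this sketch a larger depth limit lets an unlabelled contig "see" labelled contigs that are further away in the graph, at the cost of visiting more vertices.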

Input Format

For assemblies produced by SPAdes, graphbin2.py takes the following 4 files as required inputs.

  • Contigs file (in .fasta format)
  • Assembly graph file (in .gfa format)
  • Paths of contigs (in .paths format)
  • Binning output from an existing tool (in .csv format)

For assemblies produced by SGA, graphbin2.py takes the following 4 files as required inputs.

  • Contigs file (in .fasta format)
  • Abundance file (tab separated file with contig ID and coverage in each line)
  • Assembly graph file (in .asqg format)
  • Binning output from an existing tool (in .csv format)
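
The two input sets above can be checked before launching a run. The following is a minimal sketch, assuming you call graphbin2.py from a Python wrapper script. The helper name build_graphbin2_command and the REQUIRED table are hypothetical and not part of GraphBin2; only the option names come from the usage text.

```python
# Required options per assembler, following the two input lists above
# plus the mandatory --output option from the usage text.
REQUIRED = {
    "spades": ("contigs", "graph", "paths", "binned", "output"),
    "sga": ("contigs", "abundance", "graph", "binned", "output"),
}


def build_graphbin2_command(assembler, **options):
    """Return an argument list for graphbin2.py, or raise if an input is missing."""
    key = assembler.lower()
    if key not in REQUIRED:
        raise ValueError(f"unsupported assembler: {assembler!r}")
    missing = [name for name in REQUIRED[key] if not options.get(name)]
    if missing:
        raise ValueError(f"missing inputs for {assembler}: {', '.join(missing)}")
    command = ["python", "graphbin2.py", "--assembler", key]
    for name in REQUIRED[key]:
        command += [f"--{name}", str(options[name])]
    return command
```

The returned list can be passed to subprocess.run. Optional settings such as --depth or --delimiter are deliberately left out of this sketch.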

Note: You can specify the delimiter for the initial binning result file and the final output file using the --delimiter parameter. Enter one of the following values: , for a comma, ; for a semicolon, $'\t' for a tab, " " for a space and | for a pipe.

Note: The abundance file (e.g., abundance.abund) is a tab separated file with contig ID and the coverage for each contig in the assembly. metaSPAdes provides the coverage of each contig in the contig identifier of the final assembly. We can directly extract these values to create the abundance.abund file. However, no such information is provided for contigs produced by SGA. Hence, reads should be mapped back to the assembled contigs in order to determine the coverage of SGA contigs.
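
As a rough illustration of extracting coverage values from metaSPAdes contig identifiers, the sketch below parses the cov_ field and writes a tab-separated abundance file. The function names and the regular expression are assumptions based on the identifier format shown in the examples in this README (NODE_<id>_length_<length>_cov_<coverage>); they are not part of GraphBin2.

```python
import re

# metaSPAdes contig identifiers look like NODE_1_length_507141_cov_16.465306
_COV_PATTERN = re.compile(r"^NODE_\d+_length_\d+_cov_(\d+(?:\.\d+)?)")


def coverage_from_header(header):
    """Return the coverage encoded in a metaSPAdes contig identifier."""
    match = _COV_PATTERN.match(header.lstrip(">").strip())
    if match is None:
        raise ValueError(f"not a metaSPAdes contig identifier: {header!r}")
    return float(match.group(1))


def write_abundance(contigs_fasta, abundance_path):
    """Write one 'contig_id<TAB>coverage' line per contig; return the contig count."""
    count = 0
    with open(contigs_fasta) as fasta, open(abundance_path, "w") as out:
        for line in fasta:
            if not line.startswith(">"):
                continue  # sequence line
            fields = line[1:].split()
            if not fields:
                continue  # empty header line
            contig_id = fields[0]
            out.write(f"{contig_id}\t{coverage_from_header(contig_id)}\n")
            count += 1
    return count
```

For example, write_abundance("contigs.fasta", "abundance.abund") would produce the file for a metaSPAdes assembly. This approach does not apply to SGA contigs, whose identifiers carry no coverage.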

Note: Make sure that the initial binning result consists of contigs belonging to only one bin. GraphBin2 is designed to handle initial contigs which belong to only one bin.

Note: The binning output file should have comma separated values (contig_identifier, bin_number) for each contig. The contents of the binning output file should look similar to the example given below. Contigs are named according to their original identifier and the numbering of bins starts from 1.

Example metaSPAdes binned input

NODE_1_length_507141_cov_16.465306,1
NODE_2_length_487410_cov_94.354557,1
NODE_3_length_483145_cov_59.410818,1
NODE_4_length_468490_cov_20.967912,2
NODE_5_length_459607_cov_59.128379,2
...

Example SGA binned input

contig-0,1
contig-1,2
contig-2,1
contig-3,1
contig-4,2
...
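
Before running GraphBin2, it can help to confirm that the initial binning file has the expected two-column layout and that no contig is assigned to more than one bin. The following is a minimal sketch of such a check. The function name load_initial_bins is hypothetical and not part of GraphBin2.

```python
import csv
from collections import defaultdict


def load_initial_bins(path, delimiter=","):
    """Read 'contig_identifier,bin_number' rows and return {contig: bin}.

    Raises ValueError if a row is malformed, a bin number is below 1,
    or a contig appears in more than one bin.
    """
    bins_of = defaultdict(set)
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        for row_number, row in enumerate(reader, start=1):
            if not row:
                continue  # skip blank lines
            if len(row) != 2:
                raise ValueError(f"line {row_number}: expected 2 fields, got {len(row)}")
            contig, bin_id = row[0].strip(), int(row[1])
            if bin_id < 1:
                raise ValueError(f"line {row_number}: bin numbering starts from 1")
            bins_of[contig].add(bin_id)
    multi = sorted(contig for contig, ids in bins_of.items() if len(ids) > 1)
    if multi:
        raise ValueError(f"contigs assigned to more than one bin: {', '.join(multi)}")
    return {contig: next(iter(ids)) for contig, ids in bins_of.items()}
```

Pass a different delimiter argument if the file was written with one of the other supported delimiters.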

You can use the prepResult.py script to format an initial binning result into the .csv format with contig identifiers and bin IDs. Further details can be found here.

Example Usage

python graphbin2.py --assembler spades --contigs /path/to/contigs.fasta --graph /path/to/graph_file.gfa --paths /path/to/paths_file.paths --binned /path/to/binning_result.csv --output /path/to/output_folder
python graphbin2.py --assembler sga --contigs /path/to/contigs.fa --abundance /path/to/abundance.tsv --graph /path/to/graph_file.asqg --binned /path/to/binning_result.csv --output /path/to/output_folder

Visualization of the metaSPAdes Assembly Graph of the Sim-5G Dataset

Initial Binning Result

[Figure: initial binning result]

Assembly Graph with Refined Labels

[Figure: labels refined]

Assembly Graph after Label Propagation

[Figure: labels propagated]

Assembly Graph with Multi-labelled Vertices

[Figure: multi-labelled vertices]

References

[1] Barnum, T.P., et al.: Genome-resolved metagenomics identifies genetic mobility, metabolic interactions, and unexpected diversity in perchlorate-reducing communities. The ISME Journal 12, 1568–1581 (2018)

[2] Mallawaarachchi, V., Wickramarachchi, A., Lin, Y.: GraphBin: Refined binning of metagenomic contigs using assembly graphs. Bioinformatics, btaa180 (2020)

[3] Nurk, S., et al.: metaSPAdes: a new versatile metagenomic assembler. Genome Research 27(5), 824–834 (2017)

[4] Simpson, J.T., Durbin, R.: Efficient de novo assembly of large genomes using compressed data structures. Genome Research 22(3), 549–556 (2012)

[5] Wang, Z., et al.: SolidBin: improving metagenome binning with semi-supervised normalized cut. Bioinformatics 35(21), 4229–4238 (2019)

[6] Wu, Y.W., et al.: MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome 2(1), 26 (2014)

[7] Wu, Y.W., et al.: MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics 32(4), 605–607 (2016)

Citation

GraphBin2 was accepted at the 20th International Workshop on Algorithms in Bioinformatics (WABI 2020) and is published in Leibniz International Proceedings in Informatics (LIPIcs), DOI: 10.4230/LIPIcs.WABI.2020.8. If you use GraphBin2 in your work, please cite it as follows.

@InProceedings{mallawaarachchi_et_al:LIPIcs:2020:12797,
  author =	{Vijini G. Mallawaarachchi and Anuradha S. Wickramarachchi and Yu Lin},
  title =	{{GraphBin2: Refined and Overlapped Binning of Metagenomic Contigs Using Assembly Graphs}},
  booktitle =	{20th International Workshop on Algorithms in Bioinformatics (WABI 2020)},
  pages =	{8:1--8:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-161-0},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{172},
  editor =	{Carl Kingsford and Nadia Pisanti},
  publisher =	{Schloss Dagstuhl--Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/opus/volltexte/2020/12797},
  URN =		{urn:nbn:de:0030-drops-127974},
  doi =		{10.4230/LIPIcs.WABI.2020.8},
  annote =	{Keywords: Metagenomics binning, contigs, assembly graphs, overlapped binning}
}