figuregen/README

Following the instructions below, you should be able to fully reproduce the 
results in the paper "Scaling metagenome sequence assembly with 
probabilistic de Bruijn graphs." If you have any questions, email 
jason.pell@gmail.com.

Obtaining the Data
==================
Currently, the easiest way to obtain the data is through Amazon's 
EBS. The snapshot ID is snap-5b1b473. If you would like to obtain 
the data through a different method, email jason.pell@gmail.com.

The other method is to first obtain the following SRA datasets:
SRA050710.1 (MSB2 soil)
SRX000429 (E. coli Illumina)
SRX000430 (E. coli Illumina)

The MSB2 dataset should be conservatively quality trimmed 
with sequences that have a quality score of B removed. Call this 
file msb2.fq.

The E. coli Illumina datasets should be combined into one file 
called reads.fa

Finally, obtain the E. coli genome (GenBank: U00096.2) and call it 
ecoli.fa.

IMPORTANT: Once you have the data, you need to create a directory 
called 'data' in this directory that contains the following 3 files:

	reads.fa
	ecoli.fa
	msb2.fq

Package Requirements
====================
To generate all of the results, you will need the following packages:

R 2.11.1
Python 2.6.6
numpy 1.4.1
graphviz 2.26.3
Octave 3.2.4
Google Sparsehash 1.12 (http://code.google.com/p/google-sparsehash)
ABySS 1.3.1 (http://www.bcgsc.ca/platform/bioinfo/software/abyss)
Velvet 1.2.01 (http://www.ebi.ac.uk/~zerbino/velvet)
Screed (github.com/ctb/screed)
Khmer (pdbg branch) (github.com/ctb/khmer)

Other version numbers may work fine, but this combination has been 
verified and tested.

Making the Figures
==================
Important Note: Generating all of the figures in this section 
require at least 16GB of memory.

Most of the figures and tables can be generated by simply running 

	make 

in this directory. The figures and tables will appear in the directory 
'results.' Many of the figures take several hours to generate, so if 
you would prefer to make only specific figures, here are your options:

make bitsperkmer:
Generates table containing the number of bits required to store a k-mer for 
several different false positive rates.

make graphvisual:
First, a Python script generates outputs four circular chromosomes with 
different false positive rates. Then, graphviz (neato specifically) is 
used to output a PDF containing a visualization of the graphs along with 
the erroneous k-mers.

make clustsize:
Generates the Average Component Size vs. FPR graph.

make numparts:
Generates 10,000 reads (1,000 bp each) and partitions them at false positive 
rates up to 15%.

make seqerror:
Creates the E. coli sequencing error vs. false positive table.

make diam:
Generates the Diameter vs. FPR graph.

Other Results
=============
Some results are more difficult to automate since they require monitoring 
memory usage. We have provided the scripts and data to run the 
same analyses, and the memusg script is in the assemble directory.

To replicate our MSB2 assembly results, you can run the following scripts:
	assemble/001/partition.sh
	assemble/005/partition.sh
	assemble/010/partition.sh
	assemble/015/partition.sh
	assemble/abyssfull.sh

The first four scripts will graphsize filter, partition, and then assemble 
the dataset. The last script will assemble the entire dataset, which 
requires at least 33GB of memory.

Finally, one can assemble the provided ecoli dataset to compare with 
the sequencing error vs. false positives table. We used the following 
commands:
ABYSS -k 31 -o contigs.fa data/reads.fa
mkdir velvetecoli
velveth velvetecoli 31 -fasta -short data/reads.fa
velvetg velvetecoli

We chose to only do single-end assemblies because our focus was on 
demonstrating memory efficiency.