figuregen
Following the instructions below, you should be able to fully reproduce the results in the paper "Scaling metagenome sequence assembly with probabilistic de Bruijn graphs." If you have any questions, email jason.pell@gmail.com.

Obtaining the Data
==================

Currently, the easiest way to obtain the data is through Amazon's EBS. The snapshot ID is snap-5b1b473. If you would like to obtain the data through a different method, email jason.pell@gmail.com.

The other method is to first obtain the following SRA datasets:

    SRA050710.1 (MSB2 soil)
    SRX000429 (E. coli Illumina)
    SRX000430 (E. coli Illumina)

The MSB2 dataset should be conservatively quality trimmed, with sequences that have a quality score of B removed. Call this file msb2.fq. The E. coli Illumina datasets should be combined into one file called reads.fa. Finally, obtain the E. coli genome (GenBank: U00096.2) and call it ecoli.fa.

IMPORTANT: Once you have the data, you need to create a directory called 'data' in this directory that contains the following 3 files:

    reads.fa
    ecoli.fa
    msb2.fq

Package Requirements
====================

To generate all of the results, you will need the following packages:

    R 2.11.1
    Python 2.6.6
    numpy 1.4.1
    graphviz 2.26.3
    Octave 3.2.4
    Google Sparsehash 1.12 (http://code.google.com/p/google-sparsehash)
    ABySS 1.3.1 (http://www.bcgsc.ca/platform/bioinfo/software/abyss)
    Velvet 1.2.01 (http://www.ebi.ac.uk/~zerbino/velvet)
    Screed (github.com/ctb/screed)
    Khmer (pdbg branch) (github.com/ctb/khmer)

Other version numbers may work fine, but this combination has been verified and tested.

Making the Figures
==================

Important Note: Generating all of the figures in this section requires at least 16GB of memory.

Most of the figures and tables can be generated by simply running make in this directory. The figures and tables will appear in the directory 'results'.
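Since make depends on the three input files being in place, it may be worth confirming the 'data' directory before starting a run that takes several hours. The following is a minimal sketch, not part of this repository: the function name missing_data_files is hypothetical, and only the file names and the 'data' directory name come from the instructions above.

```python
import os

# File names as listed in the README. The default "data" path is resolved
# relative to the current working directory (assumed to be this directory).
REQUIRED_FILES = ("reads.fa", "ecoli.fa", "msb2.fq")


def missing_data_files(data_dir="data"):
    """Return the required file names that are absent or empty in data_dir.

    Hypothetical helper for illustration; it checks presence and size only,
    not whether the contents are valid FASTA/FASTQ.
    """
    missing = []
    for name in REQUIRED_FILES:
        path = os.path.join(data_dir, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing or empty in 'data': " + ", ".join(missing))
    else:
        print("All required input files are present.")
```

Run it from this directory; an empty result means the layout matches what the Makefile is expected to read.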
Many of the figures take several hours to generate, so if you would prefer to make only specific figures, here are your options:

make bitsperkmer: Generates a table containing the number of bits required to store a k-mer for several different false positive rates.

make graphvisual: First, a Python script outputs four circular chromosomes with different false positive rates. Then, graphviz (neato specifically) is used to output a PDF containing a visualization of the graphs along with the erroneous k-mers.

make clustsize: Generates the Average Component Size vs. FPR graph.

make numparts: Generates 10,000 reads (1,000 bp each) and partitions them at false positive rates up to 15%.

make seqerror: Creates the E. coli sequencing error vs. false positive table.

make diam: Generates the Diameter vs. FPR graph.

Other Results
=============

Some results are more difficult to automate since they require monitoring memory usage. We have provided the scripts and data to run the same analyses, and the memusg script is in the assemble directory.

To replicate our MSB2 assembly results, you can run the following scripts:

    assemble/001/partition.sh
    assemble/005/partition.sh
    assemble/010/partition.sh
    assemble/015/partition.sh
    assemble/abyssfull.sh

The first four scripts will graphsize filter, partition, and then assemble the dataset. The last script will assemble the entire dataset, which requires at least 33GB of memory.

Finally, one can assemble the provided E. coli dataset to compare with the sequencing error vs. false positives table. We used the following commands:

    ABYSS -k 31 -o contigs.fa data/reads.fa
    mkdir velvetecoli
    velveth velvetecoli 31 -fasta -short data/reads.fa
    velvetg velvetecoli

We chose to do only single-end assemblies because our focus was on demonstrating memory efficiency.
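Because these results depend on monitoring memory usage, here is a rough sketch of how peak memory for a single command could be recorded. It is not the memusg script from the assemble directory and may not measure the same quantity. It assumes a Unix system where Python's resource module is available, and the function name run_and_report_peak is hypothetical.

```python
import resource
import subprocess
import sys


def run_and_report_peak(cmd):
    """Run cmd (a list of arguments) and return (exit_code, peak_rss).

    peak_rss is ru_maxrss for the waited-for children of this process:
    the largest resident set size reached by any single child, not a sum
    across children. Units are kilobytes on Linux and bytes on macOS.
    """
    exit_code = subprocess.call(cmd)
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return exit_code, peak


if __name__ == "__main__" and len(sys.argv) > 1:
    # Everything after the script name is treated as the command to run.
    code, peak = run_and_report_peak(sys.argv[1:])
    print("exit code %d, peak RSS %d (kilobytes on Linux, bytes on macOS)"
          % (code, peak))
```

Saved under a name of your choosing (for example peakmem.py, a hypothetical file name), it could wrap one of the commands above: python peakmem.py velvetg velvetecoli. Run one command per invocation, since the reported value is a maximum over every child the process has waited for so far.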