forked from dib-lab/khmer
-
Notifications
You must be signed in to change notification settings - Fork 0
/
README
114 lines (89 loc) · 3.66 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
Following the instructions below, you should be able to fully reproduce the
results in the paper "Scaling metagenome sequence assembly with
probabilistic de Bruijn graphs." If you have any questions, email
jason.pell@gmail.com.
Obtaining the Data
==================
Currently, the easiest way to obtain the data is through Amazon's
EBS. The snapshot ID is snap-5b1b473. If you would like to obtain
the data through a different method, email jason.pell@gmail.com.
The other method is to first obtain the following SRA datasets:
SRA050710.1 (MSB2 soil)
SRX000429 (E. coli Illumina)
SRX000430 (E. coli Illumina)
The MSB2 dataset should be conservatively quality trimmed
with sequences that have a quality score of B removed. Call this
file msb2.fq.
The E. coli Illumina datasets should be combined into one file
called reads.fa
Finally, obtain the E. coli genome (GenBank: U00096.2) and call it
ecoli.fa.
IMPORTANT: Once you have the data, you need to create a directory
called 'data' in this directory that contains the following 3 files:
reads.fa
ecoli.fa
msb2.fq
Package Requirements
====================
To generate all of the results, you will need the following packages:
R 2.11.1
Python 2.6.6
numpy 1.4.1
graphviz 2.26.3
Octave 3.2.4
Google Sparsehash 1.12 (http://code.google.com/p/google-sparsehash)
ABySS 1.3.1 (http://www.bcgsc.ca/platform/bioinfo/software/abyss)
Velvet 1.2.01 (http://www.ebi.ac.uk/~zerbino/velvet)
Screed (github.com/ctb/screed)
Khmer (pdbg branch) (github.com/ctb/khmer)
Other version numbers may work fine, but this combination has been
verified and tested.
Making the Figures
==================
Important Note: Generating all of the figures in this section
require at least 16GB of memory.
Most of the figures and tables can be generated by simply running
make
in this directory. The figures and tables will appear in the directory
'results.' Many of the figures take several hours to generate, so if
you would prefer to make only specific figures, here are your options:
make bitsperkmer:
Generates table containing the number of bits required to store a k-mer for
several different false positive rates.
make graphvisual:
First, a Python script generates outputs four circular chromosomes with
different false positive rates. Then, graphviz (neato specifically) is
used to output a PDF containing a visualization of the graphs along with
the erroneous k-mers.
make clustsize:
Generates the Average Component Size vs. FPR graph.
make numparts:
Generates 10,000 reads (1,000 bp each) and partitions them at false positive
rates up to 15%.
make seqerror:
Creates the E. coli sequencing error vs. false positive table.
make diam:
Generates the Diameter vs. FPR graph.
Other Results
=============
Some results are more difficult to automate since they require monitoring
memory usage. We have provided the scripts and data to run the
same analyses, and the memusg script is in the assemble directory.
To replicate our MSB2 assembly results, you can run the following scripts:
assemble/001/partition.sh
assemble/005/partition.sh
assemble/010/partition.sh
assemble/015/partition.sh
assemble/abyssfull.sh
The first four scripts will graphsize filter, partition, and then assemble
the dataset. The last script will assemble the entire dataset, which
requires at least 33GB of memory.
Finally, one can assemble the provided ecoli dataset to compare with
the sequencing error vs. false positives table. We used the following
commands:
ABYSS -k 31 -o contigs.fa data/reads.fa
mkdir velvetecoli
velveth velvetecoli 31 -fasta -short data/reads.fa
velvetg velvetecoli
We chose to only do single-end assemblies because our focus was on
demonstrating memory efficiency.