-
Notifications
You must be signed in to change notification settings - Fork 23
Tutorial Barcoded Targeted
In this exercise, you will analyze targeted sequencing of the initial burst of genetic diversity in a short E. coli evolution experiment. This tutorial uses data prepared by a special library preparation technique that adds a "molecular index" to each initial DNA fragment. This enables one to sequence many amplification products from this initial read to achieve lower error rates. In addition pulldowns with biotinylated oligos were used to enrich for only certain genes in the E. coli genome to achieve deeper sequencing of regions that were expected to have beneficial mutations.
Note
This tutorial was created for the EMBO Practical Course Measuring intra-species diversity using high-throughput sequencing held 27–31 July 2015 in Oeras, Portugal.
Warning
If you encounter any _breseq or _gdtools errors or crashes in running this tutorial, please report issues on GitHub.
First, create a directory called sscs_targeted:
$ mkdir tutorial_barcoded_targeted
$ cd tutorial_barcoded_targetedThe samples sequenced were genomic DNA from populations evolved from a clonal isolate of Escherichia coli B strain REL606. We'll contrast some new ways of analyzing this data that require us to use different reference genome setups.
First, download the entire reference genome.
Download REL606.gbk via this link
Then, download these two reference files:
In the first of these (on-target.gff3), we've extracted just the 8
target genes with 1400 bp added on each side as three different
reference fragments. Each of these reference sequences were created by
using gdtools APPLY to delete the rest of the genome.
In the second of these (off-target.gff3), we've masked (via N's) the
targeted regions in the whole genome sequence. This reference sequence
was created by using gdtools MASK.
This separation of the reference sequence and the 8 targeted regions into different files and (artificial) DNA fragments will makes it easier to calculate certain statistics about enrichment of the target region relative to the rest of the genome and to ignore mutations that occur outside the regions of interest
Paired-end Illumina reads for one population sample taken at generation 139 from a time-course:
First, we need to pre-process the reads to construct single-strand consensus reads and remove the molecular barcodes. If you have numpy and other Python prerequisites installed, you can do this by downloading this script:
And then running this command:
python SSCS_DCS.py -f1 DED234_GATCAG_L004_R1_001.fastq -f2 DED234_GATCAG_L004_R2_001.fastq -p DED234 -s -d -m 2 --log SSCS_LogThis script will take about 30 minutes to run.
Warning
This script is memory intensive! (16 GB RAM required)
3. Run _breseq on the consensus reads
Use this command to analyze the consensus reads.
$ breseq -t -j 8 -o consensus_reads -p --polymorphism-minimum-coverage-each-strand 0 --polymorphism-bias-cutoff 0 --polymorphism-score-cutoff 0 --polymorphism-reject-indel-homopolymer-length 0 --polymorphism-reject-surrounding-homopolymer-length 0 -r on-target.gff3 -s off-target.gff3 DED234_SSCS.fastqNotice the new -t and -s options. What are these doing? Look in the help under Reference File Options.
Take a look at the _breseq output. In particular, examine the summary statistics to look at the distribution of coverage across the different reference sequence fragments that were used. How effective was the enrichment step at favoring reads from those regions?
4. Run _breseq on the original reads
Your next task is to compare the performance of the consensus reads to
the original reads. To make this comparison fair, you need to include
the same numbers of original reads and consensus reads. Do this by
extracting the same number of lines from the R1 file as exist in the new
DED234_SSCS.fastq file.
How? I need a hint
Check out the wc -l and head unix commands.
You also need to trim the molecular barcodes from these reads (this was
done automatically by the SSCS_DCS.py script for us before).
Then, run a similar _breseq command to the one above to generate an
original_reads output directory.
Compare the overall _breseq predictions on each data set (possibly by making a comparison table, as we have in the previous tutorials).
Also, take a look at a part of the _breseq output that you may not have examined yet. On the summary statistics page, click on the link named errors in the Read File Information table at the top. This graph shows the sequencing error rates in the input reads. Compare the results for the consensus and original reads.
Quick Start
Installation
Test Drive
More Options
Usage: breseq
Usage: gdtools
More Information
GenomeDiff File Format
Reference Sequence File Formats
Output
Methods
Bibliography
FAQ
More Examples
Tutorial: Clones
Tutorial: Populations
Tutorial: Barcoded/Targeted
Tutorial: Curation
Contribute
Developer