Tutorial Barcoded Targeted

title: "Tutorial: Ultra-rare variant detection using consensus reads and targeted sequencing"

In this exercise, you will analyze targeted sequencing of the initial burst of genetic diversity in a short E. coli evolution experiment. This tutorial uses data prepared by a special library preparation technique that adds a "molecular index" to each initial DNA fragment. This enables one to sequence many amplification products from this initial read to achieve lower error rates. In addition pulldowns with biotinylated oligos were used to enrich for only certain genes in the E. coli genome to achieve deeper sequencing of regions that were expected to have beneficial mutations.

Note

This tutorial was created for the EMBO Practical Course Measuring intra-species diversity using high-throughput sequencing held 27–31 July 2015 in Oeras, Portugal.

Warning

If you encounter any _breseq or _gdtools errors or crashes in running this tutorial, please report issues on GitHub.

1. Download data files

First, create a directory called sscs_targeted:

$ mkdir tutorial_barcoded_targeted
$ cd tutorial_barcoded_targeted

Reference sequence

The samples sequenced were genomic DNA from populations evolved from a clonal isolate of Escherichia coli B strain REL606. We'll contrast some new ways of analyzing this data that require us to use different reference genome setups.

First, download the entire reference genome.

Download REL606.gbk via this link

Then, download these two reference files:

In the first of these (on-target.gff3), we've extracted just the 8 target genes with 1400 bp added on each side as three different reference fragments. Each of these reference sequences were created by using gdtools APPLY to delete the rest of the genome.

In the second of these (off-target.gff3), we've masked (via N's) the targeted regions in the whole genome sequence. This reference sequence was created by using gdtools MASK.

This separation of the reference sequence and the 8 targeted regions into different files and (artificial) DNA fragments will makes it easier to calculate certain statistics about enrichment of the target region relative to the rest of the genome and to ignore mutations that occur outside the regions of interest

Read files

Paired-end Illumina reads for one population sample taken at generation 139 from a time-course:

2. Generate SSCS Reads

First, we need to pre-process the reads to construct single-strand consensus reads and remove the molecular barcodes. If you have numpy and other Python prerequisites installed, you can do this by downloading this script:

Download SSCS_DCS.py

And then running this command:

python SSCS_DCS.py -f1 DED234_GATCAG_L004_R1_001.fastq -f2 DED234_GATCAG_L004_R2_001.fastq -p DED234 -s -d -m 2 --log SSCS_Log

This script will take about 30 minutes to run.

Warning

This script is memory intensive! (16 GB RAM required)

3. Run _breseq on the consensus reads

Use this command to analyze the consensus reads.

$ breseq -t -j 8 -o consensus_reads -p --polymorphism-minimum-coverage-each-strand 0 --polymorphism-bias-cutoff 0 --polymorphism-score-cutoff 0 --polymorphism-reject-indel-homopolymer-length 0 --polymorphism-reject-surrounding-homopolymer-length 0 -r on-target.gff3 -s off-target.gff3 DED234_SSCS.fastq

Notice the new -t and -s options. What are these doing? Look in the help under Reference File Options.

Take a look at the _breseq output. In particular, examine the summary statistics to look at the distribution of coverage across the different reference sequence fragments that were used. How effective was the enrichment step at favoring reads from those regions?

4. Run _breseq on the original reads

Your next task is to compare the performance of the consensus reads to the original reads. To make this comparison fair, you need to include the same numbers of original reads and consensus reads. Do this by extracting the same number of lines from the R1 file as exist in the new DED234_SSCS.fastq file.

How? I need a hint

Check out the wc -l and head unix commands.

You also need to trim the molecular barcodes from these reads (this was done automatically by the SSCS_DCS.py script for us before).

Then, run a similar _breseq command to the one above to generate an original_reads output directory.

Compare the overall _breseq predictions on each data set (possibly by making a comparison table, as we have in the previous tutorials).

Also, take a look at a part of the _breseq output that you may not have examined yet. On the summary statistics page, click on the link named errors in the Read File Information table at the top. This graph shows the sequencing error rates in the input reads. Compare the results for the consensus and original reads.

Home

Quick Start
Installation
Test Drive

More Options
Usage: breseq
Usage: gdtools

More Information
GenomeDiff File Format
Reference Sequence File Formats
Output
Methods
Bibliography
FAQ

More Examples
Tutorial: Clones
Tutorial: Populations
Tutorial: Barcoded/Targeted
Tutorial: Curation

Contribute
Developer

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tutorial Barcoded Targeted

title: "Tutorial: Ultra-rare variant detection using consensus reads and targeted sequencing"

1. Download data files

Reference sequence

Read files

2. Generate SSCS Reads

3. Run _breseq on the consensus reads

4. Run _breseq on the original reads

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally