Skip to content

Genome seq

bowhan edited this page Apr 12, 2016 · 6 revisions

piPipes Genome-seq pipeline

This document explains how to run piPipes Genome-seq pipeline and how to interpret the output.

This pipeline provides analysis on transposon insertion/deletion as well as general structural variation (SV) events using paired-end Genome-seq reads generated by Next Generation Sequencing.

Example. Run Genome-seq library

Install the fly dm3 genome, if you haven't done so

# require internet access
piPipes install -g dm3

Download Genome-seq sample data from NCBI SRA

# use fastq-dump from SRATools (http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software)
# to download data and convert to fastq; require internet access
fastq-dump -F --split-3 --gzip -A Theurkauf.GSQ.w1xHar.21day.ovary SRR333512

Check usage message

piPipes dna
# or
piPipes dna -h

Using default parameters

# -l: left read of  Genome-seq
# -r: right read of Genome-seq
# -g: genome to use
# -o: output directory
piPipes dna \
  -l Theurkauf.GSQ.w1xHar.21day.ovary_1.fastq.gz \
  -r Theurkauf.GSQ.w1xHar.21day.ovary_2.fastq.gz \
  -g dm3 \
  -o Theurkauf.GSQ.w1xHar.21day.ovary.piPipes_out \
  1> Theurkauf.GSQ.w1xHar.21day.ovary.piPipes.stdout \
  2> Theurkauf.GSQ.w1xHar.21day.ovary.piPipes.stderr

Run the pipeline with optional parameter

# -l: left read of  Genome-seq
# -r: right read of Genome-seq
# -g: genome to use
# -o: output directory
# -d: VCF filtering depth, passed to "vcfutils.pl varFilter -D" and "retroseq.pl -call -depth"
# -e: Transgene sequence in fasta format. piPipes calls TEMP to identify new transposon
#     insertion site
# -M: Use mrFast and VariationHunter. mrFast requires large amount of memory so
#     it's off by default.
piPipes dna \
  -l Theurkauf.GSQ.w1xHar.21day.ovary_1.fastq.gz \
  -r Theurkauf.GSQ.w1xHar.21day.ovary_2.fastq.gz \
  -g dm3 \
  -d 200 \
  -e P.fa \
  -M  \
  -o Theurkauf.GSQ.w1xHar.21day.ovary.piPipes_out \
  1> Theurkauf.GSQ.w1xHar.21day.ovary.piPipes.stdout \
  2> Theurkauf.GSQ.w1xHar.21day.ovary.piPipes.stderr

Interpretation of the output files

The output should contain the following directories:

mrFast_VariationHunter_output/
bigWig/
TEMP_output/
break_dancer_out/
bwa_bcftools_output/
retroSeq_discovering/
pdfs/

mrFast_VariationHunter_output/ contains the output of mrFast and Variation Hunter. By default, mrFast and Variation Hunder are not ran because mrFast uses memory that is proportional to the input Fastq. In most cases, we were unable to finish running.

bwa_bcftools_output/ contains the genome alignments as well as SNP calling using bcftools.

# genome alignment by BWA MEM algorithm and sorted by coordinates
Theurkauf.GSQ.w1xHar.21day.dm3.bwa-mem.log
Theurkauf.GSQ.w1xHar.21day.dm3.bwa-mem.bam
Theurkauf.GSQ.w1xHar.21day.dm3.bwa-mem.sorted.bam
Theurkauf.GSQ.w1xHar.21day.dm3.bwa-mem.sorted.bam.bai
# SNP calling output
Theurkauf.GSQ.w1xHar.21day.dm3.bwa-mem.var.raw.bcf
Theurkauf.GSQ.w1xHar.21day.dm3.bwa-mem.var.flt.vcf
# bedGraph, used to produce the bigWig file
Theurkauf.GSQ.w1xHar.21day.dm3.bwa-mem.sorted.bdg

# genome alignment by BWA ALN algorithm, only used by TEMP
Theurkauf.GSQ.w1xHar.21day.1_sequence.sai
Theurkauf.GSQ.w1xHar.21day.2_sequence.sai
Theurkauf.GSQ.w1xHar.21day.bwa-aln.sorted.bam
Theurkauf.GSQ.w1xHar.21day.bwa-aln.sorted.bam.bai

bigWig/ directory contains files that can be used by UCSC genome browser.

Theurkauf.GSQ.w1xHar.21day.dm3.bwa-mem.sorted.bigWig

TEMP_output/ contains output by TEMP. This is the algorithm used in:

Adaptation to P element transposon invasion in Drosophila melanogaster. Cell. 2011 Dec 23;147(7):1551-63.

If feeding this pipeline with -e option and a Fasta file, TEMP will be used to locate the insertion of this sequence in the genome. Our lab constantly uses this function to map transgene insertion site.

# intermediates files
dm3.repBase.fa
/project/umw_phil_zamore/hanb/tmp/GENOME/Theurkauf.GSQ.w1xHar.21day.ovary_1.piPipes_out/bwa_bcftools_output/Theurkauf.GSQ.w1xHar.21day.bwa-aln.sorted.bam
/project/umw_phil_zamore/hanb/tmp/GENOME/Theurkauf.GSQ.w1xHar.21day.ovary_1.piPipes_out/bwa_bcftools_output/Theurkauf.GSQ.w1xHar.21day.bwa-aln.sorted.bam.bai
dm3.repBase.fa.bwt
dm3.repBase.fa.pac
dm3.repBase.fa.ann
dm3.repBase.fa.amb
dm3.repBase.fa.sa
Theurkauf.GSQ.w1xHar.21day.bwa-aln.unpair.uniq.transposons.bed
Theurkauf.GSQ.w1xHar.21day.bwa-aln.unpair.uniq.transposons.filtered.bed
Theurkauf.GSQ.w1xHar.21day.bwa-aln.uniq.transposons.filtered.wGap.class.bed
Theurkauf.GSQ.w1xHar.21day.bwa-aln.insertion.bp.bed
Theurkauf.GSQ.w1xHar.21day.bwa-aln.clipped.reads.aln
Theurkauf.GSQ.w1xHar.21day.bwa-aln.insertion.refined.bp

# files with transposon insertion events
Theurkauf.GSQ.w1xHar.21day.bwa-aln.insertion.refined.bp.summary

$ head Theurkauf.GSQ.w1xHar.21day.bwa-aln.insertion.refined.bp.summary
Chr	Start	End	TransposonName	TransposonDirection	Class	VariantSupport	Frequency	Junction1	Junction1Support	Junction2	Junction2Support	5'_Support	3'_Support
chr2L	27827	28327	DM176	sense	singleton	1	0.0222	28077	0	28077	0	1	0
chr2L	44874	45374	BLOOD_I	sense	singleton	1	0.0294	45124	0	45124	0	1	0
chr2L	290721	291221	POGON1	antisense	singleton	3	0.2500	291182	1	291185	1	0	1
chr2L	290721	291221	POGO	antisense	singleton	3	0.2500	291182	1	291185	1	0	1
chr2L	362397	362897	ROO_LTR	sense	singleton	1	0.0370	362647	0	362647	0	0	1
chr2L	493282	493782	R1_DM	antisense	singleton	1	0.0909	493532	0	493532	0	0	1
chr2L	512615	512722	LTRMDG3_DM	sense	1p1	8	0.3200	512702	3	512702	0	1	4
chr2L	639122	639211	BEL_LTR	antisense	1p1	12	0.4800	639191	2	639195	2	3	5
chr2L	735384	735884	ROO_LTR	antisense	2p	7	0.3500	735883	2	735888	3	0	2

retroSeq_discovering/ contains output of retroSeq, which is another algorithm seeking transposon movement

# intermediate file
exd.file
Theurkauf.GSQ.w1xHar.21day.dm3.bwa.sorted.retroSeq
Theurkauf.GSQ.w1xHar.21day.dm3.bwa-mem.sorted.bam.vcf.PE.vcf.candidates

# this two files contain the identified loci
# see https://github.com/tk2/RetroSeq/wiki/RetroSeq-Tutorial for more information
Theurkauf.GSQ.w1xHar.21day.dm3.bwa-mem.sorted.bam.vcf.PE.vcf
Theurkauf.GSQ.w1xHar.21day.dm3.bwa-mem.sorted.bam.vcf.PE.vcf.bed

break_dancer_out/ contains outputs of Break Dancer for discovering general structural variation in the genome. It is NOT specialized in transposon related insertion/deletion.

# intermediate files
Theurkauf.GSQ.w1xHar.21day.breakdancer
Theurkauf.GSQ.w1xHar.21day.breakdancer.NA.2.fastq
Theurkauf.GSQ.w1xHar.21day.breakdancer.NA.1.fastq
Theurkauf.GSQ.w1xHar.21day.breakdancer.SV.bed
# outputs
# please see https://github.com/kenchen/breakdancer for detailed explanation
Theurkauf.GSQ.w1xHar.21day.breakdancer.out
Theurkauf.GSQ.w1xHar.21day.breakdancer.out.for_Circos

pdfs contains the output pdf Circos plot representing overall looking of TEMP retroSeq and Break Dancer.

# from innert to outer: break dancer, retroSeq and TEMP
# piRNA cluster loci has been marked
Theurkauf.GSQ.w1xHar.21day.summary.circos.pdf

Please see example figures from the Github Wiki page or our manuscript.

##Flowchart and example figures from our manuscript