Skip to content

RNA seq

Bo Han edited this page Mar 24, 2016 · 5 revisions

piPipes RNA-seq pipeline

This document explains how to run the piPipes RNA-seq pipeline and how to interpret the output.

This pipeline provides analysis on genes and transposons abundance at the transcriptome level using paired-end RNA-seq reads generated by Next Generation Sequencing.

The RNA-seq pipeline contains two modes: single-sample mode and dual-sample mode.

Note: this pipeline can also be utilized for general gene quantification.

Example 1. Run single RNA-seq library

Install the mouse mm9 genome, if you haven't done so

# require internet access
piPipes install -g mm9

Download RNA-seq sample data from NCBI SRA

# use fastq-dump from SRATools (http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software)
# to download data and convert to fastq; require internet access
# --split-3 split the left and right read from paired-end RNA-seq data
fastq-dump -F --split-3 --gzip -A Zamore.RSQ.amyb_het.testis.14dpp.rep1 SRR765641
fastq-dump -F --split-3 --gzip -A Zamore.RSQ.amyb_het.testis.14dpp.rep2 SRR765642
# the two libraries are two replicates from the same genotype

Check usage message

piPipes rna
# or
piPipes rna -h

Using default parameters

# -l: left read of  RNA-seq
# -r: right read of RNA-seq
# -g: genome to use
# -o: output directory
piPipes rna \
  -l Zamore.RSQ.amyb_het.testis.14dpp.rep1_1.fastq.gz \
  -r Zamore.RSQ.amyb_het.testis.14dpp.rep1_2.fastq.gz \
  -g mm9 \
  -o Zamore.RSQ.amyb_het.testis.14dpp.rep1.piPipes_out_SE \
  1> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stdout \
  2> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stderr
  
# Or using single-end mode _1 is in the same direction, so option -L is provided
piPipes rna \
  -i Zamore.RSQ.amyb_het.testis.14dpp.rep1_1.fastq.gz \
  -L \
  -g mm9 \
  -o Zamore.RSQ.amyb_het.testis.14dpp.rep1.piPipes_out \
  1> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stdout \
  2> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stderr
  
piPipes rna \
  -i Zamore.RSQ.amyb_het.testis.14dpp.rep1_2.fastq.gz \
  -g mm9 \
  -o Zamore.RSQ.amyb_het.testis.14dpp.rep1.piPipes_out_SE \
  1> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stdout \
  2> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stderr

Debug mode with more information printed to stderr

piPipes_debug rna \
  -l Zamore.RSQ.amyb_het.testis.14dpp.rep1_1.fastq.gz \
  -r Zamore.RSQ.amyb_het.testis.14dpp.rep1_2.fastq.gz \
  -g mm9 \
  -o Zamore.RSQ.amyb_het.testis.14dpp.rep1.piPipes_out \
  1> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stdout \
  2> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stderr

Run the pipeline with optional parameters

# -l: left read of  RNA-seq
# -r: right read of RNA-seq
# -g: genome to use
# -o: output directory
# -L: ligation based RNA-seq construction method.
#     \1 has the same direction as the original RNA.
#     this option is by default off, meaning that dUTP based method is by default.
# -c: number of CPUs to use; using more CPU will significantly increase the speed
# -B: rounds of batch algorithm to use in eXpress. Use default unless it takes too long
# Read more in:
# Roberts A and Pachter L (2012). Streaming fragment assignment for real-time
# analysis of sequencing experiments. Nature Methods.
piPipes_debug rna \
  -l Zamore.RSQ.amyb_het.testis.14dpp.rep1_1.fastq.gz \
  -r Zamore.RSQ.amyb_het.testis.14dpp.rep1_2.fastq.gz \
  -g mm9 \
  -c 24 \
  -B 15 \
  -o Zamore.RSQ.amyb_het.testis.14dpp.rep1.piPipes_out \
  1> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stdout \
  2> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stderr

Interpretation of the output files

The output directory should contain the following directories and files:

rRNA_mapping/
input_read_files/
cufflinks_output/
bigWig/
htseq_count/
genome_mapping/
Zamore.RSQ.amyb_het.testis.14dpp.basic_stats
pdfs/
direct_transcriptome_mapping/

Zamore.RSQ.amyb_het.testis.14dpp.basic_stats contains the basic statistical information on the library

$ cat Zamore.RSQ.amyb_het.testis.14dpp.basic_stats
total_input_reads:	32594566
rRNA_reads:	722247
genomie_mapper_reads:	26732381
genomie_unique_mapper_reads:	23749430
genomie_multiple_mapper_reads:	2982951
genomie_unmappable_reads:	5139938

rRNA_mapping/ contains log file of rRNA mapping (Bowtie2)

$ cat Zamore.RSQ.amyb_het.testis.14dpp.rRNA.log
32594566 reads; of these:
  32594566 (100.00%) were paired; of these:
    31872319 (97.78%) aligned concordantly 0 times
    722247 (2.22%) aligned concordantly exactly 1 time
    0 (0.00%) aligned concordantly >1 times
2.22% overall alignment rate

input_read_files/ contains input Fastq file for genome mapping by STAR

# x_rRNA means rRNA mappable sequences have been removed
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.1.fq
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.2.fq

genome_mapping/ contains data for genome mapping

# outputs of STAR. Please see its document for detailed explanation
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.SJ.out.tab
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Log.progress.out
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Log.out
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Log.final.out

# unmappable sequence in Fastq format, might be used to map sequences that
# are not in the assembled genome, like transgene
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Unmapped.out.mate1
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Unmapped.out.mate2
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.STAR.log

# genome alignment file, sorted by COORDINATES
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.sorted.bam
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.sorted.bam.bai

# genome aligment file in normalized.bedpe format and bed12 format
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.all.normalized.bedpe
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.unique.normalized.bedpe
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.sorted.unique.bed12

# normalized bedpe format is a modified BEDPE format (please see BEDTools document for BEDPE)
# briefly, each line represents an alignment for a pair.
# each field, from left to right, represents:
#   chromosome of mate1
#   start of mate1 (0-based, included)
#   end of mate1 (0-based, not included)
#   chromosome of mate2
#   start of mate2 (0-based, included)
#   end of mate2 (0-based, not included)
#   name of this pair, different from small RNA pipeline, reads in RNA-seq pipeline are not compiled
#   normalized read. If this pair can be mapped to N loci in the genome, this number if 1/N
# Note that we called STAR using --outFilterMultimapScoreRange 1, thus only the best mappers
# are reported
# strand of mate1
# strand of mate2
$ head Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.all.normalized.bedpe
chr15	89135450	89135627	chr15	89135620	89135802	FCC1GJJACXX:6:1101:15553:94247	1	+	-
chr5	64702364	64702464	chr5	64702308	64702408	FCC1GJJACXX:6:1101:16706:94222	1	-	+
chr5	54045627	54045727	chr5	54045421	54045521	FCC1GJJACXX:6:1101:15621:94056	1	-	+
chr9	21440366	21440708	chr9	21440233	21440333	FCC1GJJACXX:6:1101:17366:94052	1	-	+
chr3	97943583	97945464	chr3	97943552	97945433	FCC1GJJACXX:6:1101:16755:94111	1	-	+
chrM	5706	5806	chrM	5596	5696	FCC1GJJACXX:6:1101:17540:94078	1	-	+
chr15	12195313	12195413	chr15	12193559	12195305	FCC1GJJACXX:6:1101:16406:94075	1	-	+
chrM	1230	1330	chrM	1176	1276	FCC1GJJACXX:6:1101:16781:94117	1	-	+
chr12	70260428	70260528	chr12	70260297	70260397	FCC1GJJACXX:6:1101:17995:94098	0.5	-	+
chr12	70462221	70462321	chr12	70462352	70462452	FCC1GJJACXX:6:1101:17995:94098	0.5	+	-

# normalized bedpe file with only unique mappers
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.unique.normalized.bedpe

# unique mappers in bed12 format. \1 and \2 occupy two lines
# the strand has been reset to follow the original RNA:
# for dUTP method, \1 has been reversed; for ligation method, \2 has been reversed
$ head -2 Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.sorted.unique.bed12
chr10	3000542	3000642	FCC1GJJACXX:6:1202:3559:62122/2	1	+	3000542	3000642	255,0,0	1	100	0
chr10	3000597	3000697	FCC1GJJACXX:6:1202:3559:62122/1	1	+	3000597	3000697	255,0,0	1	100	0

cufflinks_output/ contains the output from Cufflinks

No transposon sequence is included in this GTF file, so the output can be used for general purpose

# piPipes ran cufflinks using gene only GTF without transposon annotation
# as well as --compatible-hits-norm option.
# the depth calculated by cufflinks is used in the entire pipeline to represent depth
# in most of piRNA mutants, transposon reads have huge amount of increase
# using gene compatible reads is less biased than using total reads

# see cufflinks document for more information
skipped.gtf
transcripts.gtf
isoforms.fpkm_tracking
# this document contains FPKM values for annotated gene
genes.fpkm_tracking
Zamore.RSQ.amyb_het.testis.14dpp.cufflinks.log

bigWig/ contains bigWig files for UCSC genome browser. The signal contains unique mappers only and has been normalized to the library depth, which is calculated by cufflinks using annotation compatible reads.

# Watson and Crick strands are separated
# bedGraph files are intermediates file. they might be removed in the future version

Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.sorted.unique.Crick.bedGraph
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.sorted.unique.Watson.bedGraph
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.sorted.unique.Crick.bigWig
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.sorted.unique.Watson.bigWig

htseq_count/ contains outputs from HTSeq

# HTSeq is an alternative abundance calculation tool of Cufflinks and eXpress
# Different from Cufflinks and eXpress, which use EM-algorithm to assign multi-mappers to different
# transcript isoforms, HTSeq only count unique mappers.

# We uses a annotation comprised of Genes, piRNA cluster and repBase (repeatMasker)
# strict and union are two ways HTSeq supports, please see HTSeq document for more information

# S and AS represent sense and antisense;
# We realized that in dUTP method, ~10% of reads lost their strand information
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Genes_repBase_Cluster.htseqcount.strict.S.out
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Genes_repBase_Cluster.htseqcount.strict.AS.out
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Genes_repBase_Cluster.htseqcount.union.S.out
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Genes_repBase_Cluster.htseqcount.union.AS.out

direct_transcriptome_mapping/ contains output from eXpress Different from the "genome mapping + Cufflinks/HTSeq" strategy, this method directly aligns input reads to transcriptome and estimates transcript abundance from there. Same as Cufflinks, eXpress also uses "EM algorithm" for multiple-mappers assignment.

# direct alignment by Bowtie2
Zamore.RSQ.amyb_het.testis.14dpp.gene+transposon.log
Zamore.RSQ.amyb_het.testis.14dpp.gene+transposon.bam
Zamore.RSQ.amyb_het.testis.14dpp.gene+transposon.sorted.unique.bed

# transposon sizes file, required to make bigWig
transposon.sizes

# bigWig file for each transcripts (gene, piRNA cluster and repBase transposon)
Zamore.RSQ.amyb_het.testis.14dpp.gene+cluster+repBase.sorted.unique.plus.bedGraph
Zamore.RSQ.amyb_het.testis.14dpp.gene+cluster+repBase.sorted.unique.minus.bedGraph
Zamore.RSQ.amyb_het.testis.14dpp.gene+cluster+repBase.sorted.unique.plus.bigWig
Zamore.RSQ.amyb_het.testis.14dpp.gene+cluster+repBase.sorted.unique.minus.bigWig

# ParaFly file, ignore
bigWigSummary.para
bigWigSummary.para.completed

# summary file used to draw plots representing signal across different
Zamore.RSQ.amyb_het.testis.14dpp.gene+cluster+repBase.sorted.unique.bigWig.summary

# output of eXpress calculation. No normalization is used in results.xprs.
# Please see its document for detailed explanation.
# Note that the data will be organized and used in the dual-sample mode
# we suggest to look at data there
results.xprs
params.xprs
Zamore.RSQ.amyb_het.testis.14dpp.gene+cluster+repBase.eXpress.log

# normalized results, using the gene compatible reads calculated by Cufflinks
results.xprs.normalized

Example 2. Run dual-sample mode to compare RNA-seq libraries from two samples, each with two biological replicates

Run single-sample mode for each library

# download all the data: two replicates for each samples mybl1+/- and mybl1-/-
fastq-dump -F --split-3 --gzip -A Zamore.RSQ.amyb_het.testis.14dpp.rep1 SRR765641
fastq-dump -F --split-3 --gzip -A Zamore.RSQ.amyb_het.testis.14dpp.rep2 SRR765642
fastq-dump -F --split-3 --gzip -A Zamore.RSQ.amyb_mut.testis.14dpp.rep1 SRR765647
fastq-dump -F --split-3 --gzip -A Zamore.RSQ.amyb_mut.testis.14dpp.rep2 SRR765648

# run the four libraries separately
piPipes rna \
  -l Zamore.RSQ.amyb_het.testis.14dpp.rep1_1.fastq.gz \
  -r Zamore.RSQ.amyb_het.testis.14dpp.rep1_2.fastq.gz \
  -g mm9 \
  -o Zamore.RSQ.amyb_het.testis.14dpp.rep1.piPipes_out \
  1> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stdout \
  2> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stderr
piPipes rna \
  -l Zamore.RSQ.amyb_het.testis.14dpp.rep2_1.fastq.gz \
  -r Zamore.RSQ.amyb_het.testis.14dpp.rep2_2.fastq.gz \
  -g mm9 \
  -o Zamore.RSQ.amyb_het.testis.14dpp.rep2.piPipes_out \
  1> Zamore.RSQ.amyb_het.testis.14dpp.rep2.stdout \
  2> Zamore.RSQ.amyb_het.testis.14dpp.rep2.stderr
piPipes rna \
  -l Zamore.RSQ.amyb_mut.testis.14dpp.rep1_1.fastq.gz \
  -r Zamore.RSQ.amyb_mut.testis.14dpp.rep1_2.fastq.gz \
  -g mm9 \
  -o Zamore.RSQ.amyb_mut.testis.14dpp.rep1.piPipes_out \
  1> Zamore.RSQ.amyb_mut.testis.14dpp.rep1.stdout \
  2> Zamore.RSQ.amyb_mut.testis.14dpp.rep1.stderr
piPipes rna \
  -l Zamore.RSQ.amyb_mut.testis.14dpp.rep2_1.fastq.gz \
  -r Zamore.RSQ.amyb_mut.testis.14dpp.rep2_2.fastq.gz \
  -g mm9 \
  -o Zamore.RSQ.amyb_mut.testis.14dpp.rep2.piPipes_out \
  1> Zamore.RSQ.amyb_mut.testis.14dpp.rep2.stdout \
  2> Zamore.RSQ.amyb_mut.testis.14dpp.rep2.stderr

Run dual-sample mode

# -a: comma delimited directories with RNA-seq single library mode, for sample A
# -b: comma delimited directories with RNA-seq single library mode, for sample B
# -g: mouse mm9 genome
# -o: output directory
# -A: name for sample A, used in output
# -B: name for sample B, used in output
piPipes rna2 \
  -a Zamore.RSQ.amyb_het.testis.14dpp.rep1.piPipes_out,Zamore.RSQ.amyb_het.testis.14dpp.rep2.piPipes_out \
  -b Zamore.RSQ.amyb_mut.testis.14dpp.rep1.piPipes_out,Zamore.RSQ.amyb_mut.testis.14dpp.rep2.piPipes_out \
  -g mm9 \
  -o Zamore.RSQ.amyb_het.vs.amyb_mut.14dpp \
  -A amybHet14 \
  -B amybMut14 \
  1> Zamore.RSQ.amyb_het.vs.amyb_mut.14dpp.piPipes.stdout \
  2> Zamore.RSQ.amyb_het.vs.amyb_mut.14dpp.piPipes.stderr

Interpretation of the output files

The dual-sample mode is very simple:

For genes quantification, it uses Cuffdiff to call differentially expressed genes/transcripts from genome-mapping approach and cummeRbund to generate figures.

For transposon quantification, it generates scatter-plot from eXpress output.

# output
cuffdiff_output/
pdfs/
amybHet14.results.xprs
amybMut14.results.xprs

cuffdiff_output/ contains output of Cuffdiff.

# Cuffdiff and cummeRbund output
var_model.info
amyb.cuffdiff.log
isoform_exp.diff
tss_group_exp.diff
gene_exp.diff
cds_exp.diff
splicing.diff
promoters.diff
cds.diff
isoforms.fpkm_tracking
tss_groups.fpkm_tracking
cds.fpkm_tracking
# fpkm values for genes
genes.fpkm_tracking
# fpkm values for isoforms
isoforms.count_tracking
tss_groups.count_tracking
cds.count_tracking
genes.count_tracking
isoforms.read_group_tracking
tss_groups.read_group_tracking
cds.read_group_tracking
genes.read_group_tracking
read_groups.info
run.info
bias_params.info
cuffData.db

pdfs/ contains pdfs

# the following files are generated by Cuffdiff + cummeRbund, for genes
# they should be self-explanatory
# please see cummeRbund document for detailed information
amyb.genes.csDensity.pdf
amyb.genes.csBoxplot.pdf
amyb_amybHet14_vs_amybMut14.genes.csScatter.pdf
amyb_amybHet14_vs_amybMut14.genes.csVolcano.pdf
amyb.genes.csHeatMap_top50.pdf
amyb.genes.csExpressionBarplot_top50.pdf
amyb.isoforms.csDensity.pdf
amyb.cummeRbund.log
amyb.isoforms.csBoxplot.pdf

# the following files are generated from eXpress output; for gene + piRNA cluster + transposon
# summary file
amybHet14_vs_amybMut14.gene_transposon_cluster.abundance.csv
# this scatter-plot has each dot representing a gene isoforms, a non-coding RNA, piRNA cluster and 
# transposon usually we notice that gene and non-coding RNA align pretty well on the diagonal
# but transposon reads shifted pretty badly to one side
amybHet14_vs_amybMut14.gene_transposon_cluster.abundance1.pdf
# this is a simplified version using contour lines to represent gene and non-coding RNA transcripts
# because we found that there are too many dots in the previous pdf and very hard to manipulate/view
amybHet14_vs_amybMut14.gene_transposon_cluster.abundance2.pdf

Please see example figures from the Github Wiki page or our manuscript.

##Flowchart and example figures from our manuscript