RNA seq

piPipes RNA-seq pipeline

This document explains how to run the piPipes RNA-seq pipeline and how to interpret the output.

This pipeline provides analysis on genes and transposons abundance at the transcriptome level using paired-end RNA-seq reads generated by Next Generation Sequencing.

The RNA-seq pipeline contains two modes: single-sample mode and dual-sample mode.

Note: this pipeline can also be utilized for general gene quantification.

Example 1. Run single RNA-seq library

Install the mouse mm9 genome, if you haven't done so

# require internet access
piPipes install -g mm9

Download RNA-seq sample data from NCBI SRA

# use fastq-dump from SRATools (http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software)
# to download data and convert to fastq; require internet access
# --split-3 split the left and right read from paired-end RNA-seq data
fastq-dump -F --split-3 --gzip -A Zamore.RSQ.amyb_het.testis.14dpp.rep1 SRR765641
fastq-dump -F --split-3 --gzip -A Zamore.RSQ.amyb_het.testis.14dpp.rep2 SRR765642
# the two libraries are two replicates from the same genotype

Check usage message

piPipes rna
# or
piPipes rna -h

Using default parameters

# -l: left read of  RNA-seq
# -r: right read of RNA-seq
# -g: genome to use
# -o: output directory
piPipes rna \
  -l Zamore.RSQ.amyb_het.testis.14dpp.rep1_1.fastq.gz \
  -r Zamore.RSQ.amyb_het.testis.14dpp.rep1_2.fastq.gz \
  -g mm9 \
  -o Zamore.RSQ.amyb_het.testis.14dpp.rep1.piPipes_out_SE \
  1> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stdout \
  2> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stderr
  
# Or using single-end mode _1 is in the same direction, so option -L is provided
piPipes rna \
  -i Zamore.RSQ.amyb_het.testis.14dpp.rep1_1.fastq.gz \
  -L \
  -g mm9 \
  -o Zamore.RSQ.amyb_het.testis.14dpp.rep1.piPipes_out \
  1> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stdout \
  2> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stderr
  
piPipes rna \
  -i Zamore.RSQ.amyb_het.testis.14dpp.rep1_2.fastq.gz \
  -g mm9 \
  -o Zamore.RSQ.amyb_het.testis.14dpp.rep1.piPipes_out_SE \
  1> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stdout \
  2> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stderr

Debug mode with more information printed to stderr

piPipes_debug rna \
  -l Zamore.RSQ.amyb_het.testis.14dpp.rep1_1.fastq.gz \
  -r Zamore.RSQ.amyb_het.testis.14dpp.rep1_2.fastq.gz \
  -g mm9 \
  -o Zamore.RSQ.amyb_het.testis.14dpp.rep1.piPipes_out \
  1> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stdout \
  2> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stderr

Run the pipeline with optional parameters

# -l: left read of  RNA-seq
# -r: right read of RNA-seq
# -g: genome to use
# -o: output directory
# -L: ligation based RNA-seq construction method.
#     \1 has the same direction as the original RNA.
#     this option is by default off, meaning that dUTP based method is by default.
# -c: number of CPUs to use; using more CPU will significantly increase the speed
# -B: rounds of batch algorithm to use in eXpress. Use default unless it takes too long
# Read more in:
# Roberts A and Pachter L (2012). Streaming fragment assignment for real-time
# analysis of sequencing experiments. Nature Methods.
piPipes_debug rna \
  -l Zamore.RSQ.amyb_het.testis.14dpp.rep1_1.fastq.gz \
  -r Zamore.RSQ.amyb_het.testis.14dpp.rep1_2.fastq.gz \
  -g mm9 \
  -c 24 \
  -B 15 \
  -o Zamore.RSQ.amyb_het.testis.14dpp.rep1.piPipes_out \
  1> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stdout \
  2> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stderr

Interpretation of the output files

The output directory should contain the following directories and files:

rRNA_mapping/
input_read_files/
cufflinks_output/
bigWig/
htseq_count/
genome_mapping/
Zamore.RSQ.amyb_het.testis.14dpp.basic_stats
pdfs/
direct_transcriptome_mapping/

Zamore.RSQ.amyb_het.testis.14dpp.basic_stats contains the basic statistical information on the library

$ cat Zamore.RSQ.amyb_het.testis.14dpp.basic_stats
total_input_reads:	32594566
rRNA_reads:	722247
genomie_mapper_reads:	26732381
genomie_unique_mapper_reads:	23749430
genomie_multiple_mapper_reads:	2982951
genomie_unmappable_reads:	5139938

rRNA_mapping/ contains log file of rRNA mapping (Bowtie2)

$ cat Zamore.RSQ.amyb_het.testis.14dpp.rRNA.log
32594566 reads; of these:
  32594566 (100.00%) were paired; of these:
    31872319 (97.78%) aligned concordantly 0 times
    722247 (2.22%) aligned concordantly exactly 1 time
    0 (0.00%) aligned concordantly >1 times
2.22% overall alignment rate

input_read_files/ contains input Fastq file for genome mapping by STAR

# x_rRNA means rRNA mappable sequences have been removed
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.1.fq
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.2.fq

genome_mapping/ contains data for genome mapping

# outputs of STAR. Please see its document for detailed explanation
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.SJ.out.tab
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Log.progress.out
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Log.out
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Log.final.out

# unmappable sequence in Fastq format, might be used to map sequences that
# are not in the assembled genome, like transgene
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Unmapped.out.mate1
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Unmapped.out.mate2
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.STAR.log

# genome alignment file, sorted by COORDINATES
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.sorted.bam
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.sorted.bam.bai

# genome aligment file in normalized.bedpe format and bed12 format
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.all.normalized.bedpe
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.unique.normalized.bedpe
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.sorted.unique.bed12

# normalized bedpe format is a modified BEDPE format (please see BEDTools document for BEDPE)
# briefly, each line represents an alignment for a pair.
# each field, from left to right, represents:
#   chromosome of mate1
#   start of mate1 (0-based, included)
#   end of mate1 (0-based, not included)
#   chromosome of mate2
#   start of mate2 (0-based, included)
#   end of mate2 (0-based, not included)
#   name of this pair, different from small RNA pipeline, reads in RNA-seq pipeline are not compiled
#   normalized read. If this pair can be mapped to N loci in the genome, this number if 1/N
# Note that we called STAR using --outFilterMultimapScoreRange 1, thus only the best mappers
# are reported
# strand of mate1
# strand of mate2
$ head Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.all.normalized.bedpe
chr15	89135450	89135627	chr15	89135620	89135802	FCC1GJJACXX:6:1101:15553:94247	1	+	-
chr5	64702364	64702464	chr5	64702308	64702408	FCC1GJJACXX:6:1101:16706:94222	1	-	+
chr5	54045627	54045727	chr5	54045421	54045521	FCC1GJJACXX:6:1101:15621:94056	1	-	+
chr9	21440366	21440708	chr9	21440233	21440333	FCC1GJJACXX:6:1101:17366:94052	1	-	+
chr3	97943583	97945464	chr3	97943552	97945433	FCC1GJJACXX:6:1101:16755:94111	1	-	+
chrM	5706	5806	chrM	5596	5696	FCC1GJJACXX:6:1101:17540:94078	1	-	+
chr15	12195313	12195413	chr15	12193559	12195305	FCC1GJJACXX:6:1101:16406:94075	1	-	+
chrM	1230	1330	chrM	1176	1276	FCC1GJJACXX:6:1101:16781:94117	1	-	+
chr12	70260428	70260528	chr12	70260297	70260397	FCC1GJJACXX:6:1101:17995:94098	0.5	-	+
chr12	70462221	70462321	chr12	70462352	70462452	FCC1GJJACXX:6:1101:17995:94098	0.5	+	-

# normalized bedpe file with only unique mappers
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.unique.normalized.bedpe

# unique mappers in bed12 format. \1 and \2 occupy two lines
# the strand has been reset to follow the original RNA:
# for dUTP method, \1 has been reversed; for ligation method, \2 has been reversed
$ head -2 Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.sorted.unique.bed12
chr10	3000542	3000642	FCC1GJJACXX:6:1202:3559:62122/2	1	+	3000542	3000642	255,0,0	1	100	0
chr10	3000597	3000697	FCC1GJJACXX:6:1202:3559:62122/1	1	+	3000597	3000697	255,0,0	1	100	0

cufflinks_output/ contains the output from Cufflinks

No transposon sequence is included in this GTF file, so the output can be used for general purpose

# piPipes ran cufflinks using gene only GTF without transposon annotation
# as well as --compatible-hits-norm option.
# the depth calculated by cufflinks is used in the entire pipeline to represent depth
# in most of piRNA mutants, transposon reads have huge amount of increase
# using gene compatible reads is less biased than using total reads

# see cufflinks document for more information
skipped.gtf
transcripts.gtf
isoforms.fpkm_tracking
# this document contains FPKM values for annotated gene
genes.fpkm_tracking
Zamore.RSQ.amyb_het.testis.14dpp.cufflinks.log

bigWig/ contains bigWig files for UCSC genome browser. The signal contains unique mappers only and has been normalized to the library depth, which is calculated by cufflinks using annotation compatible reads.

# Watson and Crick strands are separated
# bedGraph files are intermediates file. they might be removed in the future version

Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.sorted.unique.Crick.bedGraph
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.sorted.unique.Watson.bedGraph
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.sorted.unique.Crick.bigWig
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.sorted.unique.Watson.bigWig

htseq_count/ contains outputs from HTSeq

# HTSeq is an alternative abundance calculation tool of Cufflinks and eXpress
# Different from Cufflinks and eXpress, which use EM-algorithm to assign multi-mappers to different
# transcript isoforms, HTSeq only count unique mappers.

# We uses a annotation comprised of Genes, piRNA cluster and repBase (repeatMasker)
# strict and union are two ways HTSeq supports, please see HTSeq document for more information

# S and AS represent sense and antisense;
# We realized that in dUTP method, ~10% of reads lost their strand information
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Genes_repBase_Cluster.htseqcount.strict.S.out
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Genes_repBase_Cluster.htseqcount.strict.AS.out
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Genes_repBase_Cluster.htseqcount.union.S.out
Zamore.RSQ.amyb_het.testis.14dpp.x_rRNA.mm9.Genes_repBase_Cluster.htseqcount.union.AS.out

direct_transcriptome_mapping/ contains output from eXpress Different from the "genome mapping + Cufflinks/HTSeq" strategy, this method directly aligns input reads to transcriptome and estimates transcript abundance from there. Same as Cufflinks, eXpress also uses "EM algorithm" for multiple-mappers assignment.

# direct alignment by Bowtie2
Zamore.RSQ.amyb_het.testis.14dpp.gene+transposon.log
Zamore.RSQ.amyb_het.testis.14dpp.gene+transposon.bam
Zamore.RSQ.amyb_het.testis.14dpp.gene+transposon.sorted.unique.bed

# transposon sizes file, required to make bigWig
transposon.sizes

# bigWig file for each transcripts (gene, piRNA cluster and repBase transposon)
Zamore.RSQ.amyb_het.testis.14dpp.gene+cluster+repBase.sorted.unique.plus.bedGraph
Zamore.RSQ.amyb_het.testis.14dpp.gene+cluster+repBase.sorted.unique.minus.bedGraph
Zamore.RSQ.amyb_het.testis.14dpp.gene+cluster+repBase.sorted.unique.plus.bigWig
Zamore.RSQ.amyb_het.testis.14dpp.gene+cluster+repBase.sorted.unique.minus.bigWig

# ParaFly file, ignore
bigWigSummary.para
bigWigSummary.para.completed

# summary file used to draw plots representing signal across different
Zamore.RSQ.amyb_het.testis.14dpp.gene+cluster+repBase.sorted.unique.bigWig.summary

# output of eXpress calculation. No normalization is used in results.xprs.
# Please see its document for detailed explanation.
# Note that the data will be organized and used in the dual-sample mode
# we suggest to look at data there
results.xprs
params.xprs
Zamore.RSQ.amyb_het.testis.14dpp.gene+cluster+repBase.eXpress.log

# normalized results, using the gene compatible reads calculated by Cufflinks
results.xprs.normalized

Example 2. Run dual-sample mode to compare RNA-seq libraries from two samples, each with two biological replicates

Run single-sample mode for each library

# download all the data: two replicates for each samples mybl1+/- and mybl1-/-
fastq-dump -F --split-3 --gzip -A Zamore.RSQ.amyb_het.testis.14dpp.rep1 SRR765641
fastq-dump -F --split-3 --gzip -A Zamore.RSQ.amyb_het.testis.14dpp.rep2 SRR765642
fastq-dump -F --split-3 --gzip -A Zamore.RSQ.amyb_mut.testis.14dpp.rep1 SRR765647
fastq-dump -F --split-3 --gzip -A Zamore.RSQ.amyb_mut.testis.14dpp.rep2 SRR765648

# run the four libraries separately
piPipes rna \
  -l Zamore.RSQ.amyb_het.testis.14dpp.rep1_1.fastq.gz \
  -r Zamore.RSQ.amyb_het.testis.14dpp.rep1_2.fastq.gz \
  -g mm9 \
  -o Zamore.RSQ.amyb_het.testis.14dpp.rep1.piPipes_out \
  1> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stdout \
  2> Zamore.RSQ.amyb_het.testis.14dpp.rep1.stderr
piPipes rna \
  -l Zamore.RSQ.amyb_het.testis.14dpp.rep2_1.fastq.gz \
  -r Zamore.RSQ.amyb_het.testis.14dpp.rep2_2.fastq.gz \
  -g mm9 \
  -o Zamore.RSQ.amyb_het.testis.14dpp.rep2.piPipes_out \
  1> Zamore.RSQ.amyb_het.testis.14dpp.rep2.stdout \
  2> Zamore.RSQ.amyb_het.testis.14dpp.rep2.stderr
piPipes rna \
  -l Zamore.RSQ.amyb_mut.testis.14dpp.rep1_1.fastq.gz \
  -r Zamore.RSQ.amyb_mut.testis.14dpp.rep1_2.fastq.gz \
  -g mm9 \
  -o Zamore.RSQ.amyb_mut.testis.14dpp.rep1.piPipes_out \
  1> Zamore.RSQ.amyb_mut.testis.14dpp.rep1.stdout \
  2> Zamore.RSQ.amyb_mut.testis.14dpp.rep1.stderr
piPipes rna \
  -l Zamore.RSQ.amyb_mut.testis.14dpp.rep2_1.fastq.gz \
  -r Zamore.RSQ.amyb_mut.testis.14dpp.rep2_2.fastq.gz \
  -g mm9 \
  -o Zamore.RSQ.amyb_mut.testis.14dpp.rep2.piPipes_out \
  1> Zamore.RSQ.amyb_mut.testis.14dpp.rep2.stdout \
  2> Zamore.RSQ.amyb_mut.testis.14dpp.rep2.stderr

Run dual-sample mode

# -a: comma delimited directories with RNA-seq single library mode, for sample A
# -b: comma delimited directories with RNA-seq single library mode, for sample B
# -g: mouse mm9 genome
# -o: output directory
# -A: name for sample A, used in output
# -B: name for sample B, used in output
piPipes rna2 \
  -a Zamore.RSQ.amyb_het.testis.14dpp.rep1.piPipes_out,Zamore.RSQ.amyb_het.testis.14dpp.rep2.piPipes_out \
  -b Zamore.RSQ.amyb_mut.testis.14dpp.rep1.piPipes_out,Zamore.RSQ.amyb_mut.testis.14dpp.rep2.piPipes_out \
  -g mm9 \
  -o Zamore.RSQ.amyb_het.vs.amyb_mut.14dpp \
  -A amybHet14 \
  -B amybMut14 \
  1> Zamore.RSQ.amyb_het.vs.amyb_mut.14dpp.piPipes.stdout \
  2> Zamore.RSQ.amyb_het.vs.amyb_mut.14dpp.piPipes.stderr

Interpretation of the output files

The dual-sample mode is very simple:

For genes quantification, it uses Cuffdiff to call differentially expressed genes/transcripts from genome-mapping approach and cummeRbund to generate figures.

For transposon quantification, it generates scatter-plot from eXpress output.

# output
cuffdiff_output/
pdfs/
amybHet14.results.xprs
amybMut14.results.xprs

cuffdiff_output/ contains output of Cuffdiff.

# Cuffdiff and cummeRbund output
var_model.info
amyb.cuffdiff.log
isoform_exp.diff
tss_group_exp.diff
gene_exp.diff
cds_exp.diff
splicing.diff
promoters.diff
cds.diff
isoforms.fpkm_tracking
tss_groups.fpkm_tracking
cds.fpkm_tracking
# fpkm values for genes
genes.fpkm_tracking
# fpkm values for isoforms
isoforms.count_tracking
tss_groups.count_tracking
cds.count_tracking
genes.count_tracking
isoforms.read_group_tracking
tss_groups.read_group_tracking
cds.read_group_tracking
genes.read_group_tracking
read_groups.info
run.info
bias_params.info
cuffData.db

pdfs/ contains pdfs

# the following files are generated by Cuffdiff + cummeRbund, for genes
# they should be self-explanatory
# please see cummeRbund document for detailed information
amyb.genes.csDensity.pdf
amyb.genes.csBoxplot.pdf
amyb_amybHet14_vs_amybMut14.genes.csScatter.pdf
amyb_amybHet14_vs_amybMut14.genes.csVolcano.pdf
amyb.genes.csHeatMap_top50.pdf
amyb.genes.csExpressionBarplot_top50.pdf
amyb.isoforms.csDensity.pdf
amyb.cummeRbund.log
amyb.isoforms.csBoxplot.pdf

# the following files are generated from eXpress output; for gene + piRNA cluster + transposon
# summary file
amybHet14_vs_amybMut14.gene_transposon_cluster.abundance.csv
# this scatter-plot has each dot representing a gene isoforms, a non-coding RNA, piRNA cluster and 
# transposon usually we notice that gene and non-coding RNA align pretty well on the diagonal
# but transposon reads shifted pretty badly to one side
amybHet14_vs_amybMut14.gene_transposon_cluster.abundance1.pdf
# this is a simplified version using contour lines to represent gene and non-coding RNA transcripts
# because we found that there are too many dots in the previous pdf and very hard to manipulate/view
amybHet14_vs_amybMut14.gene_transposon_cluster.abundance2.pdf

Please see example figures from the Github Wiki page or our manuscript.

##Flowchart and example figures from our manuscript

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RNA seq

piPipes RNA-seq pipeline

Example 1. Run single RNA-seq library

Install the mouse mm9 genome, if you haven't done so

Download RNA-seq sample data from NCBI SRA

Check usage message

Using default parameters

Debug mode with more information printed to stderr

Run the pipeline with optional parameters

Interpretation of the output files

Example 2. Run dual-sample mode to compare RNA-seq libraries from two samples, each with two biological replicates

Run single-sample mode for each library

Run dual-sample mode

Interpretation of the output files

Clone this wiki locally