Psirc (pseudo-alignment identification of circular RNAs) is for back-splicing junction detection, full-length linear and circular transcript isoform inference and quantification from RNA-seq data.
The whole psirc pipeline has two main parts: 1. Detecting back-splicing junctions (BSJs) and inferring full-length isoforms (FLIs); and 2. Quantification of FLIs (both linear and circular FLIs at the same time). Each part works well stand-alone, but it is recommended to use them together.
If you use psirc in your study, please cite:
Ken Hung-On Yu*, Christina Huan Shi*, Bo Wang, Savio Ho-Chit Chow, Grace Tin-Yun Chung, Ke-En Tan, Yat-Yuen Lim, Anna Chi-Man Tsang, Kwok-Wai Lo, Kevin Y. Yip. Quantifying full-length circular RNAs in cancer. Genome Research 31.12 (2021): 2340-2353. Available from: https://genome.cshlp.org/content/31/12/2340.short
- External libraries
- Installation
- Requirements
- Synopsis of requirements
- General usages
- Synopsis of outputs
- Generation of custom_transcriptome_fa
- Details of the FLI sequences output
- Details of the psirc-quant output
- References
- License
- zlib
- HDF5 C libraries
The first part of psirc was implemented with Perl script, which can be run directly. The second part of psirc was implemented with C and C++.
git clone https://github.com/Christina-hshi/psirc.git
cd psirc
cd psirc-quant
#you may need to compile htslib under "ext/htslib" by following the README there ("make install" is optional)
mkdir release
cd release
cmake ..
make psirc-quant
#the psirc-quant program can be found at "src/psirc-quant"
make install (optional)
- Input: paired-end RNA-seq reads
- custom_transcriptome_fa
- Forked kallisto1: Linux or Mac executable
The psirc pipeline has two parts. The first part is the psirc script (for detecting BSJs and inferring FLIs), currently psirc_v1.0.pl which is a Perl script that fully automates the production of the BSJ detection and FLI inference outputs from the above requirements. The second part is psirc-quant which quantifies the abundances of FLIs based upon RNA-seq data.
Input: paired-end RNA-seq reads
The RNA-seq reads should be sequenced from a library preparation strategy that retains circular RNAs (circRNAs), such as ribosomal RNA (rRNA) depletion or exome-capture RNA-seq. We only accept paired-end reads as single-end reads have inherent read density biases and false positive alignments, making them not recommended for circRNA detection2. The paired-end reads need to be in FASTQ format and can be gzipped.
custom_transcriptome_fa
A FASTA file that contains all annotated transcript sequences of a reference sequence, with each sequence having a custom header (indicating various positional, name, and strand information). Ready-to-use custom_transcriptome_fas are available for human and for Epstein-barr virus (EBV) since these were used in our study, although it is very easy to generate one for any well-annotated transcriptome by following the Generation of custom_transcriptome_fa section. The one we generated for human is called "gencode.v29.annotation.custom_transcriptome.fa" and will be used in the General usages section.
Forked kallisto
A forked version of kallisto v0.43.1, which was modified to allow multi-threading. This is the last version which outputs a SAM formatted pseudo-alignment to stdout, allowing the processing of the pseudo-alignment as it is being generated simultaneously, and is the reason why this version is used. The forked kallisto executable is available for Linux or Mac, or can be compiled from the source code. "kallisto" in the General usages section refers to this version of kallisto.
Index the custom_transcriptome_fa (need to be performed once only):
perl psirc_v1.0.pl -i gencode.v29.annotation.custom_transcriptome.fa kallisto
Produce both BSJ detection and FLI inference outputs in a single run (recommended):
perl psirc_v1.0.pl -f -o output_directory gencode.v29.annotation.custom_transcriptome.fa kallisto R1.fastq R2.fastq
Produce BSJ detection output only:
perl psirc_v1.0.pl -o output_directory gencode.v29.annotation.custom_transcriptome.fa kallisto R1.fastq R2.fastq
Produce FLI inference output from the result of BSJ detection output only:
perl psirc_v1.0.pl -s output_directory gencode.v29.annotation.custom_transcriptome.fa kallisto R1.fastq R2.fastq
where gencode.v29.annotation.custom_transcriptome.fa is the custom_transcriptome_fa, kallisto is the forked kallisto, output_directory is the user-specified directory to place the outputs, and R1.fastq R2.fastq are the input paired-end RNA-seq reads.
Index the inferred FLI
We require the header lines of the circular transcripts in fasta format should end with "\tC" to let the program know that they are circular transcripts. And header lines of linear transcripts should not end with "\tC". The outputs produced from psirc_v1.0.pl already meet this requirement, but the outputs produced by other FLI inference tools may not.
psirc-quant index [arguments] <FLI sequences>
Required argument:
-i, --index=STRING Filename for the index to be constructed
Optional argument:
-k, --kmer-size=INT k-mer (odd) length (default: 31, max value: 31)
--make-unique Replace repeated target names with unique names
Quantify FLI:
psirc-quant quant [arguments] R1.fastq R2.fastq
Required arguments:
-i, --index=STRING Filename for the index to be used for
quantification
-o, --output-dir=STRING Directory to write output to
Optional arguments:
--bias Perform sequence based bias correction
-b, --bootstrap-samples=INT Number of bootstrap samples (default: 0)
--seed=INT Seed for the bootstrap sampling (default: 42)
--plaintext Output plaintext instead of HDF5
--fusion Search for fusions for Pizzly
--single Quantify single-end reads
--single-overhang Include reads where unobserved rest of fragment is
predicted to lie outside a transcript
--fr-stranded Strand specific reads, first read forward
--rf-stranded Strand specific reads, first read reverse
-l, --fragment-length=DOUBLE Estimated average fragment length
-s, --sd=DOUBLE Estimated standard deviation of fragment length
-x, --min-fragment-length Minimum length of a valid fragment
-X, --max-fragment-length Maximum length of a valid fragment
(default: -l, -s values are estimated from paired
end data, but are required when using --single)
-t, --threads=INT Number of threads to use (default: 1)
--pseudobam Save pseudoalignments to transcriptome to BAM file
--genomebam Project pseudoalignments to genome sorted BAM file
-g, --gtf GTF file for transcriptome information
(required for --genomebam)
-c, --chromosomes Tab separated file with chrosome names and lengths
(optional for --genomebam, but recommended)
Please note that the <min-fragment-length> and <max-fragment-length> options are crucial when identifying fragments supporting back-splicing junctions. We suggest you first try <fragment-length> +|- 3*<sd> as the min. and max. fragment length.
BSJ transcript list (candidate_circ_junctions.bed)
A list of all the detected BSJ loci and their supporting transcripts (including which exons are back-spliced) in BED format. The first three columns (chr, start, end) are the BSJ loci. The fourth column begins with the BSJ supporting read count, then a ":" separator, followed by the BSJ transcripts of the loci. The read count is the sum of the reads crossing the BSJ and the last-first exons reads of the BSJ transcript. To indicate how many reads are in each category, the fifth column is a value between 0 and 1, crossing BSJ reads / (crossing BSJ reads + last-first exons reads), with 1 indicating all reads are crossing BSJ reads, and 0 indicating all reads are last-first exons reads.
BSJ transcript sequences (candidate_circ_junctions.fa)
A FASTA file containing the sequences of each detected BSJ transcript.
BSJ transcript supporting reads SAM file (candidate_circ_supporting_reads.sam)
All detected BSJ supporting reads mapped to their BSJ transcripts in SAM format. These are pseudo-alignments directly outputted from kallisto. This information is useful for certain analyses, such as visualizing the supporting reads mapped to the BSJ transcript on a genome browser.
FLI list (full_length_isoforms.tsv)
A list of all inferred full-length linear and circular isoforms. For circular isoforms, the information in the BSJ transcript list file is included again. Isoforms with alternative splicing have their alternatively spliced exon numbers and supporting read counts indicated. Isoforms with exactly the same sequence have their names merged with a "," separator.
FLI sequences (full_length_isoforms.fa)
A FASTA file containing the sequences of each inferred FLI. Circular isoforms have an additional tab delimiter preceding a "C" character at the end of the FASTA header line. This is the input file to psirc full-length circular and linear isoform quantification, psirc-quant.
FLI alternative forward-splicing junctions supporting reads SAM file (full_length_isoforms_alt_fsj_supporting_reads.sam) All detected alternatively spliced reads mapped to their full-length isoforms in SAM format. These are again pseudo-alignments directly outputted from kallisto.
Abundances of FLIs (abundance.tsv and abundance.h5) Estimated abundances of FLIs in both tab delimited format("abundance.tsv") and HDF5 format("abundance.h5").
custom_transcriptome_fas are generated through a provided Perl script, create_custom_transcriptome_fa.pl. The only two information required by create_custom_transcriptome_fa.pl are the reference genome FASTA file and its annotation GTF file. The reference genome FASTA file is one fasta file containing all the interested reference sequences of the genome (e.g. for human, chrs 1-22, M, X, and Y, each under a separate header), and the annotation GTF file needs to include at least the ‘exon’ entries of the annotations. The necessary fields that need to be present in each ‘exon’ entry are seqname, start, end, strand, and "gene_name" and "exon_number" in attributes. For GTF files downloadable from official sources (Ensembl, GENCODE, etc.), the exon entries and the necessary fields are already all present.
The FASTA headers in the FLI sequences file contain the names (ID) of each isoform. The ID begins with the transcript name. The ID of each circular transcript isoform ends with the exon connection information in the form of "_EwBx{_yiAzi}*", where w and x indicate the 3' and 5' exons that define the BSJ, and each pair of yi and zi defines an alternative splicing junction that involve non-adjacent exons, and there can be zero, one or more of them.
There are hence four possible types of FLIs in this file: linear, circular, linear alternative, and circular alternative. These can be distinguished simply based on the exon connection information described above. Linear isoforms have no exon connection information at all, circular isoforms only have the "_EwBx" part, linear alternative isoforms only have the "{_yiAzi}*" part, and circular alternative isoforms have the entire exon connection information.
Estimated abundances of FLIs are output in both tab delimited format("abundance.tsv") and HDF5 format("abundance.h5"). Other useful information collected during running of the program can be found in file "run_info.json". psirc-quant quantifies linear and circular FLIs together by maximizing the likelihood of generating the observed sequencing reads from FLIs.
- Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. 2016;34(5):525-7.
- Szabo L, Salzman J. Detecting circular RNAs: bioinformatic and experimental challenges. Nat Rev Genet. 2016;17(11):679-692.
First part of the psirc pipeline for BSJ detection and full-length isoform inference is distributed under the MIT license. Second part of the psirc pipeline for full-length isoform quantification is developed based upon kallisto, and distributed under the BSD 2-Clause license.