An R
package for analysis of sRNA & mRNA populations in plants and animals. Specalised in the identification of RNA populations in systems with multiple genomes, such as plant graft systems.
Katie Jeynes-Cupper, University of Illinois Urbana-Champaign, kejc@illinois.edu Marco Catoni, University of Birmingham
Recently we were made aware of a bug in the RNAmergeGenomes function which has now been fixed. Similarly, we introduced a new plotting function to plot the distribution of sRNAs across genomic features.
mobileRNA is an R
package that provides a pipeline for
pre-processing and analysis of small RNA (sRNA) and messenger RNA (mRNA)
sequencing data, primarily for the identification of mobile RNAs in plant graft
systems. But, likely has other applications for systems containing more than
one genotype, such as dual-host systems, to identify RNA molecules produced by
each genotypes. mobileRNA also supports routine treatment vs control
analysis to identify changes in sRNA population (abundance & production). These
two workflows have be separated as mobile RNA analysis & the core RNA analysis,
respectively (Figure 1).
Most available genomics approaches to identify RNA molecules produced by two different genotypes in a biological sample involve the alignment on a genotype of reference and post alignment screening of genetic variants. These methods have many possible limitations. mobileRNA aims to circumvents such problems with an alignment step which is simultaneously performed on both genomes involved. This approach considers that alignment tools already implement an algorithm ideal for identifying the best matches of reads to a given genome reference, but they do not account for potential matches to DNA sequences which are not provided as reference. The downstream analysis allows for differential analysis for the comparison in abundance, the identification of changes in RNA production and the identification of putative mobile RNA molecules via the filtering of RNA molecules which uniquely map to the mobile genome of to the other grafting partner involved, which would represent the putative mobile RNA molecules.
Figure 1: Basic diagram of mobileRNA workflows.
The latest version of the package can be install directly from this GitHub repo:
if (!require("devtools")) install.packages("devtools")
devtools::install_github("KJeynesCupper/mobileRNA", ref = "main")
To install from Bioconductor:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("mobileRNA")
Load package into R library:
library(mobileRNA)
In order to utilise the mobileRNA method, please install the following OS dependencies within a Conda environment:
For sRNA data, ShortStack
(>= 4.0) (Axtell 2013) is requires. Please consider
that ShortStack
is not available for Windows, hence, Windows users will either
need to opt to use a virtual machine or
Windows Subsystem for Linux
In either case, both R
and ShortStack
will need to be installed and used on
the Linux side. Please head to ShortStack
to see the recommended installation instructions with Conda.
This will ensure all dependencies are available within the same environment.
For mRNA data, HISAT2 (Kim 2015), HTSeq (Anders, Pyl, and Huber 2014), SAMtools (Danecek P 2021) are required within the same Conda environment [@Anaconda].
The example data set included with mobileRNA simulates grafting between eggplant and tomato where the scion is eggplant and tomato is the rootstock. There is example mRNAseq and sRNAseq files which represent samples taken from an eggplant leaf tissue. Here we will locate sRNA produced by the tomato rootstock which have traveled into the eggplant leaves
Merge the FASTA genome assemblies of tomato and eggplant into a single reference file stored in your desired directory.
fasta_1 <- system.file("extdata","reduced_chr12_Eggplant.fa.gz",
package="mobileRNA")
fasta_2 <-system.file("extdata","reduced_chr2_Tomato.fa.gz",
package="mobileRNA")
# define temporary output directory - replace with your directory
output_assembly_file <- file.path(tempfile("merged_assembly",
fileext = ".fa"))
# merge
merged_reference <- RNAmergeGenomes(genomeA = fasta_1,
genomeB = fasta_2,
output_file = output_assembly_file)
Align sRNA sequencing reads to the merged genome using our unique
alignment pipeline wrapped by the mapRNA()
function.
samples <- system.file("extdata/sRNAseq",package="mobileRNA")
output_location <- tempdir()
mapRNA(input = "sRNA",
input_files_dir = samples,
output_dir = output_location,
genomefile = output_assembly_file,
condaenv = "/Users/user-name/miniconda3/envs/ShortStack4",
mmap = "n")
Import the results from the alignment step into R using the
RNAimport()
function. This requires the directory storing the sample
output folders and the same of the samples to import from the directory.
# Directory containing results
results_dir <- file.path(output_location,"2_alignment_results")
# Sample names and total number of reads, in the same order.
sample_names <- c("selfgraft_demo_1", "selfgraft_demo_2", "selfgraft_demo_3",
"heterograft_demo_1", "heterograft_demo_2", "heterograft_demo_3")
sRNA_data <- RNAimport(input = "sRNA",
directory = results_dir,
samples = sample_names)
Now lets use the comprehensive dataset, load the pre-processed data:
data("sRNA_data")
For a given sRNA cluster, each replicate has determined the dicercall, also known as the sRNA class, based on the length in nucleotides of the most abundant sRNA. This can be drawn from all samples or named samples. The output can be used as threshold values for downstream analysis, and to remove data noise depending on data quality.
sRNA_data_summary <- RNAdicercall(data = sRNA_data, tidy = TRUE )
Undertake differential analysis of sRNA within the experimental design
to explore changes in abundance. The function allows for two methods;
edgeR
or DESeq
.
## sample conditions in order within dataframe
groups <- c("Selfgraft", "Selfgraft", "Selfgraft",
"Heterograft", "Heterograft", "Heterograft")
## Differential analysis of whole dataset: DESeq2 method
sRNA_DESeq2 <- RNAdifferentialAnalysis(data = sRNA_data,
group = groups,
method = "DESeq2")
# save output as txt file
write.table(sRNA_DESeq2, "./sRNA_DA_output.txt")
Summarise results:
RNAsummary(sRNA_DESeq2)
How about summarizing the sRNA population which are statistically significant:
RNAsummary(sRNA_DESeq2, alpha=0.05)
Select the putative mobile sRNA clusters using RNAmobile()
. This
requires supplying the function with a unique identifier of the
rootstock genome. The merging step placed the prefix "B" to the tomato
chromosomes.
# define control samples
controls <- c("selfgraft_1", "selfgraft_2", "selfgraft_3")
mobile_sRNA <- RNAmobile(input = "sRNA",
data = sRNA_DESeq2,
controls = controls,
genome.ID = "B",
task = "keep")
# save output as txt file
write.table(mobile_sRNA, "./sRNA_mobile_output.txt")
A data frame where rows represent potential mobile sRNA clusters. The columns include information on the cluster, individual sample replicates, and more.
Locus
: Name of the chromosome or scaffold, start position & end positionchr
: Name of the chromosome or scaffoldstart
: Start position of the clusterend
: End position of the cluster
Cluster
: Cluster NameDicercall
: The size of most abundant small RNA sizeCount
: Number of reads, default is uniquely aligned (e.g. not multi-mapping).MajorRNA
: RNA sequence of the most abundant sRNA in cluster within the sampleRPM
: Reads per MillionFPKM
: Fragments Per Kilobase of transcript per Million
DicerConsensus
: Consensus sRNA classDicerCounts
: Number of replicates which contributed to the consensus dicercall sRNA classCountMean
: Count mean (Calculated byRNAdifferentialAnalysis()
)log2FoldChange
: Log2FoldChange--The effect size estimatepvalue
: P value, the probability under the assumption of no effect or no difference, of obtaining a result equal to or more extreme than what was actually observedpadjusted
: A p-value adjustmentlogCPM
: log counts per million, measure of expression level
For the identification of mobile mRNA, the same mobileRNA
workflow can be
used. For the alignment step, as well as using a merged genome assembly file
(FASTA) the user will also need to generate a merged genome gene annotation file
(GFF) with the same chromosome labels. As well as a dataframe contain
sample data where rows represent samples, and column 1 stores the sample names,
column 2 stores the fastq file name for mate 1 and, for pair-end alignment,
column 3 stores the fastq file name for mate 2. While for all other downstream
functions simply change some of the parameters to set the option to mRNA input
data type.
The quick start analysis allows for the retrieval of a data frame the sRNA total population in the experimental design and also the candidate mobile sRNAs. Users may want to advance the analysis and plot the data:
- Exploratory and quality control analysis, such as PCA and distance matrices.
- Summary values including RPM mean and Count mean across specific samples.
- Plotting of the distribution of sRNA classes and the consensus dicercall across individual replicates or across the data set.
- Volcano plot for mRNA data
Advanced features also include tools to assist functional analysis:
- Identify genomic features associates with the sRNA clusters (ie. to explore the RNA expression of sRNA-producing genes in parallel analysis)
- Extract consensus RNA sequence for target prediction analysis
Anders, Simon, Paul Theodor Pyl, and Wolfgang Huber. 2014. “HTSeq – A Python framework to work with high-throughput sequencing data.” Bioinformatics. http://dx.doi.org/10.1093/bioinformatics/btu638.
Axtell, MJ. 2013. “ShortStack: comprehensive annotation and quantification of small RNA genes.” Rna 19: 740–51. https://doi.org/10.1093/nar/gky316.
Danecek P, Liddle J, Bonfield JK. 2021. “Twelve years of SAMtools and BCFtools.” GigaScience 10. http://dx.doi.org/10.1186/s13059-014-0550-8.
Kim, Langmead, D. 2015. “HISAT: a fast spliced aligner with low memory requirements.” Nature Methods 12. https://www.nature.com/articles/nmeth.3317.
Last updated: 18-10-2023