The package can be installed from this GitHub repository:
# Install devtools if it is not already present (needed for GitHub installation)
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
# Install the package from the GitHub repo
devtools::install_github("https://github.com/JamesOpz/splitRtools")
The splitRtools package is a collection of tools used to process
SPLiT-seq scRNA-seq data, a method first published in Rosenberg et al.,
2019.
The splitRtools package is designed to take as input the various outputs
of the zUMIs package (paper) for scRNA-seq barcode mapping and
alignment. The zUMIs package takes raw FASTQ output, assigns and filters
reads to barcodes, then maps the cDNA reads to a reference genome using
STAR, producing a CellxGene matrix as well as some reporting on the
pipeline outputs.
A sample zUMIs pipeline with configuration to work with the
Rosenberg-2019 barcode setup is available here.
The splitRtools pipeline depends on the naming of the zUMIs
barcode/read-mapping output. All zUMIs outputs for a sublibrary must be
contained within a single folder named after the zUMIs experiment name,
which is the name embedded in each zUMIs output file. The folders for
the individual sublibraries must be contained within the data_folder,
and this folder’s absolute path must be specified in the
run_split_pipe() arguments.
The other input folder is the FASTQ folder containing the raw data used
as input for the zUMIs mapping pipeline. This allows splitRtools to
count the total reads in each sublibrary and calculate several metrics
relating to the experimental sequencing depth. The absolute path for
this folder is specified in the fastq_path argument of the
run_split_pipe() function.
|
|-data_folder
|   |
|   |-sub_lib_1
|   |   |-sub_lib_1.BCstats.txt
|   |   |-zUMIs_output
|   |
|   |-sub_lib_2
|   |-sub_lib_n
|
|-fastq_path
    |
    |-sub_lib_1
    |   |-sub_lib_1_R1.fastq.gz
    |   |-sub_lib_1_R2.fastq.gz
    |
    |-sub_lib_2
    |-sub_lib_n
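Before launching the pipeline it can be useful to confirm that every sublibrary has a folder in both locations. The following is a minimal base R sketch, not part of splitRtools: the helper name `check_sublib_layout` and the sublibrary names are hypothetical, and it only checks that the folders exist, not their contents.

```r
# Hypothetical helper (not part of splitRtools): report which sublibrary
# folders are present in the zUMIs data folder and in the FASTQ folder.
check_sublib_layout <- function(data_folder, fastq_path, sublibs) {
  data.frame(
    sublib    = sublibs,
    has_zumis = dir.exists(file.path(data_folder, sublibs)),
    has_fastq = dir.exists(file.path(fastq_path, sublibs)),
    stringsAsFactors = FALSE
  )
}

# Demonstration on a throwaway directory tree containing only sub_lib_1
root <- tempfile("split_demo_")
dir.create(file.path(root, "data_folder", "sub_lib_1"), recursive = TRUE)
dir.create(file.path(root, "fastq_path", "sub_lib_1"), recursive = TRUE)

layout <- check_sublib_layout(
  data_folder = file.path(root, "data_folder"),
  fastq_path  = file.path(root, "fastq_path"),
  sublibs     = c("sub_lib_1", "sub_lib_2")
)
print(layout)
```

Any row with a `FALSE` entry points at a sublibrary folder that is missing or misnamed.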
The experiment barcoding layout must be provided as a CSV file with two
columns: well position (numeric: 1-96) and the barcode sequence in each
well. Currently splitRtools supports one barcoding layout for the RT
plate (argument rt_bc) and another for the two subsequent ligation
rounds (argument lig_bc). An example of the barcoding layout sheet
(Rosenberg 2019 format) is located in this repository in
data/barcodes_v1.csv.
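As an illustration of the two-column layout described above, the base R sketch below writes a toy layout file and checks it. The column names `well` and `barcode`, the helper `validate_barcode_layout`, and the three sequences are assumptions made for this example; consult data/barcodes_v1.csv for the headers and barcodes the package actually expects.

```r
# Hypothetical validator; the column names `well` and `barcode` are assumed.
validate_barcode_layout <- function(layout) {
  stopifnot(
    ncol(layout) == 2,
    is.numeric(layout$well),
    all(layout$well >= 1 & layout$well <= 96),   # wells are numbered 1-96
    !anyDuplicated(layout$well),                 # one barcode per well
    all(grepl("^[ACGTN]+$", layout$barcode))     # nucleotide sequences only
  )
  invisible(layout)
}

# Toy three-well layout with illustrative (not real) sequences
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("well,barcode", "1,ACGTACGT", "2,TTGGCCAA", "3,GATTACAG"),
  csv_path
)

bc_layout <- read.csv(csv_path, stringsAsFactors = FALSE)
validate_barcode_layout(bc_layout)
print(bc_layout)
```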
Similar to the barcoding layout, the sample layout for the RT barcode
indexing needs to be provided as a two-column sheet: well position and
sample_id. This enables the labelling of each cell with its sample of
origin and is specified in the argument sample_map. An example of the
sample map layout sheet is located in this repository in
data/cell_metadata.xlsx.
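The idea behind the sample map can be sketched in base R as a lookup from RT well to sample. This is an illustration only, not the package's internal code: the data frames, the column name `rt_well`, and the sample names are made up for the example.

```r
# Toy per-cell RT well assignments and a toy sample map (values illustrative)
cells <- data.frame(
  cell_id = c("cell_a", "cell_b", "cell_c"),
  rt_well = c(1, 2, 1),
  stringsAsFactors = FALSE
)
sample_map <- data.frame(
  well      = c(1, 2),
  sample_id = c("control", "treated"),
  stringsAsFactors = FALSE
)

# match() looks up each cell's RT well in the map and keeps the cell order
cells$sample_id <- sample_map$sample_id[match(cells$rt_well, sample_map$well)]
print(cells)
```

Cells whose RT well is absent from the map would receive `NA`, which makes gaps in the sample sheet easy to spot.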
The splitRtools pipeline is run through the run_split_pipe() function,
which acts as a wrapper function to execute the pipeline. A basic setup
for the pipeline is as follows (for more information on pipeline
arguments use ?run_split_pipe):
# Load splitRtools
library(splitRtools)
# Run the splitRtools pipeline
# Each sublibrary sits in its own folder inside data_folder; the folder is named after the sublibrary and must contain its zUMIs output.
run_split_pipe(
  mode = 'single', # Merge sublibraries or process separately.
  n_sublibs = 1, # How many sublibraries are present.
  data_folder = "~/experiment/hpc_outputs/", # Location of zUMIs data directory.
  output_folder = "~/experiment/pipe_output", # Output folder path.
  filtering_mode = "manual", # Filter by knee (standard) or by a manual UMI cutoff.
  filter_value = 500, # UMI cutoff used to determine intact cells (default 1000, 500 in this case).
  count_reads = FALSE, # Whether to count reads from the FASTQ files in fastq_path.
  total_reads = 22741884, # Provide read count of single sublibrary.
  fastq_path = NA, # Path to folder containing sublibrary raw FASTQ data.
  rt_bc = "~/experiment/hpc/barcode_maps/barcodes_v2_48.csv", # RT barcode map
  lig_bc = "~/experiment/hpc/barcode_maps/barcodes_v1.csv", # Ligation barcode map
  sample_map = "~/experiment/barcode_maps/exp013_cell_metadata.xlsx" # RT plate layout file
)
|
|-output_folder
    |
    |-sub_lib_1
    |   |-unfiltered_sce_h5ad_objects
    |   |-filtered_sce_h5ad_objects
    |   |-ggplot_outputs
    |   |-report_data_outputs
    |
    |-sub_lib_2
    |-sub_lib_n
    |-merged_sublibrary_data
The first stage of the pipeline converts the cell count matrix into a
SingleCellExperiment object and labels each cell, via its colData, with
a series of well IDs from each stage of the barcoding process and with
the sample that corresponds to its RT well ID in the sample_map .xlsx
file provided. This data is then stored as an SCE or an AnnData object
in the unfiltered/ output folder for each sublibrary.
The splitRtools pipeline will generate a set of diagnostic plots in
order to evaluate the initial quality of the SPLiT-seq scRNA-seq data.
After labeling, the data is filtered using either the spline-fitting
functionality of the DropletUtils package or a user-specified manual
transcript cutoff. This produces the following waterfall plot along with
quantification of the cell types recovered by sample:
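The manual cutoff can be illustrated with a small base R sketch. It is not the package's implementation: the toy matrix is invented, and treating the cutoff as inclusive (`>=`) is an assumption. The knee-based mode is not reproduced here.

```r
# Toy gene x cell UMI count matrix (values are illustrative)
counts <- matrix(
  c(300, 400,   # cell_1: 700 UMIs
     10,  20,   # cell_2: 30 UMIs
    900, 200),  # cell_3: 1100 UMIs
  nrow = 2,
  dimnames = list(c("gene_a", "gene_b"), c("cell_1", "cell_2", "cell_3"))
)

filter_value  <- 500                     # plays the role of the filter_value argument
umis_per_cell <- colSums(counts)         # total UMIs per cell barcode
keep          <- umis_per_cell >= filter_value   # assumed: cutoff is inclusive
filtered      <- counts[, keep, drop = FALSE]

# Waterfall-style view: UMI totals ranked from highest to lowest
ranked <- sort(umis_per_cell, decreasing = TRUE)
print(ranked)
print(colnames(filtered))
```

Plotting `ranked` against its rank on log-log axes gives the familiar waterfall shape, with the cutoff separating intact cells from low-count barcodes.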
The cell barcoding data is then mapped to the respective plate
locations across the three barcoding rounds to provide a series of
heatmaps displaying cells recovered per well and median UMIs per cell
across all wells:
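The per-well summaries behind such heatmaps can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package's code: the toy table is invented, and wells are assumed to be numbered row by row on an 8 x 12 plate.

```r
# Toy per-cell table: well in one barcoding round and UMI total per cell
cells <- data.frame(
  well = c(1, 1, 2, 2, 2),
  umis = c(600, 800, 500, 900, 1500)
)

cells_per_well      <- table(cells$well)                    # cells recovered per well
median_umi_per_well <- tapply(cells$umis, cells$well, median)

# Place the per-well medians on a 96-well grid (8 rows x 12 columns),
# assuming row-by-row well numbering; wells without cells stay NA
plate <- matrix(
  NA_real_, nrow = 8, ncol = 12,
  dimnames = list(LETTERS[1:8], as.character(1:12))
)
wells <- as.integer(names(median_umi_per_well))
plate[cbind((wells - 1) %/% 12 + 1, (wells - 1) %% 12 + 1)] <- median_umi_per_well
print(plate[1:2, 1:3])
```

A matrix in this shape can be passed to any heatmap function; the same approach applies to each of the three barcoding rounds.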