
TAPE-Lab/splitRtools


splitRtools: Preprocessing tools for SPLiT-seq data

License: MIT

Welcome to the splitRtools package!

This package is under active development, and not all functionality has been validated yet.

The package may change significantly during development.

⏬ Installation

The package can be installed from this GitHub repository:

# Install devtools (needed for GitHub installation) if not present
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")

# Install package from GitHub repo
devtools::install_github("https://github.com/JamesOpz/splitRtools")

Overview

The splitRtools package is a collection of tools for processing SPLiT-seq scRNA-seq data, a method first published in Rosenberg et al., 2019.

The splitRtools package takes as input the outputs of the zUMIs package for scRNA-seq barcode mapping and alignment. zUMIs takes the raw FASTQ output, assigns and filters reads to barcodes, then maps the cDNA reads to a reference genome using STAR, producing a cell-by-gene count matrix along with reports on the pipeline outputs.

A sample zUMIs pipeline with configuration to work with the Rosenberg-2019 barcode setup is available here.

Running the splitRtools pipeline

Data input directory structure

data_folder

The splitRtools pipeline depends on the naming of the zUMIs barcode and read-mapping outputs. All zUMIs outputs for a given sublibrary must be contained in a folder with the same name as the zUMIs experiment name, which is the name embedded in each zUMIs output file. The zUMIs sublibrary output folder must also carry this zUMIs experiment name. The folders for the individual sublibraries must be contained within the data_folder, and this folder's absolute path must be specified in the run_split_pipe() arguments.

fastq_path

The other input folder is the FASTQ folder containing the raw data used as input for the zUMIs mapping pipeline. This allows the pipeline to count the total reads in each sublibrary and derive several metrics relating to the experimental sequencing depth. The absolute path of this folder is specified in the fastq_path argument of the run_split_pipe() function.
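As a rough illustration of what counting reads from a FASTQ file involves, the sketch below counts records in a gzipped FASTQ file using base R only. The helper name count_fastq_reads() is hypothetical and not part of splitRtools, and the sketch assumes standard four-line FASTQ records; the package's own counting may work differently.

```r
# Hypothetical helper (not part of splitRtools): count reads in a gzipped FASTQ file.
# Assumes each record spans exactly four lines (no wrapped sequence lines).
count_fastq_reads <- function(path) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  n_lines <- 0
  # Read in chunks so large files do not need to fit in memory
  repeat {
    chunk <- readLines(con, n = 1e6)
    if (length(chunk) == 0) break
    n_lines <- n_lines + length(chunk)
  }
  n_lines / 4
}

# Toy example: write two records to a temporary gzipped file and count them
tmp <- tempfile(fileext = ".fastq.gz")
out <- gzfile(tmp, open = "wt")
writeLines(c("@read1", "ACGT", "+", "IIII",
             "@read2", "TTGA", "+", "IIII"), out)
close(out)
n_reads <- count_fastq_reads(tmp)
```

A count obtained this way could be supplied through total_reads when count_reads is set to FALSE.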

File input structure

|
|-data_folder
|          |
|          |-sub_lib_1
|          |       |-sub_lib_1.BCstats.txt
|          |       |-zUMIs_output
|          |
|          |-sub_lib_2
|          |-sub_lib_n
|
|-fastq_path
          |
          |-sub_lib_1
          |       |-sub_lib_1_R1.fastq.gz
          |       |-sub_lib_1_R2.fastq.gz
          |
          |-sub_lib_2
          |-sub_lib_n
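Before running the pipeline it can help to confirm that the layout above is in place. The following is a minimal sketch, assuming only what the tree shows: each sublibrary folder contains a file named after it with the suffix .BCstats.txt. The helper check_sublib_layout() is hypothetical and not part of splitRtools.

```r
# Hypothetical helper (not part of splitRtools): report which sublibrary
# folders contain a <sublib_name>.BCstats.txt file, as in the tree above.
check_sublib_layout <- function(data_folder) {
  sublibs <- list.dirs(data_folder, full.names = FALSE, recursive = FALSE)
  expected <- file.path(data_folder, sublibs, paste0(sublibs, ".BCstats.txt"))
  data.frame(sublib = sublibs,
             has_bcstats = file.exists(expected),
             stringsAsFactors = FALSE)
}

# Toy layout in a temporary directory: sub_lib_1 is complete, sub_lib_2 is not
data_folder <- tempfile("data_folder_")
dir.create(file.path(data_folder, "sub_lib_1", "zUMIs_output"), recursive = TRUE)
dir.create(file.path(data_folder, "sub_lib_2"), recursive = TRUE)
file.create(file.path(data_folder, "sub_lib_1", "sub_lib_1.BCstats.txt"))

layout <- check_sublib_layout(data_folder)
```

The returned data frame has one row per sublibrary folder, so any folder with has_bcstats equal to FALSE points to a naming mismatch worth fixing before calling run_split_pipe().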

Barcode maps

The experiment barcoding layout must be provided as a CSV file with two columns: the well position (numeric, 1-96) and the barcode sequence in that well. Currently splitRtools supports one barcoding layout for the RT plate (argument rt_bc) and another for the two subsequent ligation rounds (argument lig_bc). An example of the barcoding layout sheet (Rosenberg 2019 format) is located in this repository in data/barcodes_v1.csv.
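The two-column format described above can be sanity-checked before use. The sketch below is illustrative only: the column names well and barcode, the presence of a header row, the uniqueness checks, and the toy sequences are all assumptions, and validate_barcode_map() is a hypothetical helper rather than a splitRtools function.

```r
# Hypothetical validator (not part of splitRtools) for a two-column barcode map.
# Assumes column 1 holds well positions and column 2 holds barcode sequences.
validate_barcode_map <- function(bc) {
  stopifnot(
    ncol(bc) == 2,
    is.numeric(bc[[1]]),
    all(bc[[1]] %in% 1:96),              # well positions must lie in 1-96
    !anyDuplicated(bc[[1]]),             # assumption: one row per well
    !anyDuplicated(bc[[2]]),             # assumption: barcodes are unique
    all(grepl("^[ACGTN]+$", bc[[2]]))    # sequences contain nucleotide letters only
  )
  invisible(TRUE)
}

# Toy map written to a temporary CSV (sequences are made up for illustration)
tmp <- tempfile(fileext = ".csv")
writeLines(c("well,barcode", "1,AACGTGAT", "2,AAACATCG", "3,ATGCCTAA"), tmp)
bc_map <- read.csv(tmp, stringsAsFactors = FALSE)
bc_ok <- validate_barcode_map(bc_map)
```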

Sample maps

Similar to the barcoding layout, the sample layout for the RT barcode indexing must be provided as two columns: well position and sample_id. This enables each cell to be labelled with its sample of origin and is specified in the argument sample_map. An example of the sample map layout sheet is located in this repository in data/cell_metadata.xlsx.
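To show how a well-to-sample map can label cells, here is a minimal base R sketch. The data frames, the column names well and rt_well, and the sample names are made up for illustration; the package's internal labelling may be implemented differently.

```r
# Toy sample map: RT wells 1-4 split across two samples (names are made up)
sample_map <- data.frame(well = 1:4,
                         sample_id = c("ctrl", "ctrl", "treated", "treated"),
                         stringsAsFactors = FALSE)

# Toy per-cell table recording the RT well each cell was barcoded in
cells <- data.frame(cell = c("cell_a", "cell_b", "cell_c"),
                    rt_well = c(3, 1, 4),
                    stringsAsFactors = FALSE)

# Look up each cell's sample by matching its RT well against the map
cells$sample_id <- sample_map$sample_id[match(cells$rt_well, sample_map$well)]
```

Cells whose RT well is missing from the map would receive NA here, which makes gaps in the sample sheet easy to spot.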

Executing the pipeline

The splitRtools pipeline is run through the run_split_pipe() function, which acts as a wrapper to execute the whole pipeline. For more information on the pipeline arguments, use ?run_split_pipe. A basic setup is as follows:

# Load splitRtools
library(splitRtools)

# Run the splitRtools pipeline
# Each sublibrary is contained within its own folder in data_folder and must contain zUMIs output, named by sublibrary name.
run_split_pipe(mode = 'single', # Merge sublibraries or process separately
               n_sublibs = 1, # Number of sublibraries present
               data_folder = "~/experiment/hpc_outputs/", # Location of zUMIs data directory
               output_folder = "~/experiment/pipe_output", # Output folder path
               filtering_mode = "manual", # Filter by knee (standard) or by a manual transcript cutoff (default 1000; 500 in this case)
               filter_value = 500, # UMI cutoff used to identify intact cells
               count_reads = FALSE, # Whether to count reads from the FASTQ files in fastq_path
               total_reads = 22741884, # Read count of the single sublibrary, provided manually
               fastq_path = NA, # Path to folder containing sublibrary raw FASTQ data
               rt_bc = "~/experiment/hpc/barcode_maps/barcodes_v2_48.csv", # RT barcode map
               lig_bc = "~/experiment/hpc/barcode_maps/barcodes_v1.csv", # Ligation barcode map
               sample_map = "~/experiment/barcode_maps/exp013_cell_metadata.xlsx" # RT plate layout file
)

Pipeline outputs

Output directory structure

|
|-output_folder
          |
          |-sub_lib_1
          |       |-unfiltered_sce_h5ad_objects
          |       |-filtered_sce_h5ad_ojects
          |       |-ggplot_outputs
          |       |-report_data_outputs
          |
          |-sub_lib_2
          |-sub_lib_n
          |-merged_sublibrary_data

Output data

The first stage of the pipeline converts the cell count matrix into a SingleCellExperiment (SCE) object. Each cell is labelled, via colData, with a series of well IDs for each stage of the barcoding process, and with its sample, based on the correspondence between the RT well IDs and the sample_map .xlsx file provided. This data is then stored as an SCE or AnnData object in the unfiltered output folder for each sublibrary.

Diagnostic plots

The splitRtools pipeline will generate a set of diagnostic plots in order to evaluate the initial quality of the SPLiT-seq scRNA-seq data.

After labelling, the data is filtered using either the spline-fitting functionality of the DropletUtils package or a user-specified manual transcript cutoff. This produces the following waterfall plot, along with quantification of the cell types recovered by sample:
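The manual cutoff can be pictured as a threshold on total UMI counts per cell. The sketch below uses a toy gene-by-cell matrix in base R. Whether the package treats the cutoff as inclusive is an assumption here, and the knee-based mode is not reproduced.

```r
# Toy gene-by-cell UMI count matrix (genes in rows, cells in columns)
counts <- matrix(c(400, 300,   # cell_1: 700 UMIs in total
                   100,  50,   # cell_2: 150 UMIs in total
                   250, 250),  # cell_3: 500 UMIs in total
                 nrow = 2,
                 dimnames = list(c("gene_a", "gene_b"),
                                 c("cell_1", "cell_2", "cell_3")))

filter_value <- 500  # same value as in the run_split_pipe() example

# Total UMIs per cell; keep cells at or above the cutoff (inclusive by assumption)
umi_per_cell <- colSums(counts)
keep <- umi_per_cell >= filter_value
filtered <- counts[, keep, drop = FALSE]
```

Sorting umi_per_cell in decreasing order and plotting it against rank on log axes gives the kind of waterfall curve on which a knee or a manual threshold is usually judged.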




The barcoding cell data is then mapped to the respective plate locations across the 3 barcoding rounds to provide a series of heatmaps displaying cells recovered per well and median UMI per cell across all wells:
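The per-well summaries behind such heatmaps can be sketched in base R as follows. The toy data, the column names, and the row-major mapping of well numbers 1-96 onto an 8 x 12 plate are assumptions made for illustration, not the package's documented behaviour.

```r
# Toy per-cell table: barcoding well and total UMI count for each cell
cells <- data.frame(well = c(1, 1, 2, 2, 2),
                    umi  = c(600, 800, 500, 900, 1000))

# Cells recovered per well and median UMI per cell in each well
cells_per_well <- table(cells$well)
median_umi_per_well <- tapply(cells$umi, cells$well, median)

# Arrange the cell counts on an 8 x 12 plate
# (assumption: wells are numbered row by row, so well 2 is row A, column 2)
plate_vec <- integer(96)
plate_vec[as.integer(names(cells_per_well))] <- as.integer(cells_per_well)
plate <- matrix(plate_vec, nrow = 8, ncol = 12, byrow = TRUE,
                dimnames = list(LETTERS[1:8], as.character(1:12)))
```

A matrix like plate can be passed to any heatmap function, with empty wells showing as zeros.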
