A series of R notebooks for analysing single cell RNA sequencing data.
This collection of R notebooks is designed to guide you through processing and analysing your single cell RNA (scRNA) sequencing data. The notebooks are intended to be worked through in the following order:
- Quality control
- Doublet detection
- Dataset integration
- Cell annotation
- Pseudobulking and differential gene expression analysis
- Pathway enrichment analysis
Each notebook explains what is happening in each step, complete with code and rationales for the choices we have made in our approach.
It is important to note that there is no single correct way to pre-process scRNA data: there are as many approaches as there are software packages and libraries for scRNA analysis, multiplied by the near-limitless ways of combining their tools and functions.
The workflow presented in these notebooks is the synthesis of best practices, studies, and discussions of how to analyse scRNA data with a focus on using Seurat in R. Footnotes and external links accompany the text throughout the document - please view these for useful additional information and rationale on why steps are done in certain ways.
This content primarily uses the Seurat R package, but the order and manner in which steps are run differ substantially from Seurat's own tutorials. It is worth noting that the Satija lab Seurat tutorials are instructions on how to use the package, not on how to conduct robust scRNA pre-processing and analysis. This content leverages the flexibility of the Seurat package, supplemented by the practices outlined in existing resources. These resources were the most influential:
- Current best practices in single-cell RNA-seq analysis: a tutorial (Luecken and Theis, 2019)
- scRNAseq analysis in R with Seurat (Williams and Perlaza, 2024)
- Spatial Sampler (Williams, 2025)
The workflow is split into six sections: quality control, doublet detection, dataset integration, cell annotation, pseudobulking and differential gene expression, and functional enrichment analysis.
This workflow has been designed to be run after initial pre-processing with the nf-core/scrnaseq Nextflow pipeline. That pipeline takes the raw sequencing data and performs genome alignment and counting of reads/UMIs per gene and per cell. It outputs each sample's count matrices and metadata in multiple formats, including an R data file (.Rds) containing a Seurat data object. Your data must be in this format to begin working through these notebooks. Importantly, this workflow expects the filtered Seurat data, which has had non-cell barcodes removed. For version 4.0.0 of the nf-core/scrnaseq pipeline (the latest version at the time of publishing these notebooks), these .Rds files can be found in the results/ output directory at results/cellranger/mtx_conversions/<SAMPLE_ID>/<SAMPLE_ID>_filtered_matrix.seurat.rds.
The first notebook in this workflow takes the input .Rds files containing your pre-processed Seurat data - one file per sample - and performs some basic quality control analyses to detect and remove low quality cells. This is an interactive process where you will be required to select thresholds for filtering. The notebook includes some interactive plots and figures to help guide your decisions in this process. We also perform initial normalisation and transformation of your count data to account for library size differences between samples.
The output from this first stage is a new series of .Rds files - again, one per sample - containing your filtered data.
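As a rough sketch of the kind of filtering this notebook performs, consider the following. The input filename and the thresholds shown (500 detected genes, 10% mitochondrial reads) are purely illustrative; the notebook's interactive plots help you choose values suited to your own data.

```r
# Sketch only: filename and thresholds are illustrative assumptions
library(Seurat)

seu <- readRDS("sample1_filtered_matrix.seurat.rds")

# Flag mitochondrial content per cell ("^MT-" assumes human gene symbols)
seu[["percent.mt"]] <- PercentageFeatureSet(seu, pattern = "^MT-")

# Drop low-quality cells, then normalise for library size differences
seu <- subset(seu, subset = nFeature_RNA > 500 & percent.mt < 10)
seu <- NormalizeData(seu)
```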
The second notebook takes the output files from the quality control notebook and works through identifying doublets in your data. The cell capture process for single cell sequencing is not perfect and can result in multiple cells being captured together and given the same cellular barcode. This typically only affects a small proportion of the cell barcodes, and methods are available to detect these doublet and multiplet barcodes. In this notebook, we use the R library DoubletFinder for this purpose. By default, we remove these doublets from your data, as they will confound your results, although we give you the option of leaving them in and simply having them annotated as such.
The output from this stage is another series of .Rds files containing the doublet-free (or doublet-annotated) data - one file per sample.
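The DoubletFinder step looks roughly like the sketch below. The pK value, the assumed ~7.5% doublet formation rate, and the input filename are illustrative only; the notebook estimates suitable values from your data. Note also that the main function has been renamed across DoubletFinder releases (`doubletFinder_v3()` in older versions).

```r
# Sketch only: pK, the ~7.5% doublet rate, and the filename are assumptions.
# The object must already be normalised, scaled, and have PCA run.
library(Seurat)
library(DoubletFinder)

seu <- readRDS("sample1_qc.rds")  # hypothetical QC-stage output

# Expected number of doublets at the assumed formation rate
n_exp <- round(0.075 * ncol(seu))

# Annotate predicted doublets (doubletFinder_v3() in older releases)
seu <- doubletFinder(seu, PCs = 1:10, pN = 0.25, pK = 0.09,
                     nExp = n_exp, sct = FALSE)
```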
The third notebook in this workflow takes the output of the doublet detection stage and performs dataset integration. This is a vital step that merges your data into a single object and helps to account for batch effects between samples. Without this step, you may find that cells will form clusters based solely on the sample they are from rather than by true biological differences. These batch effects will confound your analyses and make interpretation of results difficult or impossible.
In addition, this notebook works through a final round of data transformation and normalisation, followed by dimensionality reduction and cell clustering. These normalised data and clusters will be used in all downstream analyses.
The output from this stage is a single .Rds file containing the merged and integrated Seurat dataset. All downstream analyses will be performed on this single dataset.
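A hedged sketch of the merge-integrate-cluster sequence, using Seurat v5-style calls; the integration method (CCA) and the parameter values shown are illustrative, not necessarily the notebook's exact choices.

```r
# Seurat v5-style sketch; method and parameters are illustrative
library(Seurat)

merged <- merge(sample1, y = list(sample2, sample3))
merged <- NormalizeData(merged)
merged <- FindVariableFeatures(merged)
merged <- ScaleData(merged)
merged <- RunPCA(merged)

# Correct batch effects by integrating across samples
merged <- IntegrateLayers(merged, method = CCAIntegration,
                          new.reduction = "integrated.cca")

# Cluster and embed on the integrated reduction
merged <- FindNeighbors(merged, reduction = "integrated.cca", dims = 1:30)
merged <- FindClusters(merged, resolution = 0.5)
merged <- RunUMAP(merged, reduction = "integrated.cca", dims = 1:30)
```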
The fourth notebook in this workflow takes the merged, integrated, and normalised data from the previous notebook and performs cell annotation.
We automatically annotate cells by cell cycle and cell type using public databases. We also provide you with an opportunity to supply curated marker gene lists for cell types that you are interested in, which we use to score and annotate your cells with.
The output from this notebook is an .Rds file containing your annotated single cell data.
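Marker-based scoring can be sketched as follows; the marker genes and the score name are hypothetical placeholders for your own curated lists.

```r
# Sketch only: the genes and the name "TCellScore" are made-up placeholders
library(Seurat)

markers <- list(TCell = c("CD3D", "CD3E", "CD8A"))

# Score every cell against the marker set; scores are added to meta.data
seu <- AddModuleScore(seu, features = markers, name = "TCellScore")
head(seu$TCellScore1)  # AddModuleScore appends an index to the name
```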
After annotation, we perform pseudobulking, which sums together the counts from all cells within a cluster and treats the cluster like a single sample in a bulk RNA sequencing analysis. This has some important advantages, primarily allowing us to use existing bulk RNA sequencing tools and simpler, higher-powered statistical tests for analysing your data.
The pseudobulked data is then used to perform differential gene expression analysis and pathway enrichment analysis.
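The summation at the heart of pseudobulking can be illustrated in base R with a toy count matrix (all names here are made up); in a Seurat workflow, a helper such as `AggregateExpression()` performs this aggregation for you.

```r
# Toy illustration: sum raw counts over all cells sharing a
# sample/cluster label. Gene, cell, and group names are made up.
counts <- matrix(1:12, nrow = 3,
                 dimnames = list(paste0("gene", 1:3), paste0("cell", 1:4)))
groups <- c("A_c1", "A_c1", "B_c1", "B_c1")  # sample_cluster label per cell

# rowsum() sums the rows of t(counts) - i.e. cells - within each group
pseudobulk <- t(rowsum(t(counts), group = groups))
pseudobulk
#       A_c1 B_c1
# gene1    5   17
# gene2    7   19
# gene3    9   21
```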
The outputs from this notebook are:
- An .Rds file containing your pseudobulked data
- An .Rds file containing your differential expression results
The final notebook runs functional enrichment analysis (FEA), which helps to identify common pathways or gene sets that are enriched for differentially expressed genes. We run two forms of FEA: over-representation analysis (ORA) and gene set enrichment analysis (GSEA), which differ in their methodologies and provide two complementary analyses of the gene sets that are enriched in your data.
The output from this notebook is a collection of web reports summarising the pathway enrichment analyses.
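The two analyses can be sketched with clusterProfiler, one common R choice for FEA (the notebook's actual tooling may differ); `deg` stands in for a hypothetical differential expression results table with gene, padj, and log2FC columns.

```r
# Sketch only: clusterProfiler usage shown as one common option;
# `deg` is a hypothetical differential expression results table.
library(clusterProfiler)
library(org.Hs.eg.db)

# ORA: do significant genes over-represent any GO Biological Process term?
sig_genes <- deg$gene[deg$padj < 0.05]
ora <- enrichGO(gene = sig_genes, OrgDb = org.Hs.eg.db,
                keyType = "SYMBOL", ont = "BP")

# GSEA: rank *all* genes instead of applying a hard significance cutoff
ranks <- sort(setNames(deg$log2FC, deg$gene), decreasing = TRUE)
gsea <- gseGO(geneList = ranks, OrgDb = org.Hs.eg.db,
              keyType = "SYMBOL", ont = "BP")
```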
The notebooks are written in the Quarto format - a format very closely related to R markdown. This format allows code to be interspersed with human-friendly text that explains what we are doing at each step. It also allows you to generate a styled HTML document at the end to save a record of the analyses you have run.
We recommend using RStudio to run each notebook within the notebooks/ directory, and running the notebooks in Visual mode, as this helps to distinguish between text and code blocks and reduces distraction from formatting markup.
Each chunk must be run sequentially. This ensures reproducibility and that objects saved in your R environment do not get mixed up.
At the end of each notebook, we also recommend restarting your R session to clear large objects from the workspace.
Some chunks will require your input for setting parameters that will be unique to your data.
At other points, we will generate template files within the inputs/ directory that you will need to edit in order to proceed.
In both cases, the notebooks will use the following alerts to let you know when an action is required:
❱❱❱ ACTION ❰❰❰
- This is an example of an action that you will need to complete before proceeding.
Single cell sequencing data is typically quite large, and processing more than a handful of samples can quickly require more computing resources than your typical laptop or desktop computer will have.
While these notebooks will work on your local computer, we have designed them with high-performance computing environments in mind. We recommend using a cloud- or HPC-hosted RStudio server to run these notebooks. We have tested the notebooks successfully on NCI's Australian Research Environment (ARE) - a web-based interface to the Gadi HPC, with the ability to run an RStudio server with the resources necessary to process large numbers of samples together.
For this reason, we also recommend running these notebooks only on Unix-like systems (e.g. Linux and macOS). They should also run on Windows, but they haven't been fully tested on that platform, so you may run into unexpected issues. Most HPC and cloud environments are Linux-based, so the notebooks will run well on those platforms.
Once you have run through all of the notebooks, you can render ("Knit") everything into a human-friendly HTML document. You can do this by running the following command in a terminal, within the top-level project directory:
```
quarto render
```
When rendering, the notebooks will avoid running expensive operations and will instead use the saved data objects created when running the notebooks interactively. This ensures that they render quickly and efficiently.
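A minimal sketch of this caching pattern; the object, file, and function names here are illustrative.

```r
# Sketch only: names are illustrative
rds_path <- "outputs/integrated.rds"

if (file.exists(rds_path)) {
  # Fast path taken during `quarto render`: reuse the saved object
  seu <- readRDS(rds_path)
} else {
  # Slow path taken when running interactively for the first time
  seu <- run_expensive_integration()  # hypothetical long-running step
  saveRDS(seu, rds_path)
}
```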
These notebooks are based on the R programming language and use a number of bioinformatics R packages, in particular Seurat for single cell sequencing analysis. We have provided an R script in this repository at install/install.R which will install all the required packages.
The notebooks have been tested with R version 4.4.2. Other versions of R may work but are untested at this stage. We recommend using this version of R if possible.
If you are running the notebooks locally on a desktop or laptop computer, you can simply run the installation R script like so:
```
Rscript install/install.R
```

On Mac, this install script should run without any other prerequisites.
If you are installing the libraries on Windows, you will additionally need to install RTools prior to running the installation script.
Please note that local computers may struggle to handle larger datasets due to the large memory requirements, especially during the doublet detection and integration steps. If running locally, we recommend having at least 32GB of memory available.
The notebooks are intended to be run on NCI's Australian Research Environment (ARE) platform, as this provides a way of running a web-based interactive R session on a high-performance computing system using an RStudio server. However, installing the required R packages on this system can be a little tricky, so we have pre-installed the necessary libraries on the if89 NCI project. This is the Australian BioCommons Tools and Workflows project, which hosts a range of common bioinformatics tools, reference datasets, containers, and workflows that are available to all NCI users. If you are already an NCI user, you may request access to this project via the NCI web portal. You can find more information about this project on the Australian BioCommons GitHub Pages site.
The pre-installed R libraries are located in the if89 gdata storage: /g/data/if89/R/scrna-analysis.
If you would prefer to install the required packages on NCI yourself, into your own directories, we have also provided a bash script for this purpose: install/install_nci.sh. To use this script, you will first need to log in to NCI's Gadi:
```
# Replace "user" with your NCI username
ssh user@gadi.nci.org.au
```

Next, clone this repository to a convenient location. This will also be where you run the notebooks, so it is a good idea to choose a location with a large amount of storage space. We recommend using the scratch filesystem as a temporary location for running these notebooks.
```
# Replace "project" with your NCI project code
# or choose another location
cd /scratch/project/
git clone https://github.com/Sydney-Informatics-Hub/scrna-analysis.git
cd scrna-analysis/install
```

The installation script can be run interactively on the login node with the following command:
```
./install_nci.sh
```

By default, it performs a dry run of the installation, telling you where the R libraries will be installed. It also prints the R_LIBS_USER environment variable definition you will need later on to run the notebooks (see Running on ARE below). The command to set this variable is also saved in a new file called install/setenv.sh.
```
R libraries will be installed to the following path:

/g/data/project/R/scrna-analysis/4.4

When running the notebooks, you will need to set the R_LIBS_USER environment variable to this path:

R_LIBS_USER=/g/data/project/R/scrna-analysis/4.4

*** DRY RUN ONLY ***

To submit the installation job to the cluster, run this script again with the --submit flag, or run the following command:

qsub -P project -l storage=gdata/project+scratch/project -v PREFIX='/g/data/project' install_nci.submit.sh
```

```
cat setenv.sh
R_LIBS_USER=/g/data/project/R/scrna-analysis/4.4
```

Note that by default, the installation path will be /g/data/project/R/scrna-analysis/4.4, where project is your default NCI project code. You can override the project by using the --project parameter:
```
./install_nci.sh --project ab01
```

```
R libraries will be installed to the following path:

/g/data/ab01/R/scrna-analysis/4.4

When running the notebooks, you will need to set the R_LIBS_USER environment variable to this path:

R_LIBS_USER=/g/data/ab01/R/scrna-analysis/4.4

*** DRY RUN ONLY ***

To submit the installation job to the cluster, run this script again with the --submit flag, or run the following command:

qsub -P ab01 -l storage=gdata/ab01+scratch/ab01 -v PREFIX='/g/data/ab01' install_nci.submit.sh
```

You can also select a different installation prefix with the --prefix parameter. The installation path will always be ${PREFIX}/R/scrna-analysis/4.4:
```
./install_nci.sh --project ab01 --prefix /scratch/ab01
```

```
R libraries will be installed to the following path:

/scratch/ab01/R/scrna-analysis/4.4

When running the notebooks, you will need to set the R_LIBS_USER environment variable to this path:

R_LIBS_USER=/scratch/ab01/R/scrna-analysis/4.4

*** DRY RUN ONLY ***

To submit the installation job to the cluster, run this script again with the --submit flag, or run the following command:

qsub -P ab01 -l storage=gdata/ab01+scratch/ab01 -v PREFIX='/scratch/ab01' install_nci.submit.sh
```

Once you are ready to submit the installation job to the cluster, add the --submit flag:

```
./install_nci.sh --project ab01 --prefix /scratch/ab01 --submit
```

The installation process may take ~2 hours to complete. Once finished, inspect the output logs to ensure all packages were correctly installed. A successful run should complete with an exit status of 0. You can check this on NCI by looking at the resource usage summary that NCI appends to the standard output log file, which will be named like install_nci.submit.sh.o<JOBID>, where JOBID is the numeric ID given to the job when it ran. For example:
```
tail -n 12 install_nci.submit.sh.o123456789
```

```
======================================================================================
              Resource Usage on 2025-01-01 12:00:00:
   Job Id:             123456789.gadi-pbs
   Project:            ab01
   Exit Status:        0
   Service Units:      13.36
   NCPUs Requested:    1                      NCPUs Used: 1
   CPU Time Used:      02:05:06
   Memory Requested:   8.0GB                  Memory Used: 3.36GB
   Walltime requested: 04:00:00               Walltime Used: 03:20:25
   JobFS requested:    64.0GB                 JobFS used: 687.52MB
======================================================================================
```

It is possible that you received a non-zero exit status due to warning messages that can be safely ignored. To double-check that the required packages were installed correctly, you can run the test_installation.R script. This will try to load all the required packages and print a success or failure message at the end:

```
Rscript --vanilla test_installation.R
```

On success:
```
...
=== REQUIRED PACKAGES WERE SUCCESSFULLY INSTALLED ===
```
On failure:
```
...
=== REQUIRED PACKAGES FAILED TO INSTALL CORRECTLY ===
```
If the packages did not successfully install, you can try running the script again. Packages that were already installed will be skipped, so the install time will be considerably shorter. Also double-check the following:

- You have enough storage allocation for your install location.
  - By default, the script installs to the gdata allocation of your default NCI project, or your scratch allocation if you don't have a gdata allocation.
- You have enough compute allocation for the installation.
  - The installation process is very light on resources - on the order of 10 SU - so this shouldn't be a problem unless you have already exceeded your quarterly quota.
- The job didn't run over the walltime allocation.
  - The script has 4 hours of walltime allocated to it, which should be more than enough to complete the process. You can check how long the job took by inspecting the resource usage summary as above. If the Walltime Used value is more than 4 hours, you can increase the walltime allocation by following the instructions below under Updating Installation Resource Allocations.
- The job didn't require more memory than requested.
  - The script has 8GB of memory allocated to it, which should also be enough to complete the installation. Check the resource usage summary to ensure this wasn't exceeded. If it was, you can update the memory request. See Updating Installation Resource Allocations below for more details.
If your installation job failed because it ran out of resources (e.g. walltime or memory), you can manually update the installation script's header section to request more resources. The actual installation script - install_nci.submit.sh - has several lines at the top that start with #PBS. These lines are read by NCI's PBS scheduler software to determine the resources to give to the job.
```
#!/bin/bash
#PBS -q copyq
#PBS -l mem=8GB
#PBS -l jobfs=64GB
#PBS -l walltime=04:00:00
#PBS -l wd
```

If you need more memory, you can change the line #PBS -l mem=8GB, e.g.:
```
#!/bin/bash
#PBS -q copyq
#PBS -l mem=16GB
#PBS -l jobfs=64GB
#PBS -l walltime=04:00:00
#PBS -l wd
```

Similarly, you can increase the requested walltime by updating the line #PBS -l walltime=04:00:00:
```
#!/bin/bash
#PBS -q copyq
#PBS -l mem=8GB
#PBS -l jobfs=64GB
#PBS -l walltime=08:00:00
#PBS -l wd
```

Here we provide step-by-step instructions specifically for running these notebooks on NCI's ARE platform. This assumes you have already installed all the required R packages (or are using the provided libraries on the if89 project) and cloned the repository to a convenient location on Gadi, following the instructions above in Installation on NCI.
First, in a web browser, navigate to are.nci.org.au. Follow the prompts to log in using your NCI credentials.
On the main ARE dashboard, under "All Apps", select "RStudio". Do not select "RStudio (Rocker image)", as this is an older version of the RStudio app and isn't supported by these notebooks.
On the new page that appears, you will be presented with a number of parameters to configure for your RStudio session. There is also a checkbox labelled "Show advanced settings", which you will need to select.
Use the table below to fill in the required parameters. If you don't see the input box for the parameter, ensure you have selected "Show advanced settings" first.
| Parameter | Value | Notes |
|---|---|---|
| Walltime (hours) | 8 | It is better to request more than you will need, as you won't be charged for time that isn't used. |
| Queue | normalbw | |
| Compute Size | Custom (cpus=1 mem=256G) | Some of the steps in these notebooks require a lot of resources, so we recommend using a custom compute size with 1 CPU and 256GB of memory. These notebooks are single-threaded, so more than 1 CPU is unnecessary, and 256GB is the maximum amount of memory you can request for a node on the normalbw queue. |
| Project | Your NCI project code | |
| Storage | gdata/project+scratch/project+gdata/if89 | Replace project with your NCI project code. If you installed the R libraries yourself rather than using the if89 pre-installed libraries, you can omit the final +gdata/if89 specification. |
| Modules | R/4.4.2 gcc/14.2.0 | These notebooks are based on R version 4.4.2 and should only be used with this version. Additionally, the pre-installed packages in if89 use this version of R. The notebooks also require the gcc/14.2.0 module to be loaded. |
| Environment variables | R_LIBS_USER="/path/to/scrna-analysis/libraries",XDG_DATA_HOME="/scratch/<PROJECT>/<USER>/.local/share" | There are two environment variables to set here: R_LIBS_USER tells R where to find the libraries needed to run the notebooks, while XDG_DATA_HOME is used by RStudio to store various files and data. Separate them with a comma. These variables are explained in further detail below. |
The two environment variables that need to be set are R_LIBS_USER and XDG_DATA_HOME.
The value for R_LIBS_USER will vary depending on whether you are using the if89 pre-installed R libraries or if you ran install/install_nci.sh. If using the if89 libraries, this should be set to the path /g/data/if89/R/scrna-analysis. If you installed the libraries yourself, use the path that was saved inside install/setenv.sh when you ran the installation script (see Installation on NCI above).
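Once your session has started, you can quickly confirm from the R console that the variable took effect; the path you set should appear in R's library search path.

```r
# Run in the R console of your ARE RStudio session
Sys.getenv("R_LIBS_USER")  # should print the path you configured
.libPaths()                # that path should appear in the search path
```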
Additionally, you will want to set the XDG_DATA_HOME environment variable. By default, RStudio will place working files and data in your home directory under ~/.local/share, but on NCI your home directory has limited storage space and will quickly fill up. Instead, we recommend setting this variable to somewhere in the NCI scratch space, e.g. /scratch/ab01/usr012/.local/share, assuming a project ID of ab01 and a user ID of usr012.
Once set up, your settings should look something like this:
We recommend saving your settings so that you can quickly start a new session in the future. At the bottom of the page, click the checkbox labelled "Save settings". In the box below that, type a name for your saved settings and click "Save settings and close". This will take you to a new page with a list of your saved settings. At the top right of this list is a play button arrow. Click this to launch a new session of RStudio with your saved settings.
You will be brought to a new page that shows the status of your session. It will start out as "Queued", but within a few minutes it should show the status as "Starting" and then "Running". Once running, a button will appear labelled "Connect to RStudio Server". Click this to open RStudio in a new browser tab.
Within RStudio, you can use the file browser at the lower right side to navigate to where you cloned the repository and start working through the notebooks.
You can access your saved settings anytime by going to the My Interactive Sessions page in the ARE dashboard. Under "Saved Settings" you should see the name you gave your settings. Clicking this link brings you back to the page where you can launch your session.













