Multi-resolution consensus clustering evaluation

Introduction

This repository contains a pipeline for comparing single-cell RNA sequencing (scRNA-seq) clustering methods used to evaluation the multi-resolution consensus clustering method (MRCC). The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It was based on the nf-core template (although most of the bells and whistles have been removed).

Directory structure

bin/ - Scripts for stages of the pipeline
conf/ - Configuration files
datasets/ - Scripts for creating dataset input files. The created files should also be placed here for the pipeline to run.
envs/ - Conda environment YAML files for pipeline stages
workflows/ - NextFlow files specifying the pipeline
LICENSE - MIT license
main.nf - The main NextFlow pipeline file
nextflow.config - The main NextFlow configuration file
README.md - This README

Pipeline summary

The pipeline follows a standard benchmarking format of datasets, methods and metrics. Each dataset is prepared and passed to each method and the results of the method are scored using the metrics. This is done in a combinational way so the number of stages is roughly number of datasets x number of methods x number of metrics.

Datasets

The pipeline uses a selection of real and simulated datasets for evaluation. Scripts for creating the datasets are available in datasets/ but are not part of the pipeline and need to be run in advance.

Real datasets

The pipeline uses several real datasets for evaluation, each of these has pre-existing curated labels. We have selected a single sample from each dataset in order to avoid needing to consider batch effects when running the methods.

Azimuth kidney (CZIKidney763280) - CZIKidney763280 sample from the Azimuth example kidney query dataset
Azimuth lung (Dropseq-2) - Dropseq-2 sample from the Azimuth example lung query dataset
Azimuth bone marrow (batch2) - batch2 sample from the Azimuth example bone marrow query dataset
Azimuth mouse motor cortex (352357) - 352357 sample from the Azimuth example mouse motor cortex query dataset
COMBAT (S00052-Ja005E-PBCa) - S00052-Ja005E-PBCa sample from the COvid-19 Multi-omics Blood ATlas (COMBAT) study
GCA (F73-FPIL-0-SC-1) - F73-FPIL-0-SC-1 sample from the Gut Cell Atlas (GCA)
NeurIPS (site2-donor1) - site2-donor1 sample from the Open Problems in Single-cell Analysis NeurIPS 2021 multimodal integration challenge

Simulated datasets

Simulated datasets were generated using the {splatter} package and the scripts in datasets/.

Simulation (blob) - Simulation with one cell group (i.e. a single cluster)
Simulation (groups) - Simulation with multiple cell groups (i.e. multiple clusters)
Simulation (path) - Simulation with a continuous transition between two cell types
Simulation (rare) - Simulation with multiple cell groups where some of those groups have low occurrences (1-5%)

Methods

Both standard scRNA-seq processing workflows and scRNA-seq consensus clustering methods were selected for comparison.

MRCC - This is the primary method evaluated by the pipeline. It is run in two stages. First multiple clusterings are performed using a standard method. Second those clusterings are combined using the newly developed consensus approach. This allows to us to test multiple parameter sets for the combining stage without having to repeat the more computationally intensive clustering stage see Usage.
Random - Random assignment of labels as a negative control. The number of labels is the same as the number of labels in the dataset.
SC3 - Single-Cell Consensus Clustering. A consensus clustering method designed for scRNA-seq data. The dataset is clustered multiple times using k-means on different distance metrics and transformations of the data.
Scanpy - Scanpy is the most used Python toolbox for scRNA-seq analysis. The standard Scanpy workflow makes use of graph-based clustering and is comparable to the Seurat workflow.
Seurat - Seurat is the most used R toolbox for scRNA-seq analysis. The standard workflow makes use of graph-based clustering and is comparable to the Scanpy workflow.
SIMLR - Consensus clustering based on optimising different distance kernels.

Metrics

Metrics are divided into two categories: unsupervised metrics which compare clustering assignments to the ground truth labels but do not require them to be matched and supervised metrics which treat the task as a classification problem and require clustering assignments to be matched to the ground truth labels.

Unsupervised metrics

Adjusted Mutual Information
Adjusted Rand Index
Completeness score
Element-Centric Clustering Similarity (Implemented in the ClustAssess package)
Fowlkes-Mallows Index
Homogeneity score

Supervised metrics

Setup

Install Nextflow (>=21.10.3)
Install Conda
Download the pipeline by cloning the repository or as a ZIP file
Run the scripts in datasets/ to create input dataset files

Usage

The pipeline can be run using:

nextflow run main.nf

By default this will just run a small test dataset. To run on the full datasets a parameters file needs to be provided (see Parameters). For example, to run on all datasets using the provided parameters file use:

nextflow run main.nf -params-file conf/all-datasets.yml

To run the pipeline on a high-performance computing system with a submission queue you need to supply a profile configuration. An example for the HMGU slurm cluster is provided but you should refer to the NextFlow docs for how to design a profile for your system.

nextflow run main.nf -profile hmgu-slurm

Parameters

The parameters file can be used to define both datasets and parameter sets for the MRCC method.

Datasets

Datasets are defined using the following YAML:

input:
    - name: Dataset1       # Shouldn't have spaces, other unusual symbols
      file: Dataset1.h5ad  # Path to a H5AD file containing the dataset
      labels: Labels       # Name of the `.obs` column containing cell labels
    - name: Dataset2
      file: Dataset2.h5ad
      labels: CellLabels

MRCC parameters

Parameters sets for the MRCC method are defined using the following YAML. See the bin/method-mrcc.py script for more description of the parameters.

mrcc:
    - name: MRCC_N_Le_SR_1_MR_1    # Name of the parameter set
      graph_type: neighbour        # Method for building the multi-resolution graph
      community_type: leiden       # Community detection method
      single_resolution: 1         # Community detection resolution for single-resolution graphs
      multi_resolution: 1          # Community detection resolution for the multi-resolution graph
    - name: MRCC_A_Le_SR_1_MR_1
      graph_type: all
      community_type: leiden
      single_resolution: 1
      multi_resolution: 1

Output

Output of the pipeline will be created in the results/ directory. This includes the clustering output from each method, the metric scores and some basic summary plots. The pipeline trace (runtime etc.) is also available in the pipeline_trace/ directory.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Multi-resolution consensus clustering evaluation

Introduction

Directory structure

Pipeline summary

Datasets

Real datasets

Simulated datasets

Methods

Metrics

Unsupervised metrics

Supervised metrics

Setup

Usage

Parameters

Datasets

MRCC parameters

Output

About

Releases

Packages

Contributors 2

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 103 Commits
bin		bin
conf		conf
datasets		datasets
envs		envs
workflows		workflows
.editorconfig		.editorconfig
.gitattributes		.gitattributes
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE		LICENSE
README.md		README.md
main.nf		main.nf
nextflow.config		nextflow.config

License

theislab/consensus-clustering-evaluation

Folders and files

Latest commit

History

Repository files navigation

Multi-resolution consensus clustering evaluation

Introduction

Directory structure

Pipeline summary

Datasets

Real datasets

Simulated datasets

Methods

Metrics

Unsupervised metrics

Supervised metrics

Setup

Usage

Parameters

Datasets

MRCC parameters

Output

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages