A comprehensive analysis pipeline for disentangling cell type and cell state variation in multi-condition single-cell RNA-sequencing data using interpretable machine learning approaches.
This project implements and compares multiple computational approaches for separating cell type (stable biological identity) and cell state (dynamic condition-specific changes) in single-cell RNA-sequencing datasets. The analysis includes:
- Patches: A VAE-based approach for interpretable disentanglement of biological variation (Beker et al., 2025)
- DISCoVeR: Deep generative modeling for discovering latent representations (Slavutsky et al., 2025)
- Feature Selection Framework (FSF): Classical statistical approaches for gene selection (Wang et al., 2025)
The pipeline evaluates these methods on both simulated (Wang et al., 2025) and experimental (Kang et al., 2018) datasets, using several downstream metrics: clustering agreement (adjusted Rand index, ARI), mixing (local inverse Simpson's index, LISI), principal component regression (PCR), and mutual information (MI).
This repository is associated with the ETH Zurich semester project:
"Exploring Methods for Disentangling Cell Type and Cell State in Multi-Condition scRNA-seq Data"
Student: Michel Tarnow
Supervisors: Jiayi Wang, Prof. Dr. Mark D. Robinson
.
├── data/                     # Data storage (gitignored)
│   ├── kang/                 # Experimental data (Kang18)
│   │   ├── 00-raw/           # Raw data
│   │   ├── 01-pro/           # Processed data
│   │   ├── 02-sco/           # Gene scores
│   │   ├── 03-sel/           # Gene selections
│   │   └── 04-emb/           # Embeddings
│   └── sim/                  # Simulated data
│       └── ...               # Same structure as kang/
├── logs/                     # Log files (gitignored)
├── notebooks/                # Analysis notebooks
│   ├── python/               # Python notebooks
│   │   ├── 00-data_*.ipynb
│   │   ├── 01-patches_*.ipynb
│   │   ├── 02-discover_*.ipynb
│   │   └── 03-mutual_info_*.ipynb
│   └── r/                    # R notebooks
│       ├── 00-dat_*.qmd
│       ├── 01-sco_*.qmd
│       ├── 02-sel_*.qmd
│       └── 03-da_*.qmd
├── outs/                     # Output figures and results
│   ├── kang/
│   └── sim/
├── src/                      # Source code
│   ├── discover/             # DISCoVeR implementation
│   ├── python/               # Python utilities
│   └── r/                    # R utilities
├── environment.yml           # Conda environment for Python
├── renv.lock                 # R package dependencies
└── README.md
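Because `data/` and `logs/` are gitignored, they are absent from a fresh clone. The following is a minimal sketch, not part of the repository, that recreates the empty folder skeleton from the layout above. The function name `make_skeleton` is hypothetical, and the sketch assumes that `sim/` uses the same five stage folders as `kang/`, as the tree indicates.

```python
from pathlib import Path

# Stage folders shown under data/kang/ in the layout above.
STAGES = ["00-raw", "01-pro", "02-sco", "03-sel", "04-emb"]
# Dataset folders; sim/ is assumed to mirror kang/.
DATASETS = ["kang", "sim"]


def make_skeleton(root="."):
    """Create data/<dataset>/<stage>/ and logs/ under root.

    Returns the list of directories that now exist. Existing
    directories are left untouched.
    """
    root = Path(root)
    targets = [root / "data" / d / s for d in DATASETS for s in STAGES]
    targets.append(root / "logs")
    for path in targets:
        path.mkdir(parents=True, exist_ok=True)
    return targets
```

Calling `make_skeleton()` from the project root creates the folders in place; it does not download or generate any data.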
- Conda
- R (≥ 4.0)
- Git
- Create and activate the conda environment:
# Using conda
conda env create -f environment.yml
conda activate type-state
- Verify installation:
python -c "import scanpy, torch, pyro; print('Python environment ready!')"
The Python environment includes:
- scanpy for single-cell analysis
- pytorch and pyro for deep learning
- scladder (Patches implementation)
- scikit-learn, numpy, pandas for data processing
- matplotlib, seaborn for visualization
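If the one-line import check fails, it can help to see which package is missing. The sketch below is an illustrative helper, not part of the repository: it looks up each module without importing it. The import names are assumptions where they differ from the package names (`torch` for pytorch, `sklearn` for scikit-learn), and `scladder` is assumed to be importable under that name.

```python
from importlib.util import find_spec

# Import names for the packages listed above. These are assumptions:
# pytorch is imported as "torch", scikit-learn as "sklearn", and
# scladder is assumed to keep its package name.
MODULES = [
    "scanpy", "torch", "pyro", "scladder",
    "sklearn", "numpy", "pandas", "matplotlib", "seaborn",
]


def check_modules(modules=MODULES):
    """Return {module name: True if it can be located, else False}."""
    return {name: find_spec(name) is not None for name in modules}


def missing_modules(modules=MODULES):
    """Return the sorted names of modules that cannot be located."""
    status = check_modules(modules)
    return sorted(name for name, found in status.items() if not found)
```

An empty list from `missing_modules()` means every listed module was found; it does not confirm that the installed versions match `environment.yml`.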
This project uses renv for R package management.
- Open R in the project directory and restore packages:
# Install renv if not already installed
if (!requireNamespace("renv", quietly = TRUE)) install.packages("renv")
# Restore the R environment
renv::restore()
- Verify installation:
library(SingleCellExperiment)
library(scater)
library(tidyverse)
print("R environment ready!")
The R environment includes:
- SingleCellExperiment, scater, scran for single-cell analysis
- tidyverse for data manipulation and visualization
- Bioconductor packages for genomic analysis
- ggplot2, pheatmap, viridis for visualization
If renv::restore() encounters issues, you can manually install key packages:
install.packages(c("tidyverse", "BiocManager", "renv"))
BiocManager::install(c("SingleCellExperiment", "scater", "scran",
                       "splatter", "edgeR"))
The analysis follows a structured pipeline for both the Kang (experimental) and simulated datasets:
- Datasets are not included in this repository due to their size
- Datasets, FSF scores, and FSF selections can be obtained by running the Snakemake workflow from https://github.com/HelenaLC/type-state
- Load and preprocess raw data
- Quality control and normalization
- Export to appropriate formats
- Patches (01-patches_*.ipynb): Train VAE models with interpretable loadings
- Scores (01-sco_*.qmd): Load classical feature selection scores and compute Patches-based selection scores
- Combine scores from multiple methods
- Create gene selection strategies
- Export selections for downstream analysis
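To illustrate what combining scores and deriving a selection can look like, here is a dependency-free sketch. It is not the strategy used in the notebooks: the rank-averaging rule, the function names, and the example gene and method names are all assumptions made for illustration. Each method's scores are rescaled to ranks in [0, 1] so that methods on different scales become comparable, the rescaled scores are averaged per gene, and the top `n` genes are kept.

```python
def rank_scale(scores):
    """Map raw scores to evenly spaced ranks in [0, 1] (highest score -> 1).

    Ties are broken by gene name, so tied genes get different ranks.
    """
    ordered = sorted(scores, key=lambda gene: (scores[gene], gene))
    if len(ordered) == 1:
        return {ordered[0]: 1.0}
    return {gene: i / (len(ordered) - 1) for i, gene in enumerate(ordered)}


def combine(score_tables):
    """Average rank-scaled scores over methods.

    score_tables maps method name -> {gene: score}. Only genes scored
    by every method are kept (an assumption of this sketch).
    """
    shared = set.intersection(*(set(t) for t in score_tables.values()))
    scaled = [
        rank_scale({gene: table[gene] for gene in shared})
        for table in score_tables.values()
    ]
    return {gene: sum(s[gene] for s in scaled) / len(scaled) for gene in shared}


def select_top(combined, n):
    """Return the n genes with the highest combined score."""
    return sorted(combined, key=lambda gene: (-combined[gene], gene))[:n]


# Hypothetical scores from two hypothetical methods.
tables = {
    "method_a": {"g1": 0.9, "g2": 0.1, "g3": 0.5, "g4": 0.7},
    "method_b": {"g1": 30, "g2": 10, "g3": 20},
}
selection = select_top(combine(tables), n=2)
```

In this example `g4` is dropped because only one method scored it, and the selection contains `g1` and `g3`.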
- UMAP visualization
- Clustering evaluation (ARI)
- Mixing (LISI)
- Principal component regression (PCR)
- Mutual information analysis
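Two of the metrics above have compact definitions for discrete labels. The sketch below is a dependency-free illustration of those definitions and is not the notebooks' implementation: it computes the adjusted Rand index between two labelings (for example, cluster assignments versus annotated cell types) and the mutual information, in nats, between two label vectors. In practice, scikit-learn's `adjusted_rand_score` and `mutual_info_score` compute the same quantities. LISI and PCR are not covered here.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of cells placed together in both, in a, and in b.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. both labelings are a single cluster.
        return 1.0
    return (together_both - expected) / (maximum - expected)


def mutual_information(labels_a, labels_b):
    """Mutual information (natural log) between two discrete label vectors."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    return sum(
        (c / n) * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
```

Both functions ignore the label names themselves: relabeling clusters leaves the ARI and the mutual information unchanged, which is why they suit comparisons between clusterings and annotations.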
- Train DISCoVeR models
- Extract embeddings
- Comparative evaluation against Patches (currently at the embedding level only)