theislab/ssl_in_scg

Delineating the Effective Use of Self-Supervised Learning in Single-Cell Genomics

Repository for the paper.

System Requirements

  • Python 3.10
  • Dependencies listed in requirements.txt

Installation Guide

  1. Create a conda environment:

    conda env create -f environment.yml

  2. Activate the environment:

    conda activate ssl

  3. Install the package in development mode:

    cd directory_where_you_have_your_git_repos/ssl_in_scg
    pip install -e .

  4. Create a symlink to the storage folder for experiments:

    cd directory_where_you_have_your_git_repos/ssl_in_scg
    ln -s folder_for_experiment_storage project_folder

Demo

Large Dataset:

For large datasets, use the store-creation notebooks in the scTab repository to create a Merlin datamodule for efficient data loading.

Small Dataset or Single AnnData Object:

For small datasets or a single AnnData object, a simple PyTorch dataloader suffices; see our multiomics application for an example. A minimal example of masked pre-training on a smaller AnnData object is available in sc_mae.
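
To illustrate the idea behind masked pre-training, here is a minimal sketch of random gene masking for one cell's expression vector. It is written in plain Python for readability and is not the implementation in sc_mae: the function names, the default `mask_rate`, and the choice to compute the loss over masked genes only are all assumptions made for this example. In practice the same logic would operate on tensors inside a PyTorch dataset or training step.

```python
import random


def mask_genes(expression, mask_rate=0.5, seed=None):
    """Randomly zero out a fraction of genes in one cell's expression vector.

    Returns (masked_expression, mask), where mask[i] is True for every gene
    that was hidden. The default mask_rate is a hypothetical value.
    """
    if not 0.0 <= mask_rate <= 1.0:
        raise ValueError("mask_rate must be between 0 and 1")
    rng = random.Random(seed)
    n_genes = len(expression)
    n_masked = round(n_genes * mask_rate)
    # Draw the positions to hide without replacement.
    masked_idx = set(rng.sample(range(n_genes), n_masked))
    mask = [i in masked_idx for i in range(n_genes)]
    masked = [0.0 if hidden else float(value)
              for value, hidden in zip(expression, mask)]
    return masked, mask


def masked_reconstruction_loss(prediction, target, mask):
    """Mean squared error over the masked genes only (an assumed choice)."""
    errors = [(p - t) ** 2
              for p, t, hidden in zip(prediction, target, mask) if hidden]
    return sum(errors) / len(errors) if errors else 0.0
```

A model would receive `masked` as input and be trained to reproduce the original `expression`; `masked_reconstruction_loss` then scores only the positions the model could not see.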

Expected output:

Running the models will generate a checkpoint file with trained model parameters, saved using PyTorch Lightning's checkpointing functionality. This file can be used for inference, further training, or reproducibility.
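
As a rough sketch of how such a checkpoint can be inspected: a PyTorch Lightning checkpoint is a dictionary saved with `torch.save`, with the model parameters stored under the `state_dict` key alongside metadata such as `epoch` and `global_step`. The helper names below and the `"encoder."` prefix in the usage note are hypothetical; the actual parameter names depend on how the models in this repository are defined.

```python
def load_checkpoint(path):
    """Load a Lightning checkpoint onto the CPU (requires PyTorch)."""
    import torch  # imported lazily so the helpers below work without it

    # Recent PyTorch versions may additionally need weights_only=False for
    # checkpoints that store non-tensor objects; only do that for trusted files.
    return torch.load(path, map_location="cpu")


def summarize_checkpoint(checkpoint):
    """Return training metadata and the sorted parameter names."""
    state_dict = checkpoint.get("state_dict", {})
    return {
        "epoch": checkpoint.get("epoch"),
        "global_step": checkpoint.get("global_step"),
        "n_tensors": len(state_dict),
        "parameter_names": sorted(state_dict),
    }


def select_submodule(state_dict, prefix):
    """Keep the entries whose name starts with `prefix` and strip the prefix."""
    return {name[len(prefix):]: value
            for name, value in state_dict.items()
            if name.startswith(prefix)}
```

For example, `select_submodule(load_checkpoint(path)["state_dict"], "encoder.")` would collect the weights of a submodule named `encoder`, assuming one exists under that name, so they can be passed to that module's `load_state_dict`.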

Expected run time:

We pre-trained on a single GPU for approximately 1-2 days and fine-tuned on a single GPU for about 12-24 hours. Run time depends on, among other things, the underlying architecture, dataset, and hyperparameters, so monitor convergence rather than relying on a fixed training time.
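
One common way to watch convergence is patience-based early stopping on the validation loss. The sketch below shows the logic in plain Python; the class name and default values are assumptions for illustration, and whether the training scripts in this repository use early stopping is not stated here. PyTorch Lightning provides an `EarlyStopping` callback with `monitor` and `patience` arguments that implements the same idea.

```python
class EarlyStopper:
    """Signal a stop when the validation loss has not improved for a while."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience    # epochs to wait without improvement
        self.min_delta = min_delta  # smallest decrease that counts as progress
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Calling `step` once per validation epoch and breaking out of the training loop when it returns `True` avoids both stopping too early on a fixed schedule and training long past the point of improvement.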

Model checkpoints

Pre-trained model checkpoints are available on Hugging Face.

Retraining

Obtain the dataset from the scTab repository, or write a Merlin store for your custom data. Then set DATA_DIR in paths.py to the location of your custom dataset or of the scTab dataset. After that, follow the scripts for pre-training and fine-tuning.
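
Before launching a long run, it can help to check that the directory you set as DATA_DIR looks as expected. The sketch below is a hypothetical helper, not part of this repository, and the `train`, `val`, and `test` subfolder names are an assumption about the store layout; adjust `expected_splits` to match your data.

```python
from pathlib import Path


def check_data_dir(data_dir, expected_splits=("train", "val", "test")):
    """Return the expected split folders that are missing from data_dir.

    Raises FileNotFoundError if data_dir itself does not exist.
    """
    root = Path(data_dir).expanduser()
    if not root.is_dir():
        raise FileNotFoundError(f"DATA_DIR does not exist: {root}")
    return [split for split in expected_splits
            if not (root / split).is_dir()]
```

An empty list means every expected subfolder was found; any names returned point to splits that still need to be created or whose location needs correcting.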

Citation

If you find our work useful, please cite the following paper:

Delineating the Effective Use of Self-Supervised Learning in Single-Cell Genomics

Link to the paper

If you use the scTab data in your research, please cite the following paper:

Scaling cross-tissue single-cell annotation models

Link to the paper

Licence

self_supervision is licensed under the MIT License.

Authors

ssl_in_scg was written by Till Richter, Mojtaba Bahrami, Yufan Xia and Felix Fischer.