This is the official codebase for scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI.
!UPDATE: We have released several new pretrained scGPT checkpoints. Please see the Pretrained scGPT checkpoints section for more details.
scGPT works with Python >= 3.7 and R >= 3.6.1. Please make sure the correct versions of Python and R are installed before installing scGPT.
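As a quick sanity check before installing, the Python requirement above can be verified from the interpreter itself. The following is a minimal sketch, not part of scGPT; the helper name `python_is_supported` is made up for illustration, and the R version has to be checked separately.

```python
import sys

# Minimum Python version stated in this README.
MIN_PYTHON = (3, 7)


def python_is_supported(version_info=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`.

    `python_is_supported` is a hypothetical helper for illustration only.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    if python_is_supported():
        print("Python version is supported.")
    else:
        print("Python >= 3.7 is required.")
```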
scGPT is available on PyPI. To install scGPT, run the following command:
$ pip install scgpt
[Optional] We recommend using wandb for logging and visualization.
$ pip install wandb
For development, we use the Poetry package manager. To install Poetry, follow the instructions here.
$ git clone this-repo-url
$ cd scGPT
$ poetry install
Note: The flash-attn dependency usually requires a specific GPU and CUDA version. If you encounter any issues, please refer to the flash-attn repository for installation instructions. As of May 2023, we recommend using CUDA 11.7 and flash-attn<1.0.5 because of reported issues with installing newer versions of flash-attn.
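To see whether an existing environment matches the flash-attn<1.0.5 recommendation, one option is to compare the installed version string against that bound. The sketch below is an illustration under stated assumptions, not part of scGPT: the helper names are hypothetical, the version parsing is deliberately simplified (it only reads the first three numeric components), and the lookup relies on `importlib.metadata`, which is only available on Python 3.8 and later.

```python
import re


def parse_version(version):
    """Turn a version string such as '1.0.4' into a tuple of ints, e.g. (1, 0, 4).

    Simplified on purpose: only the first three numeric components are kept,
    and any local suffix after '+' is ignored.
    """
    numbers = re.findall(r"\d+", version.split("+")[0])
    return tuple(int(n) for n in numbers[:3])


def flash_attn_is_recommended(version, upper_bound="1.0.5"):
    """Return True if `version` is strictly below the bound recommended above."""
    return parse_version(version) < parse_version(upper_bound)


def installed_version(package):
    """Return the installed version of `package`, or None if it cannot be found."""
    try:
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:  # Python 3.7 has no importlib.metadata
        return None
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("flash-attn")
    if found is None:
        print("flash-attn is not installed (or could not be detected).")
    else:
        print(found, "within recommended range:", flash_attn_is_recommended(found))
```

This check does not cover the CUDA version; that still has to be confirmed separately for your GPU setup.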
Here is the list of pretrained models. Use the links in the table to download the checkpoint folders. We recommend using the whole-human model for most applications by default. If your fine-tuning dataset shares a similar cell type context with the training data of an organ-specific model, that model can usually deliver competitive performance as well.
Model name | Description | Download |
---|---|---|
whole-human (recommended) | Pretrained on 33 million normal human cells. | link |
brain | Pretrained on 13.2 million brain cells. | link |
blood | Pretrained on 10.3 million blood and bone marrow cells. | link |
heart | Pretrained on 1.8 million heart cells. | link |
lung | Pretrained on 2.1 million lung cells. | link |
kidney | Pretrained on 814 thousand kidney cells. | link |
pan-cancer | Pretrained on 5.7 million cells of various cancer types. | link |
Please see our example code in examples/finetune_integration.py. By default, the script assumes the scGPT checkpoint folder is stored in the examples/save directory.
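Before running the example, it can help to confirm that a downloaded checkpoint folder is actually in place. The following is a minimal sketch, not part of the scGPT codebase: the helper `missing_checkpoint_files` is hypothetical, the folder name `my_checkpoint` is a placeholder, and the expected file names are an assumption that should be checked against the contents of the folder you downloaded.

```python
from pathlib import Path

# Assumed file names inside a checkpoint folder; verify against your download.
EXPECTED_FILES = ("args.json", "best_model.pt", "vocab.json")


def missing_checkpoint_files(checkpoint_dir, expected=EXPECTED_FILES):
    """Return the names from `expected` that are not files in `checkpoint_dir`."""
    folder = Path(checkpoint_dir)
    return [name for name in expected if not (folder / name).is_file()]


if __name__ == "__main__":
    # "my_checkpoint" is a placeholder; use the name of the folder you downloaded.
    checkpoint_dir = Path("examples") / "save" / "my_checkpoint"
    missing = missing_checkpoint_files(checkpoint_dir)
    if missing:
        print("Missing from", checkpoint_dir, ":", ", ".join(missing))
    else:
        print("Checkpoint folder looks complete:", checkpoint_dir)
```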
- Upload the pretrained model checkpoint
- Publish to pypi
- Provide the pretraining code with generative attention masking
- Finetuning examples for multi-omics integration, cell type annotation, perturbation prediction, cell generation
- Example code for Gene Regulatory Network analysis
- Documentation website with readthedocs
- Bump up to pytorch 2.0
- New pretraining on larger datasets
- Reference mapping example
- Publish to huggingface model hub
We greatly welcome contributions to scGPT. Please submit a pull request if you have any ideas or bug fixes. You are also welcome to open an issue for any problem you encounter while using scGPT.
We sincerely thank the authors of the following open-source projects:
@article{cui2023scGPT,
title={scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI},
author={Cui, Haotian and Wang, Chloe and Maan, Hassaan and Pang, Kuan and Luo, Fengning and Wang, Bo},
journal={bioRxiv},
year={2023},
publisher={Cold Spring Harbor Laboratory}
}