scGeneScope

scGeneScope: A Perturbationally-Paired Single Cell Imaging and Transcriptomics Dataset and Benchmark for Treatment Response Modeling

Installation

Install poetry 1.8.5

curl -sSL https://install.python-poetry.org | python3 - --version 1.8.5

Clone the repo and cd into the directory:

git clone git@github.com:altoslabs/scGeneScope.git
cd scGeneScope

Installation with poetry native virtual environment

Make sure Python 3.11 is available on your PATH (how to do this is system dependent).

If python 3.11 is not available, you can install it with pyenv or create a conda env with python 3.11 and point poetry to that env with poetry env use /path/to/conda/env/bin/python.

Run poetry shell to create/activate a virtual environment.

Install the dependencies

poetry install --with dev

To activate the project environment:

Go to the project folder and run poetry shell

Downloading the embedding data and original data from HuggingFace

Set the required env variables.

export HF_HUB_ENABLE_HF_TRANSFER=1
export HF_TOKEN={your_token_here}

For each model, download the two h5ad files that contain its embeddings into the data directory. For example, for the PCA embeddings:

cd <path-to-repo-folder>
huggingface-cli download altoslabs/scGeneScope features/rnaseq/pca/n2000/round_1.h5ad --repo-type dataset --local-dir ./data/ --local-dir-use-symlinks False
huggingface-cli download altoslabs/scGeneScope features/rnaseq/pca/n2000/round_2.h5ad --repo-type dataset --local-dir ./data/ --local-dir-use-symlinks False

Available embeddings:

features/rnaseq/pca/n2000
features/rnaseq/scvi/n200
features/rnaseq/scvi/scvi_1
features/rnaseq/scvi/scvi_2
features/rnaseq/geneformer
features/rnaseq/scgpt
features/rnaseq/UCE/4layer
features/imaging/imagenet/vit-l
features/imaging/imagenet/vit-h
features/imaging/imagenet/resnet50
features/imaging/imagenet/resnet152
features/imaging/imagenet/resnet50_clip
features/imaging/imagenet/vit-h_clip
features/imaging/imagenet/openphenom
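The same download can be scripted from Python. The sketch below is not part of the repository: embedding_files and download_embedding are hypothetical helper names, and it assumes that every embedding folder listed above contains round_1.h5ad and round_2.h5ad, as the PCA example does. It relies on hf_hub_download from the huggingface_hub package, which picks up HF_TOKEN from the environment.

```python
from pathlib import Path

REPO_ID = "altoslabs/scGeneScope"


def embedding_files(embedding: str) -> list[str]:
    """Return the repo-relative paths of the two h5ad files for one embedding.

    Assumption: every embedding folder follows the round_1/round_2 naming
    shown for features/rnaseq/pca/n2000.
    """
    prefix = embedding.strip("/")
    return [f"{prefix}/round_{i}.h5ad" for i in (1, 2)]


def download_embedding(embedding: str, local_dir: str = "./data") -> list[Path]:
    """Download both h5ad files of one embedding into local_dir."""
    # Imported lazily so embedding_files() works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return [
        Path(
            hf_hub_download(
                repo_id=REPO_ID,
                filename=name,
                repo_type="dataset",
                local_dir=local_dir,
            )
        )
        for name in embedding_files(embedding)
    ]


# Example (performs a network download):
# download_embedding("features/rnaseq/scgpt")
```

If an embedding folder turns out to use different file names, only embedding_files needs to change.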

Download the original imaging and scRNAseq data. Note: all paper results and scGeneScope operations can be reproduced from the precomputed embeddings above. Users can download the original imaging and scRNAseq data to generate new embeddings or run new experiments on the raw data.

To download the scRNAseq data (~80G), run:

./scripts/download_scRNAseq_data.sh

To download a single plate of the imaging data (~173M, 3405 files):

./scripts/download_single_imaging_plate.sh

To download all of the imaging data (~186G, ~4,200,000 files):

./scripts/download_all_imaging_data.sh

Caution: the full imaging data download takes a long time because of the large number of files.

Usage

Hydra Training Script

For end-to-end model training, we provide a training script integrated with the Hydra configuration system. The script is located at src/scgenescope/scripts/train.py and can be executed as follows:

cd <path-to-repo-folder>
python src/scgenescope/scripts/train.py <config-options>

The configuration options are discussed in the Configuration System section. train.py is also installed into the environment as an executable script, so the above command can be shortened to:

train <config-options>

You can run an experiment as follows, replacing the experiment name with any experiment available under src/scgenescope/config/experiment:

train experiment=rnaseq/singleprofile/train_on_scgpt
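When launching several runs from a script, it can help to assemble the train command programmatically. The following is a minimal sketch, not part of the repository: build_train_command and run_training are hypothetical helpers, and it assumes the train executable is on the PATH of the active environment. By default it only prints the command instead of running it.

```python
import shlex
import subprocess


def build_train_command(experiment=None, overrides=None):
    """Build the argument list for the train entry point.

    experiment: name of a config under src/scgenescope/config/experiment.
    overrides:  mapping of dotted config keys to values.
    """
    cmd = ["train"]
    if experiment is not None:
        cmd.append(f"experiment={experiment}")
    for key, value in (overrides or {}).items():
        cmd.append(f"{key}={value}")
    return cmd


def run_training(experiment, overrides=None, dry_run=True):
    """Print the command (dry run) or execute it and return its exit code."""
    cmd = build_train_command(experiment, overrides)
    if dry_run:
        print(shlex.join(cmd))
        return 0
    # check=True raises CalledProcessError if training exits with an error.
    return subprocess.run(cmd, check=True).returncode


# Example:
# run_training("rnaseq/singleprofile/train_on_scgpt", {"trainer.max_epoch": 100})
```

Passing the arguments as a list avoids shell quoting issues; shlex.join is used only to display the command.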

Configuration System

This repo uses Hydra for configuration management. The configuration setup and default settings are stored in src/scgenescope/config. This folder and the files it contains should not be modified unless the model codebase changes (for example, when adding new models, datasets, or loggers).

Below we describe potential workflows to use the configuration system to scale your experimentation:

Override any configuration parameter from the command line

train trainer.max_epoch=100 model.classifier.hidden_dim=1024 model.classifier.depth=5

This command sets cfg.trainer.max_epoch to 100, cfg.model.classifier.hidden_dim to 1024, and cfg.model.classifier.depth to 5.

Add any additional parameters that were not defined in the configuration system

python train.py +trainer.max_steps=5000

This adds a max_steps field, set to 5000, to the trainer configuration.
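The difference between a plain override and a + override can be illustrated with nested dictionaries. This is a simplified pure-Python model of the semantics, not Hydra's implementation: apply_override is a hypothetical helper, and the default values in the example config are made up for illustration. As in Hydra, a plain key=value must target an existing key, while +key=value must introduce a new one.

```python
import ast
import copy


def parse_value(text):
    """Interpret numbers and other Python literals; fall back to a plain string."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text


def apply_override(cfg, override):
    """Return a copy of cfg with one 'a.b.c=value' or '+a.b.c=value' applied."""
    result = copy.deepcopy(cfg)
    adding = override.startswith("+")
    body = override[1:] if adding else override
    dotted_key, _, raw_value = body.partition("=")
    *parents, leaf = dotted_key.split(".")

    node = result
    for part in parents:
        if part not in node:
            if not adding:
                raise KeyError(f"cannot override missing key: {dotted_key}")
            node[part] = {}
        node = node[part]

    if adding and leaf in node:
        raise KeyError(f"key already exists, drop the '+': {dotted_key}")
    if not adding and leaf not in node:
        raise KeyError(f"key not in config, prefix with '+': {dotted_key}")

    node[leaf] = parse_value(raw_value)
    return result


# Hypothetical defaults, for illustration only.
cfg = {
    "trainer": {"max_epoch": 10},
    "model": {"classifier": {"hidden_dim": 512, "depth": 3}},
}
cfg = apply_override(cfg, "trainer.max_epoch=100")    # existing key: override
cfg = apply_override(cfg, "+trainer.max_steps=5000")  # new key: add with '+'
```

In this model, passing trainer.max_steps=5000 without the + prefix raises an error because the key is not yet defined in the config.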

Acknowledgments

Some of the Hydra configs and utility functions are inspired by the lightning-hydra-template repo (https://github.com/ashleve/lightning-hydra-template).
