The model code for scFoundation
Our model is designed to process gene expression data, either from single cells or in bulk, and return the corresponding cell or gene context embeddings.
Required packages: To run the provided Python script, ensure you have the following packages installed:
argparse
numpy
pandas
os
scipy
pytorch
einops
scanpy
local_attention
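As a quick sanity check before running the script, the packages can be probed without importing them. This is a minimal sketch: the helper missing_packages is hypothetical, and the list assumes that the pytorch entry is imported under the name torch. Note that argparse and os ship with Python itself.

```python
import importlib.util

# Import names for the packages listed above (pytorch is imported as "torch").
REQUIRED = ["argparse", "numpy", "pandas", "os", "scipy", "torch",
            "einops", "scanpy", "local_attention"]


def missing_packages(names=REQUIRED):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing packages:", missing if missing else "none")
```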
- Convert the gene symbol in your data to match our list OS_scRNA_gene_index.19264.tsv.
  - For Python users, you can use the main_gene_selection function in get_embedding.py:

        # X_df represents your single cell data with cells in rows and genes in columns
        gene_list_df = pd.read_csv('../OS_scRNA_gene_index.19264.tsv', header=0, delimiter='\t')
        gene_list = list(gene_list_df['gene_name'])
        X_df, to_fill_columns, var = main_gene_selection(X_df, gene_list)

- Save your data X_df in either npy or csv format.
- Please download the model weight file via https://hopebio2020.sharepoint.com/:f:/s/PublicSharedfiles/EmUQnvZMETlDvoCaBduCNeIBQArcOrd8T8iEpiGofFZ9CQ?e=3SpPZU and put it into the models folder.
- Please download the raw gene expression example data used for inference from Figshare: https://doi.org/10.6084/m9.figshare.24049200, and then unzip it as a folder named examples.
- In the demo.sh file, we provide several scripts to infer various types of embeddings, including single cell, bulk, and gene embeddings. To run these scripts, simply copy the corresponding Python command and paste it into your command line or terminal.
  - Here's an example command for inferring cell embeddings:

        ### Cell embedding
        python get_embedding.py --task_name Baron --input_type singlecell --output_type cell --pool_type all --tgthighres a5 --data_path ./examples/enhancement/Baron_enhancement.csv --save_path ./examples/enhancement/ --pre_normalized F --version rde
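The gene-matching and saving steps listed above can be sketched without the repository code. The helper align_to_gene_list below is a hypothetical stand-in written with plain pandas; it is not the main_gene_selection function from get_embedding.py, and it assumes that genes missing from your data should be zero-filled. The toy gene names and the output file names are placeholders.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def align_to_gene_list(X_df, gene_list):
    """Reorder columns to gene_list, zero-filling genes absent from X_df.

    Hypothetical stand-in for main_gene_selection. Returns the aligned frame
    and the list of genes that had to be filled.
    """
    present = set(X_df.columns)
    to_fill = [gene for gene in gene_list if gene not in present]
    aligned = X_df.reindex(columns=gene_list, fill_value=0)
    return aligned, to_fill


# Toy data: 2 cells x 3 genes, one of which is not in the reference list.
X_df = pd.DataFrame([[1, 0, 3], [0, 2, 5]], columns=["GENE_A", "GENE_C", "GENE_X"])
gene_list = ["GENE_A", "GENE_B", "GENE_C"]  # placeholder for the 19264-gene list

aligned, to_fill = align_to_gene_list(X_df, gene_list)

# Save in either of the two formats (written to a temporary folder here).
with tempfile.TemporaryDirectory() as tmp_dir:
    npy_path = os.path.join(tmp_dir, "X_aligned.npy")
    csv_path = os.path.join(tmp_dir, "X_aligned.csv")
    np.save(npy_path, aligned.to_numpy())
    aligned.to_csv(csv_path)
    reloaded = np.load(npy_path)
```

Genes that are in your data but not in the reference list (GENE_X in the toy example) are dropped by this sketch, and the returned to_fill list tells you which reference genes were absent.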
After running these scripts, you will have embeddings for use in our downstream task analyses. To verify that the generated embeddings are consistent with those provided in the downstream task folders, refer to the check_consistency.ipynb file. Please note that due to differences in CUDA driver versions and hardware configurations, minor numerical differences might occur beyond the third decimal place.
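A consistency check of this kind can be sketched with NumPy. The snippet is not taken from check_consistency.ipynb: the function name embeddings_match and the absolute tolerance of 1e-3 are assumptions, chosen to mirror the precision remark above.

```python
import numpy as np


def embeddings_match(generated, reference, atol=1e-3):
    """Return (ok, max_abs_diff) for two embedding arrays of the same shape."""
    generated = np.asarray(generated, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    if generated.shape != reference.shape:
        raise ValueError(f"shape mismatch: {generated.shape} vs {reference.shape}")
    max_abs_diff = float(np.max(np.abs(generated - reference)))
    ok = bool(np.allclose(generated, reference, rtol=0.0, atol=atol))
    return ok, max_abs_diff


# Toy example: a reference array and a copy perturbed by 1e-5.
rng = np.random.default_rng(0)
reference = rng.normal(size=(4, 8))
generated = reference + 1e-5
ok, diff = embeddings_match(generated, reference)
print("consistent:", ok, "max abs diff:", diff)
```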
For a quick start, simply modify the --data_path argument to point to your data.
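If you prefer to launch inference from Python, the same command can be assembled programmatically. This is a sketch only: build_command is a hypothetical helper, the flags are copied from the example command above, and ./my_data.csv and ./output/ are placeholder paths.

```python
import shlex
import subprocess


def build_command(data_path, save_path, task_name="Baron"):
    """Assemble the cell-embedding command with a custom data path."""
    return [
        "python", "get_embedding.py",
        "--task_name", task_name,
        "--input_type", "singlecell",
        "--output_type", "cell",
        "--pool_type", "all",
        "--tgthighres", "a5",
        "--data_path", data_path,
        "--save_path", save_path,
        "--pre_normalized", "F",
        "--version", "rde",
    ]


cmd = build_command("./my_data.csv", "./output/")
print(shlex.join(cmd))

# Uncomment to run it from the folder that contains get_embedding.py:
# subprocess.run(cmd, check=True)
```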
Below are detailed descriptions for each argument:
- input_type: Specifies the type of input.
  - Choices: singlecell, bulk
  - Default: singlecell
- output_type: Determines the type of output.
  - Choices: cell, gene, gene_batch
  - Default: cell
  - Note: In cell mode, the output shape is (N, h), where N is the number of cells and h is the hidden dimension. In gene* modes, the output shape is (N, 19264, h), where N is the number of cells, 19264 is the number of genes, and h is the hidden dimension. In gene mode, the gene embeddings of each cell are processed individually. In gene_batch mode, all cells in your data are treated as a single batch and processed together. Ensure the number of input cells doesn't exceed 5 in this mode.
- pool_type: Defines the pooling type for cell embeddings.
  - Choices: all, max
  - Default: all
  - Note: The method of getting cell embeddings. Applicable only when output_type is set to cell.
- tgthighres: Sets the value of token T.
  - Default: t4
  - Note: Can be set in three ways: targeted high resolution, which means T=number (starting with 't'); fold change of high resolution, which means T/S=number (starting with 'f'); or addition of high resolution, which means T=S+number (starting with 'a'). Only valid when input_type is singlecell.
- pre_normalized: Controls the computation method for the S token.
  - Choices: F, T, A
  - Default: F
  - Note: When input_type is singlecell, T or F indicates whether the input gene expression data is already normalized+log1p. A means the data is normalized+log1p with the total count appended at the end, resulting in a data shape of N*19265. This mode is used for the GEARS task. For the bulk input type, F means the T and S token values are log10(sum of gene expression), while T means they are the sum without log transformation. This is useful for bulk data with few sequenced genes.
- version: Model versions for generating cell embeddings.
  - Default: ce
  - Note: Use rde for read depth enhancement and ce otherwise. Only valid when output_type is cell.
- model_path: Path to the model.
  - Default: None
- ckpt_name: Checkpoint name.
  - Default: 01B-resolution
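The documented options can be summarized as an argparse sketch. It mirrors only the arguments described above, with their stated choices and defaults; the actual parser in get_embedding.py may define more options (the example command also passes --task_name, --data_path and --save_path) or use different types. The helpers resolve_target_token and expected_output_shape are hypothetical and follow the notes literally. In particular, the scale on which S is expressed is decided by the script and is not restated here, and hidden_dim is a placeholder value.

```python
import argparse


def build_parser():
    """Parser mirroring the documented arguments (a sketch, not the real one)."""
    parser = argparse.ArgumentParser(description="Sketch of the documented options")
    parser.add_argument("--input_type", choices=["singlecell", "bulk"], default="singlecell")
    parser.add_argument("--output_type", choices=["cell", "gene", "gene_batch"], default="cell")
    parser.add_argument("--pool_type", choices=["all", "max"], default="all")
    parser.add_argument("--tgthighres", default="t4")
    parser.add_argument("--pre_normalized", choices=["F", "T", "A"], default="F")
    parser.add_argument("--version", default="ce")
    parser.add_argument("--model_path", default=None)
    parser.add_argument("--ckpt_name", default="01B-resolution")
    return parser


def resolve_target_token(tgthighres, s_value):
    """Turn a tgthighres string into T, given S, following the note literally."""
    mode, number = tgthighres[0], float(tgthighres[1:])
    if mode == "t":  # targeted: T = number
        return number
    if mode == "f":  # fold change: T / S = number
        return s_value * number
    if mode == "a":  # addition: T = S + number
        return s_value + number
    raise ValueError(f"tgthighres must start with 't', 'f' or 'a', got {tgthighres!r}")


def expected_output_shape(output_type, n_cells, hidden_dim, n_genes=19264):
    """Shape of the returned embeddings for a given output_type."""
    if output_type == "cell":
        return (n_cells, hidden_dim)
    if output_type == "gene_batch" and n_cells > 5:
        raise ValueError("gene_batch mode expects at most 5 input cells")
    return (n_cells, n_genes, hidden_dim)


args = build_parser().parse_args(["--tgthighres", "a5", "--output_type", "gene"])
print(args.tgthighres, resolve_target_token(args.tgthighres, s_value=3.0))
print(expected_output_shape(args.output_type, n_cells=2, hidden_dim=16))
```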
To finetune or integrate the scFoundation model with additional layers or models, you can refer to the example model code provided in the finetune_model.py file. The essential steps involve loading the scFoundation model and appending it with other layers as needed. Here's a snippet to get you started:
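import sys
sys.path.append("../model/")  # path to this folder
from load import *

# ckpt_path and key are supplied by you (see load.py)
pretrainmodel, pretrainconfig = load_model_frommmf(ckpt_path, key)

# Inside your own model class, reuse the pretrained components
self.token_emb = pretrainmodel.token_emb
self.pos_emb = pretrainmodel.pos_emb
self.encoder = pretrainmodel.encoder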
If you're facing GPU memory limitations, the following code allows you to finetune only a part of the scFoundation model.
for na, param in self.encoder.named_parameters():
    param.requires_grad = False
for na, param in self.encoder.transformer_encoder[-2].named_parameters():
    print('self.encoder.transformer_encoder ', na, ' have grad')
    param.requires_grad = True
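The two snippets above can be combined into a small self-contained sketch. Because the real checkpoint is not loaded here, ToyEncoder is a stand-in that only exposes an attribute named transformer_encoder to mirror the snippet; the class names, layer sizes and the classification head are all assumptions for illustration, not the contents of finetune_model.py.

```python
import torch
import torch.nn as nn


class ToyEncoder(nn.Module):
    """Stand-in for the pretrained encoder: a stack of linear layers."""

    def __init__(self, dim=16, depth=4):
        super().__init__()
        self.transformer_encoder = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(depth)]
        )

    def forward(self, x):
        for layer in self.transformer_encoder:
            x = torch.relu(layer(x))
        return x


class FinetuneModel(nn.Module):
    """Wraps an encoder and appends a task-specific head."""

    def __init__(self, encoder, dim=16, n_classes=3):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(dim, n_classes)  # hypothetical extra layer

        # Freeze the whole encoder, then unfreeze only the second-to-last layer.
        for _, param in self.encoder.named_parameters():
            param.requires_grad = False
        for _, param in self.encoder.transformer_encoder[-2].named_parameters():
            param.requires_grad = True

    def forward(self, x):
        return self.head(self.encoder(x))


model = FinetuneModel(ToyEncoder())
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
logits = model(torch.randn(2, 16))

# Hand only the trainable parameters to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
print(trainable)
```

Filtering the optimizer's parameters is optional, since frozen parameters receive no gradients, but it keeps the optimizer state limited to the layers that are actually updated.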
Once you've defined the finetuned-model class based on scFoundation, it can be incorporated into your existing training loop code. We have updated the GEARS directory, demonstrating how the scFoundation model can be seamlessly integrated and finetuned with the GEARS model.
scFoundation inference code uses and/or references the following separate libraries and packages (ordered alphabetically):
Thanks to all their contributors and maintainers!