A GPU-accelerated toolkit for large-scale scRNA-seq pipelines.
Highlights • Why ScaleSC • Installation • Tutorial • API Reference
- Fast scRNA-seq pipeline including QC, normalization, batch-effect removal, and dimension reduction, with a syntax similar to `scanpy` and `rapids-singlecell` (a sketch follows this list).
- Scales to datasets with more than 10M cells on a single GPU (A100 80G).
- Chunks the data to avoid the `int32` limitation in `cupyx.scipy.sparse` (used by `rapids-singlecell`), which otherwise blocks computation on moderate-size datasets (~1.3M cells) without multi-GPU support.
- Reconciles the output of each step with `scanpy` to reproduce the same results as on the CPU end.
- Improves `harmonypy` so that datasets with more than 10M cells and more than 1000 samples can run on a single GPU.
- Speeds up and optimizes the `NSForest` algorithm on GPU for better marker gene identification.
- Merges clusters according to the gene expression of markers detected by `NSForest`.
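The whole pipeline reads like scanpy. As a minimal sketch, assuming the integrated pipeline class is exported as `scalesc.ScaleSC` and a hypothetical dataset folder (method names are taken from the API reference below):

```python
import scalesc

# `ScaleSC` as the export name is an assumption based on the package name.
scsc = scalesc.ScaleSC(data_dir='./data/my_dataset', output_dir='./results')

scsc.calculate_qc_metrics()
scsc.filter_genes_and_cells(min_counts_per_gene=3, min_counts_per_cell=200)
scsc.normalize_log1p(target_sum=1e4)
scsc.highly_variable_genes(n_top_genes=4000)
scsc.pca(n_components=50)
scsc.harmony(sample_col_name='sample')  # batch-effect removal; obs column name assumed
scsc.neighbors(n_neighbors=20, n_pcs=50, use_rep='X_pca_harmony')
scsc.leiden(resolution=0.5)
scsc.umap()
scsc.save(data_name='my_dataset_processed')
```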
| | Scanpy | ScaleSC | Rapids-singlecell |
|---|---|---|---|
| GPU Support | ❌ | ✅ | ✅ |
| Avoids `int32` Issue in Sparse | ✅ | ✅ | ❌ |
| Upper Limit of #cells | 5M | ~20M | ~1M |
| Upper Limit of #samples | <100 | >1000 | <100 |
Requirements:
- RAPIDS from NVIDIA
- rapids-singlecell, an alternative to scanpy that uses the GPU for acceleration
- Conda, version >= 22.11 strongly encouraged, since conda-libmamba-solver is then the default solver, which significantly speeds up dependency resolution
- pip, the Python package installer
Environment Setup:

1. Install RAPIDS through Conda:

   ```bash
   conda create -n ScaleSC -c rapidsai -c conda-forge -c nvidia rapids=25.02 python=3.12 'cuda-version>=12.0,<=12.8'
   ```

   Users have the flexibility to install it according to their systems by using this online selector. We highly recommend installing **RAPIDS** >= 24.12, which fixes a bug in the Leiden algorithm that resulted in too many clusters.

2. Activate the conda env:

   ```bash
   conda activate ScaleSC
   ```

3. Install rapids-singlecell using pip:

   ```bash
   pip install rapids-singlecell
   ```

4. Install ScaleSC:

   ```bash
   # pull ScaleSC from GitHub
   git clone https://github.com/interactivereport/ScaleSC.git
   # enter the folder and install ScaleSC
   cd ScaleSC
   pip install .
   ```

5. Check the env:

   ```bash
   python -c "import scalesc; print(scalesc.__version__)"     # == 0.1.0
   python -c "import cupy; print(cupy.__version__)"           # >= 13.3.0
   python -c "import cuml; print(cuml.__version__)"           # >= 24.10
   python -c "import cupy; print(cupy.cuda.is_available())"   # True
   python -c "import xgboost; print(xgboost.__version__)"     # >= 2.1.1, optional, for marker annotation
   ```

   See this tutorial for details.
Please cite ScaleSC, as well as Scanpy, Rapids-singlecell, NSForest, and AnnData, according to their respective instructions.
- 2/26/2025:
  - Added a parameter `threshold` to the function `adata_cluster_merge` to support cluster merging at various scales according to the user's specification. `threshold` ranges from 0 to 1 and is set to 0 by default (a sketch follows this changelog).
  - Updated the tutorial with a few more examples of cluster merging.
- Future work: add support for loading from large `.h5ad` files.
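For the new `threshold` parameter, a hedged sketch (the function name comes from the changelog above; the rest of the call is a guess, so check the tutorial for the exact usage):

```python
import scalesc

# `adata` is assumed to be a clustered AnnData produced by the pipeline.
# `threshold` controls the scale of merging (valid range 0-1, default 0).
merged = scalesc.adata_cluster_merge(adata, threshold=0.3)
```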
ScaleSC integrated pipeline in a scanpy-like style.

It automatically loads the dataset in chunks (see `scalesc.util.AnnDataBatchReader` for details), and all methods in this class operate on the chunked data.
Args:
- `data_dir` (`str`): Data folder of the dataset.
- `max_cell_batch` (`int`): Maximum number of cells in a single batch. Default: `100000`.
- `preload_on_cpu` (`bool`): Whether to load the entire chunked data on the CPU. Default: `True`.
- `preload_on_gpu` (`bool`): Whether to load the entire chunked data on the GPU; `preload_on_cpu` is overwritten to `True` when this is set to `True`. Default: `True`.
- `save_raw_counts` (`bool`): Whether to save `adata_X` to disk after QC filtering. Default: `False`.
- `save_norm_counts` (`bool`): Whether to save `adata_X` to disk after normalization. Default: `False`.
- `save_after_each_step` (`bool`): Whether to save `adata` (without `.X`) to disk after each step. Default: `False`.
- `output_dir` (`str`): Output folder. Default: `'./results'`.
- `gpus` (`list`): List of GPU IDs; `[0]` is used if `None`. Default: `None`.
```python
__init__(
    data_dir,
    max_cell_batch=100000.0,
    preload_on_cpu=True,
    preload_on_gpu=True,
    save_raw_counts=False,
    save_norm_counts=False,
    save_after_each_step=False,
    output_dir='results',
    gpus=None
)
```
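For example, a construction sketch using the documented defaults (the import path is an assumption):

```python
from scalesc import ScaleSC  # assumed export path

scsc = ScaleSC(
    data_dir='./data/my_dataset',  # hypothetical dataset folder
    max_cell_batch=100000,
    preload_on_cpu=True,
    preload_on_gpu=True,   # keeps chunks on GPU and forces preload_on_cpu=True
    save_raw_counts=False,
    save_norm_counts=False,
    save_after_each_step=False,
    output_dir='./results',
    gpus=[0],              # single GPU; defaults to [0] when None
)
```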
`adata` (`AnnData`): An AnnData object used to store all intermediate results, without the count matrix.

Note: This is always kept on the CPU.

`adata_X` (`AnnData`): An AnnData object used to store all intermediate results, including the count matrix.

Note: Internally, all chunks should be merged on the CPU to avoid high GPU memory consumption; make sure to invoke `to_CPU()` before accessing this object.
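In practice that means moving the chunks back to host memory before touching this object; a short sketch, continuing the `scsc` instance from above:

```python
scsc.to_CPU()              # merge/keep all chunks in host memory first
adata_full = scsc.adata_X  # AnnData including the merged count matrix
```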
`calculate_qc_metrics()`

Calculate quality control metrics.

`clear()`

Clean up the memory.

`filter_cells(min_count=0, max_count=None, qc_var='n_genes_by_counts', qc=False)`

Filter cells based on a QC metric.
Args:
- `min_count` (`int`): Minimum count of the QC metric required for a cell to pass filtering.
- `max_count` (`int`): Maximum count of the QC metric allowed for a cell to pass filtering.
- `qc_var` (`str`=`'n_genes_by_counts'`): Feature in the QC metrics used to filter cells.
- `qc` (`bool`=`False`): Call `calculate_qc_metrics` before filtering.
`filter_genes(min_count=0, max_count=None, qc_var='n_cells_by_counts', qc=False)`

Filter genes based on a QC metric.

Args:
- `min_count` (`int`): Minimum count of the QC metric required for a gene to pass filtering.
- `max_count` (`int`): Maximum count of the QC metric allowed for a gene to pass filtering.
- `qc_var` (`str`=`'n_cells_by_counts'`): Feature in the QC metrics used to filter genes.
- `qc` (`bool`=`False`): Call `calculate_qc_metrics` before filtering.
```python
filter_genes_and_cells(
    min_counts_per_gene=0,
    min_counts_per_cell=0,
    max_counts_per_gene=None,
    max_counts_per_cell=None,
    qc_var_gene='n_cells_by_counts',
    qc_var_cell='n_genes_by_counts',
    qc=False
)
```

Filter genes and cells based on QC metrics.

Note: This is an efficient way to perform regular filtering on genes and cells without repeatedly iterating over chunks (see the example below).

Args:
- `min_counts_per_gene` (`int`): Minimum count of the QC metric required for a gene to pass filtering.
- `max_counts_per_gene` (`int`): Maximum count of the QC metric allowed for a gene to pass filtering.
- `qc_var_gene` (`str`=`'n_cells_by_counts'`): Feature in the QC metrics used to filter genes.
- `min_counts_per_cell` (`int`): Minimum count of the QC metric required for a cell to pass filtering.
- `max_counts_per_cell` (`int`): Maximum count of the QC metric allowed for a cell to pass filtering.
- `qc_var_cell` (`str`=`'n_genes_by_counts'`): Feature in the QC metrics used to filter cells.
- `qc` (`bool`=`False`): Call `calculate_qc_metrics` before filtering.
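A typical combined call, with illustrative thresholds:

```python
# One pass over the chunks filters genes and cells together; the
# thresholds below are illustrative, not recommendations.
scsc.filter_genes_and_cells(
    min_counts_per_gene=3,    # keep genes counted in at least 3 cells
    min_counts_per_cell=200,  # keep cells expressing at least 200 genes
    qc=True,                  # compute QC metrics before filtering
)
```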
`harmony(sample_col_name, n_init=10, max_iter_harmony=20)`

Use Harmony to integrate different experiments.

Note: This modified Harmony implementation easily scales up to 15M cells with 50 PCs on a single GPU (A100 80G). The result is stored in `adata.obsm['X_pca_harmony']`.

Args:
- `sample_col_name` (`str`): Column of the sample ID.
- `n_init` (`int`=`10`): Number of times the k-means algorithm is run with different centroid seeds.
- `max_iter_harmony` (`int`=`20`): Maximum number of Harmony iterations.
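For example (the `obs` column name is an assumption about your dataset):

```python
# 'sample_id' is an assumed column in adata.obs identifying each experiment.
scsc.harmony(sample_col_name='sample_id', n_init=10, max_iter_harmony=20)
corrected = scsc.adata.obsm['X_pca_harmony']  # integrated embedding
```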
`highly_variable_genes(n_top_genes=4000, method='seurat_v3')`

Annotate highly variable genes.

Note: Only `seurat_v3` is implemented, and it expects the raw count matrix as input. HVGs are flagged as `True` in `adata.var['highly_variable']`.

Args:
- `n_top_genes` (`int`=`4000`): Number of highly variable genes to keep.
- `method` (`str`=`'seurat_v3'`): Flavor for identifying highly variable genes.
`leiden(resolution=0.5, random_state=42)`

Perform Leiden clustering using `rapids-singlecell`.

Args:
- `resolution` (`float`=`0.5`): A parameter controlling the coarseness of the clustering (called gamma in the modularity formula); higher values lead to more clusters.
- `random_state` (`int`=`42`): Random seed.
`neighbors(n_neighbors=20, n_pcs=50, use_rep='X_pca_harmony', algorithm='cagra')`

Compute a neighborhood graph of observations using `rapids-singlecell`.

Args:
- `n_neighbors` (`int`=`20`): Size of the local neighborhood (in terms of the number of neighboring data points) used for manifold approximation.
- `n_pcs` (`int`=`50`): Number of PCs to use.
- `use_rep` (`str`=`'X_pca_harmony'`): Representation to use.
- `algorithm` (`str`=`'cagra'`): Query algorithm to use.
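The graph feeds both clustering and embedding; the result keys below follow the scanpy/rapids-singlecell convention:

```python
scsc.neighbors(n_neighbors=20, n_pcs=50, use_rep='X_pca_harmony', algorithm='cagra')
scsc.leiden(resolution=0.5, random_state=42)  # clusters land in adata.obs['leiden']
scsc.umap(random_state=42)                    # embedding lands in adata.obsm['X_umap']
```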
`normalize_log1p(target_sum=10000.0)`

Normalize counts per cell, then apply log1p.

Note: If `save_raw_counts` or `save_norm_counts` is set, `adata_X` is automatically written to disk at this step.

Args:
- `target_sum` (`int`=`1e4`): If `None`, after normalization each observation (cell) has a total count equal to the median of total counts for observations (cells) before normalization.
```python
normalize_log1p_pca(
    target_sum=10000.0,
    n_components=50,
    hvg_var='highly_variable'
)
```

An alternative to calling `normalize_log1p` and `pca` separately.

Note: Used when `preload_on_cpu` is `False`.
`pca(n_components=50, hvg_var='highly_variable')`

Principal component analysis.

Computes PCA coordinates, loadings, and variance decomposition, using the scikit-learn implementation.

Note: Directions are flipped according to the largest values in the loadings, so results match up with scanpy exactly. The calculated PCA matrix is stored in `adata.obsm['X_pca']`.

Args:
- `n_components` (`int`=`50`): Number of principal components to compute.
- `hvg_var` (`str`=`'highly_variable'`): Use highly variable genes only.
`save(data_name=None)`

Save `adata` to disk.

Note: Saves to `'{output_dir}/{data_name}.h5ad'`.

Args:
- `data_name` (`str`): If `None`, set to `data_dir`.
`savex(name, data_name=None)`

Save `adata` to disk in chunks.

Note: Each chunk is saved individually in a subfolder under `output_dir`, as `'{output_dir}/{name}/{data_name}_{i}.h5ad'`.

Args:
- `name` (`str`): Subfolder name.
- `data_name` (`str`): If `None`, set to `data_dir`.
`to_CPU()`

Move all chunks to the CPU.

`to_GPU()`

Move all chunks to the GPU.
`umap(random_state=42)`

Embed the neighborhood graph using `rapids-singlecell`.

Args:
- `random_state` (`int`=`42`): Random seed.
Chunked data loader for extremely large single-cell datasets. Returns one data chunk at a time for further processing.
```python
__init__(
    data_dir,
    preload_on_cpu=True,
    preload_on_gpu=False,
    gpus=None,
    max_cell_batch=100000,
    max_gpu_memory_usage=48.0,
    return_anndata=True
)
```
`batch_to_CPU()`

`batch_to_GPU()`

`batchify(axis='cell')`

Return a data generator if `preload_on_cpu` is set to `True`.

`clear()`

`get_merged_adata_with_X()`

`gpu_wrapper(generator)`

`read(fname)`
`set_cells_filter(filter, update=True)`

Update the cells filter and apply it to the data chunks if `update` is set to `True`; otherwise, only update the filter.
`set_genes_filter(filter, update=True)`

Update the genes filter and apply it to the data chunks if `update` is set to `True`; otherwise, only update the filter.

Note: Genes filters can be set sequentially; a new filter should always be compatible with the previously filtered data (see the sketch below).
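To illustrate the compatibility rule, a sketch assuming `filter` is a boolean mask over the current genes and `reader` is an `AnnDataBatchReader` instance (the dimensions are made up):

```python
import numpy as np

# First mask is defined over all genes, e.g. keep 20,000 of 30,000.
keep_first = np.zeros(30_000, dtype=bool)
keep_first[:20_000] = True
reader.set_genes_filter(keep_first)

# The next mask must be defined over the 20,000 surviving genes,
# e.g. narrowing down to 4,000 highly variable ones.
keep_hvg = np.zeros(20_000, dtype=bool)
keep_hvg[:4_000] = True
reader.set_genes_filter(keep_hvg)
```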
`update_by_cells_filter(filter)`

`update_by_genes_filter(filter)`
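Putting the reader together, a hedged usage sketch (argument names follow the `__init__` signature above; the per-chunk processing is a placeholder):

```python
from scalesc.util import AnnDataBatchReader

# Hypothetical folder of input chunks.
reader = AnnDataBatchReader(data_dir='./data/my_dataset',
                            preload_on_cpu=True,
                            max_cell_batch=100_000)

# Iterate chunk by chunk along the cell axis.
for chunk in reader.batchify(axis='cell'):
    ...  # per-chunk processing, e.g. QC or normalization

# Merge everything back into one AnnData that includes the counts.
adata = reader.get_merged_adata_with_X()
```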
This file was automatically generated via lazydocs.