A GPU-accelerated toolkit for large-scale scRNA-seq pipelines.
Highlights • Why ScaleSC • Installation • Tutorial • API Reference
- Fast scRNA-seq pipeline including QC, normalization, batch-effect removal, and dimension reduction, with a syntax similar to `scanpy` and `rapids-singlecell` (a sketch follows this list).
- Scales to datasets with more than 10M cells on a single GPU (A100 80G).
- Chunks the data to avoid the `int32` limitation in `cupyx.scipy.sparse` (used by `rapids-singlecell`), which otherwise blocks computation on moderate-size datasets (~1.3M cells) without multi-GPU support.
- Reconciles the output of each step with `scanpy` to reproduce the same results as on the CPU end.
- Improves `harmonypy` so that datasets with more than 10M cells and more than 1000 samples can run on a single GPU.
- Speeds up and optimizes the `NSForest` algorithm on GPU for better marker gene identification.
- Merges clusters according to the gene expression of markers detected by `NSForest`.
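The whole pipeline reads like scanpy. As a minimal sketch, assuming the integrated pipeline class is exported as `scalesc.ScaleSC` and a hypothetical dataset folder (method names are taken from the API reference below):

```python
import scalesc

# `ScaleSC` as the export name is an assumption based on the package name.
scsc = scalesc.ScaleSC(data_dir='./data/my_dataset', output_dir='./results')

scsc.calculate_qc_metrics()
scsc.filter_genes_and_cells(min_counts_per_gene=3, min_counts_per_cell=200)
scsc.normalize_log1p(target_sum=1e4)
scsc.highly_variable_genes(n_top_genes=4000)
scsc.pca(n_components=50)
scsc.harmony(sample_col_name='sample')  # batch-effect removal; obs column name assumed
scsc.neighbors(n_neighbors=20, n_pcs=50, use_rep='X_pca_harmony')
scsc.leiden(resolution=0.5)
scsc.umap()
scsc.save(data_name='my_dataset_processed')
```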
| | Scanpy | ScaleSC | Rapids-singlecell |
|---|---|---|---|
| GPU Support | ❌ | ✅ | ✅ |
| Avoids `int32` Issue in Sparse | ✅ | ✅ | ❌ |
| Upper Limit of #cells | 5M | ~20M | ~1M |
| Upper Limit of #samples | <100 | >1000 | <100 |
Requirements:
- RAPIDS from NVIDIA
- rapids-singlecell, an alternative to scanpy that uses the GPU for acceleration
- Conda, version >= 22.11 strongly encouraged, since conda-libmamba-solver is then the default solver, which significantly speeds up dependency resolution
- pip, the Python package installer
Environment Setup:

1. Install RAPIDS through Conda:

   ```bash
   conda create -n ScaleSC -c rapidsai -c conda-forge -c nvidia rapids=25.02 python=3.12 'cuda-version>=12.0,<=12.8'
   ```

   Users have the flexibility to install it according to their systems by using this online selector. We highly recommend installing **RAPIDS** >= 24.12, which fixes a bug in the Leiden algorithm that resulted in too many clusters.

2. Activate the conda env:

   ```bash
   conda activate ScaleSC
   ```

3. Install rapids-singlecell using pip:

   ```bash
   pip install rapids-singlecell
   ```

4. Install ScaleSC:

   ```bash
   # pull ScaleSC from GitHub
   git clone https://github.com/interactivereport/ScaleSC.git
   # enter the folder and install ScaleSC
   cd ScaleSC
   pip install .
   ```

5. Check the env:

   ```bash
   python -c "import scalesc; print(scalesc.__version__)"     # == 0.1.0
   python -c "import cupy; print(cupy.__version__)"           # >= 13.3.0
   python -c "import cuml; print(cuml.__version__)"           # >= 24.10
   python -c "import cupy; print(cupy.cuda.is_available())"   # True
   python -c "import xgboost; print(xgboost.__version__)"     # >= 2.1.1, optional, for marker annotation
   ```

   See this tutorial for details.
Please cite ScaleSC, as well as Scanpy, Rapids-singlecell, NSForest, and AnnData, according to their respective instructions.
- 2/26/2025:
  - Added a parameter `threshold` to the function `adata_cluster_merge` to support cluster merging at various scales according to the user's specification. `threshold` ranges from 0 to 1 and is set to 0 by default (a sketch follows this changelog).
  - Updated the tutorial with a few more examples of cluster merging.
- Future work: add support for loading from large `.h5ad` files.
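For the new `threshold` parameter, a hedged sketch (the function name comes from the changelog above; the rest of the call is a guess, so check the tutorial for the exact usage):

```python
import scalesc

# `adata` is assumed to be a clustered AnnData produced by the pipeline.
# `threshold` controls the scale of merging (valid range 0-1, default 0).
merged = scalesc.adata_cluster_merge(adata, threshold=0.3)
```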
ScaleSC integrated pipeline in a scanpy-like style.

It automatically loads the dataset in chunks (see `scalesc.util.AnnDataBatchReader` for details), and all methods in this class operate on the chunked data.
Args:
- `data_dir` (`str`): Data folder of the dataset.
- `max_cell_batch` (`int`): Maximum number of cells in a single batch. Default: `100000`.
- `preload_on_cpu` (`bool`): Whether to load the entire chunked data on the CPU. Default: `True`.
- `preload_on_gpu` (`bool`): Whether to load the entire chunked data on the GPU; `preload_on_cpu` is overwritten to `True` when this is set to `True`. Default: `True`.
- `save_raw_counts` (`bool`): Whether to save `adata_X` to disk after QC filtering. Default: `False`.
- `save_norm_counts` (`bool`): Whether to save `adata_X` to disk after normalization. Default: `False`.
- `save_after_each_step` (`bool`): Whether to save `adata` (without `.X`) to disk after each step. Default: `False`.
- `output_dir` (`str`): Output folder. Default: `'./results'`.
- `gpus` (`list`): List of GPU IDs; `[0]` is used if `None`. Default: `None`.
```python
__init__(
    data_dir,
    max_cell_batch=100000.0,
    preload_on_cpu=True,
    preload_on_gpu=True,
    save_raw_counts=False,
    save_norm_counts=False,
    save_after_each_step=False,
    output_dir='results',
    gpus=None
)
```
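For example, a construction sketch using the documented defaults (the import path is an assumption):

```python
from scalesc import ScaleSC  # assumed export path

scsc = ScaleSC(
    data_dir='./data/my_dataset',  # hypothetical dataset folder
    max_cell_batch=100000,
    preload_on_cpu=True,
    preload_on_gpu=True,   # keeps chunks on GPU and forces preload_on_cpu=True
    save_raw_counts=False,
    save_norm_counts=False,
    save_after_each_step=False,
    output_dir='./results',
    gpus=[0],              # single GPU; defaults to [0] when None
)
```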
`adata` (`AnnData`): An AnnData object used to store all intermediate results, without the count matrix.

Note: This is always kept on the CPU.

`adata_X` (`AnnData`): An AnnData object used to store all intermediate results, including the count matrix.

Note: Internally, all chunks should be merged on the CPU to avoid high GPU memory consumption; make sure to invoke `to_CPU()` before accessing this object.
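In practice that means moving the chunks back to host memory before touching this object; a short sketch, continuing the `scsc` instance from above:

```python
scsc.to_CPU()              # merge/keep all chunks in host memory first
adata_full = scsc.adata_X  # AnnData including the merged count matrix
```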
`calculate_qc_metrics()`

Calculate quality control metrics.

`clear()`

Clean up the memory.

`filter_cells(min_count=0, max_count=None, qc_var='n_genes_by_counts', qc=False)`

Filter cells based on a QC metric.
Args:
- `min_count` (`int`): Minimum count of the QC metric required for a cell to pass filtering.
- `max_count` (`int`): Maximum count of the QC metric allowed for a cell to pass filtering.
- `qc_var` (`str`=`'n_genes_by_counts'`): Feature in the QC metrics used to filter cells.
- `qc` (`bool`=`False`): Call `calculate_qc_metrics` before filtering.
`filter_genes(min_count=0, max_count=None, qc_var='n_cells_by_counts', qc=False)`

Filter genes based on a QC metric.

Args:
- `min_count` (`int`): Minimum count of the QC metric required for a gene to pass filtering.
- `max_count` (`int`): Maximum count of the QC metric allowed for a gene to pass filtering.
- `qc_var` (`str`=`'n_cells_by_counts'`): Feature in the QC metrics used to filter genes.
- `qc` (`bool`=`False`): Call `calculate_qc_metrics` before filtering.
```python
filter_genes_and_cells(
    min_counts_per_gene=0,
    min_counts_per_cell=0,
    max_counts_per_gene=None,
    max_counts_per_cell=None,
    qc_var_gene='n_cells_by_counts',
    qc_var_cell='n_genes_by_counts',
    qc=False
)
```

Filter genes and cells based on QC metrics.

Note: This is an efficient way to perform regular filtering on genes and cells without repeatedly iterating over chunks (see the example below).

Args:
- `min_counts_per_gene` (`int`): Minimum count of the QC metric required for a gene to pass filtering.
- `max_counts_per_gene` (`int`): Maximum count of the QC metric allowed for a gene to pass filtering.
- `qc_var_gene` (`str`=`'n_cells_by_counts'`): Feature in the QC metrics used to filter genes.
- `min_counts_per_cell` (`int`): Minimum count of the QC metric required for a cell to pass filtering.
- `max_counts_per_cell` (`int`): Maximum count of the QC metric allowed for a cell to pass filtering.
- `qc_var_cell` (`str`=`'n_genes_by_counts'`): Feature in the QC metrics used to filter cells.
- `qc` (`bool`=`False`): Call `calculate_qc_metrics` before filtering.
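A typical combined call, with illustrative thresholds:

```python
# One pass over the chunks filters genes and cells together; the
# thresholds below are illustrative, not recommendations.
scsc.filter_genes_and_cells(
    min_counts_per_gene=3,    # keep genes counted in at least 3 cells
    min_counts_per_cell=200,  # keep cells expressing at least 200 genes
    qc=True,                  # compute QC metrics before filtering
)
```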
`harmony(sample_col_name, n_init=10, max_iter_harmony=20)`

Use Harmony to integrate different experiments.

Note: This modified Harmony implementation easily scales up to 15M cells with 50 PCs on a single GPU (A100 80G). The result is stored in `adata.obsm['X_pca_harmony']`.

Args:
- `sample_col_name` (`str`): Column of the sample ID.
- `n_init` (`int`=`10`): Number of times the k-means algorithm is run with different centroid seeds.
- `max_iter_harmony` (`int`=`20`): Maximum number of Harmony iterations.
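For example (the `obs` column name is an assumption about your dataset):

```python
# 'sample_id' is an assumed column in adata.obs identifying each experiment.
scsc.harmony(sample_col_name='sample_id', n_init=10, max_iter_harmony=20)
corrected = scsc.adata.obsm['X_pca_harmony']  # integrated embedding
```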
`highly_variable_genes(n_top_genes=4000, method='seurat_v3')`

Annotate highly variable genes.

Note: Only `seurat_v3` is implemented, and it expects the raw count matrix as input. HVGs are flagged as `True` in `adata.var['highly_variable']`.

Args:
- `n_top_genes` (`int`=`4000`): Number of highly variable genes to keep.
- `method` (`str`=`'seurat_v3'`): Flavor for identifying highly variable genes.
`leiden(resolution=0.5, random_state=42)`

Perform Leiden clustering using `rapids-singlecell`.

Args:
- `resolution` (`float`=`0.5`): A parameter controlling the coarseness of the clustering (called gamma in the modularity formula); higher values lead to more clusters.
- `random_state` (`int`=`42`): Random seed.
`neighbors(n_neighbors=20, n_pcs=50, use_rep='X_pca_harmony', algorithm='cagra')`

Compute a neighborhood graph of observations using `rapids-singlecell`.

Args:
- `n_neighbors` (`int`=`20`): Size of the local neighborhood (in terms of the number of neighboring data points) used for manifold approximation.
- `n_pcs` (`int`=`50`): Number of PCs to use.
- `use_rep` (`str`=`'X_pca_harmony'`): Representation to use.
- `algorithm` (`str`=`'cagra'`): Query algorithm to use.
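The graph feeds both clustering and embedding; the result keys below follow the scanpy/rapids-singlecell convention:

```python
scsc.neighbors(n_neighbors=20, n_pcs=50, use_rep='X_pca_harmony', algorithm='cagra')
scsc.leiden(resolution=0.5, random_state=42)  # clusters land in adata.obs['leiden']
scsc.umap(random_state=42)                    # embedding lands in adata.obsm['X_umap']
```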
`normalize_log1p(target_sum=10000.0)`

Normalize counts per cell, then apply log1p.

Note: If `save_raw_counts` or `save_norm_counts` is set, `adata_X` is automatically written to disk at this step.

Args:
- `target_sum` (`int`=`1e4`): If `None`, after normalization each observation (cell) has a total count equal to the median of total counts for observations (cells) before normalization.
```python
normalize_log1p_pca(
    target_sum=10000.0,
    n_components=50,
    hvg_var='highly_variable'
)
```

An alternative to calling `normalize_log1p` and `pca` separately.

Note: Used when `preload_on_cpu` is `False`.
`pca(n_components=50, hvg_var='highly_variable')`

Principal component analysis.

Computes PCA coordinates, loadings, and variance decomposition, using the scikit-learn implementation.

Note: Directions are flipped according to the largest values in the loadings, so results match up with scanpy exactly. The calculated PCA matrix is stored in `adata.obsm['X_pca']`.

Args:
- `n_components` (`int`=`50`): Number of principal components to compute.
- `hvg_var` (`str`=`'highly_variable'`): Use highly variable genes only.
`save(data_name=None)`

Save `adata` to disk.

Note: Saves to `'{output_dir}/{data_name}.h5ad'`.

Args:
- `data_name` (`str`): If `None`, set to `data_dir`.
`savex(name, data_name=None)`

Save `adata` to disk in chunks.

Note: Each chunk is saved individually in a subfolder under `output_dir`, as `'{output_dir}/{name}/{data_name}_{i}.h5ad'`.

Args:
- `name` (`str`): Subfolder name.
- `data_name` (`str`): If `None`, set to `data_dir`.
`to_CPU()`

Move all chunks to the CPU.

`to_GPU()`

Move all chunks to the GPU.
`umap(random_state=42)`

Embed the neighborhood graph using `rapids-singlecell`.

Args:
- `random_state` (`int`=`42`): Random seed.
Chunked data loader for extremely large single-cell datasets. Returns one data chunk at a time for further processing.
```python
__init__(
    data_dir,
    preload_on_cpu=True,
    preload_on_gpu=False,
    gpus=None,
    max_cell_batch=100000,
    max_gpu_memory_usage=48.0,
    return_anndata=True
)
```
`batch_to_CPU()`

`batch_to_GPU()`

`batchify(axis='cell')`

Return a data generator if `preload_on_cpu` is set to `True`.

`clear()`

`get_merged_adata_with_X()`

`gpu_wrapper(generator)`

`read(fname)`
`set_cells_filter(filter, update=True)`

Update the cells filter and apply it to the data chunks if `update` is set to `True`; otherwise, only update the filter.
`set_genes_filter(filter, update=True)`

Update the genes filter and apply it to the data chunks if `update` is set to `True`; otherwise, only update the filter.

Note: Genes filters can be set sequentially; a new filter should always be compatible with the previously filtered data (see the sketch below).
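To illustrate the compatibility rule, a sketch assuming `filter` is a boolean mask over the current genes and `reader` is an `AnnDataBatchReader` instance (the dimensions are made up):

```python
import numpy as np

# First mask is defined over all genes, e.g. keep 20,000 of 30,000.
keep_first = np.zeros(30_000, dtype=bool)
keep_first[:20_000] = True
reader.set_genes_filter(keep_first)

# The next mask must be defined over the 20,000 surviving genes,
# e.g. narrowing down to 4,000 highly variable ones.
keep_hvg = np.zeros(20_000, dtype=bool)
keep_hvg[:4_000] = True
reader.set_genes_filter(keep_hvg)
```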
`update_by_cells_filter(filter)`

`update_by_genes_filter(filter)`
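Putting the reader together, a hedged usage sketch (argument names follow the `__init__` signature above; the per-chunk processing is a placeholder):

```python
from scalesc.util import AnnDataBatchReader

# Hypothetical folder of input chunks.
reader = AnnDataBatchReader(data_dir='./data/my_dataset',
                            preload_on_cpu=True,
                            max_cell_batch=100_000)

# Iterate chunk by chunk along the cell axis.
for chunk in reader.batchify(axis='cell'):
    ...  # per-chunk processing, e.g. QC or normalization

# Merge everything back into one AnnData that includes the counts.
adata = reader.get_merged_adata_with_X()
```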
This file was automatically generated via lazydocs.