Skip to content

interactivereport/ScaleSC

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

86 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ScaleSC

A GPU-accelerated tool for large-scale scRNA-seq pipeline.

Highlights • Why ScaleSC • Installation • Tutorial • API Reference

Highlights

  • Fast scRNA-seq pipeline including QC, Normalization, Batch-effect Removal, Dimension Reduction in a similar syntax as scanpy and rapids-singlecell.
  • Scale to dataset with more than 10M cells on a single GPU. (A100 80G)
  • Chunk the data to avoid the int32 limitation in cupyx.scipy.sparse used by rapids-singlecell that disables the computing for moderate-size datasets (~1.3M) without Multi-GPU support.
  • Reconcile output at each step to scanpy to reproduce the same results as on CPU end.
  • Improvement on harmonypy which allows dataset with more than 10M cells and more than 1000 samples to be run on a single GPU.
  • Speedup and optimize NSForest algorithm using GPU for better maker gene identification.
  • Merge clusters according to the gene expression of detected markers by NSForest.

Why ScaleSC

What can ScaleSC do?

pipeline

ScaleSC Overview

overview



Overview of different packages*
Scanpy ScaleSC Rapids-singlecell
GPU Support ❌ ✅ ✅
int32 Issue in Sparse ✅ ✅ ❌
Upper Limit of #cell 5M ~20M ~1M
Upper Limit of #sample <100 >1000 <100



Time comparsion between scanpy(CPU) and ScaleSC(GPU) on A100(80G)

time-comp

How To Install

Note: ScaleSC requires a high-end GPU (> 24G VRAM) and a matching CUDA version to support GPU-accelerated computing.

Requirements:

  • RAPIDS from Nvidia
  • rapids-singlecell, an alternative of scanpy that employs GPU for acceleration.
  • Conda, version >=22.11 is strongly encouraged, because conda-libmamba-solver is set as default, which significantly speeds up solving dependencies.
  • pip, a python package installer.

Environment Setup:

  1. Install RAPIDS through Conda,
    conda create -n ScaleSC -c rapidsai -c conda-forge -c nvidia rapids=25.02 python=3.12 'cuda-version>=12.0,<=12.8' Users have the flexibility to install it according to their systems by using this online selector. We highly recommend installing **RAPIDS**>=24.12, it solves a bug related to the Leiden algorithm, which results in too many clusters.

  2. Activate conda env,
    conda activate ScaleSC

  3. Install rapids-singlecell using pip,
    pip install rapids-singlecell

  4. Install ScaleSC,

    • Pull ScaleSC from GitHub
      git clone https://github.com/interactivereport/ScaleSC.git
    • Enter the folder and install ScaleSC
      cd scalesc
      pip install .
  5. Check env:

    • python -c "import scalesc; print(scalesc.__version__)" == 0.1.0
    • python -c "import cupy; print(cupy.__version__)" >= 13.3.0
    • python -c "import cuml; print(cuml.__version__)" >= 24.10
    • python -c "import cupy; print(cupy.cuda.is_available())" = True
    • python -c "import xgboost; print(xgboost.__version__) >= 2.1.1, optionally for marker annotation

Tutorial:

Citation

Please cite ScaleSC, and Scanpy, Rapids-singlecell, NSForest, AnnData according to their instructions respectively.

Updates:

  • 2/26/2025:
    • adding a parameter threshold in function adata_cluster_merge to support cluster merging at various scales according to the user's specification. threshold is between 0 and 1. Set to 0 by default.
    • Updating a few more examples of cluster merging in the tutorial.
    • future work: adding support for loading from large .h5ad files.

Contact

API Reference

class ScaleSC

ScaleSC integrated pipeline in a scanpy-like style.

It will automatically load the dataset in chunks, see scalesc.util.AnnDataBatchReader for details, and all methods in this class manipulate this chunked data.

Args:

  • data_dir (str): Data folder of the dataset.
  • max_cell_batch (int): Maximum number of cells in a single batch.
  • Default: 100000.
  • preload_on_cpu (bool): If load the entire chunked data on CPU. Default: True
  • preload_on_gpu (bool): If the entire chunked data is on GPU, preload_on_cpu will be overwritten to True when this is set to True. The default is True.
  • save_raw_counts (bool): If save adata_X to disk after QC filtering.
  • Default: False.
  • save_norm_counts (bool): If save adata_X data to disk after normalization.
  • Default: False.
  • save_after_each_step (bool): If save adata (without .X) to disk after each step.
  • Default: False.
  • output_dir (str): Output folder. Default: './results'.
  • gpus (list): List of GPU ids, [0] is set if this is None. Default: None.

method __init__

__init__(
    data_dir,
    max_cell_batch=100000.0,
    preload_on_cpu=True,
    preload_on_gpu=True,
    save_raw_counts=False,
    save_norm_counts=False,
    save_after_each_step=False,
    output_dir='results',
    gpus=None
)

property adata

AnnData: An AnnData object that is used to store all intermediate results without the count matrix.

Note: This is always on the CPU.


property adata_X

AnnData: An AnnData object that is used to store all intermediate results, including the count matrix. Internally, all chunks should be merged on CPU to avoid high GPU consumption; make sure to invoke to_CPU() before calling this object.


method calculate_qc_metrics

calculate_qc_metrics()

Calculate quality control metrics.


method clear

clear()

Clean the memory


method filter_cells

filter_cells(min_count=0, max_count=None, qc_var='n_genes_by_counts', qc=False)

Filter genes based on the number of a QC metric.

Args:

  • min_count (int): Minimum number of counts required for a cell to pass filtering.
  • max_count (int): Maximum number of counts required for a cell to pass filtering.
  • qc_var (str='n_genes_by_counts'): Feature in QC metrics that is used to filter cells.
  • qc (bool=False): Call calculate_qc_metrics before filtering.

method filter_genes

filter_genes(min_count=0, max_count=None, qc_var='n_cells_by_counts', qc=False)

Filter genes based on the number of a QC metric.

Args:

  • min_count (int): Minimum number of counts required for a gene to pass filtering.
  • max_count (int): Maximum number of counts required for a gene to pass filtering.
  • qc_var (str='n_cells_by_counts'): Feature in QC metrics that is used to filter genes.
  • qc (bool=False): Call calculate_qc_metrics before filtering.

method filter_genes_and_cells

filter_genes_and_cells(
    min_counts_per_gene=0,
    min_counts_per_cell=0,
    max_counts_per_gene=None,
    max_counts_per_cell=None,
    qc_var_gene='n_cells_by_counts',
    qc_var_cell='n_genes_by_counts',
    qc=False
)

Filter genes based on the number of a QC metric.

Note:

This is an efficient way to perform regular filtering on genes and cells without repeatedly iterating over chunks.

Args:

  • min_counts_per_gene (int): Minimum number of counts required for a gene to pass filtering.
  • max_counts_per_gene (int): Maximum number of counts required for a gene to pass filtering.
  • qc_var_gene (str='n_cells_by_counts'): Feature in QC metrics that is used to filter genes.
  • min_counts_per_cell (int): Minimum number of counts required for a cell to pass filtering.
  • max_counts_per_cell (int): Maximum number of counts required for a cell to pass filtering.
  • qc_var_cell (str='n_genes_by_counts'): Feature in QC metrics that is used to filter cells.
  • qc (bool=False): Call calculate_qc_metrics before filtering.

method harmony

harmony(sample_col_name, n_init=10, max_iter_harmony=20)

Use Harmony to integrate different experiments.

Note:

This modified harmony function can easily scale up to 15M cells with 50 pcs on GPU (A100 80G). Result after harmony is stored into adata.obsm['X_pca_harmony'].

Args:

  • sample_col_name (str): Column of sample ID.
  • n_init (int=10): Number of times the k-means algorithm is run with different centroid seeds.
  • max_iter_harmony (int=20): Maximum iteration number of harmony.

method highly_variable_genes

highly_variable_genes(n_top_genes=4000, method='seurat_v3')

Annotate highly variable genes.

Note:

Only seurat_v3 is implemented. The raw count matrix is expected as input for seurat_v3. HVGs are set to True in adata.var['highly_variable'].

Args:

  • n_top_genes (int=4000): Number of highly-variable genes to keep.
  • method (str='seurat_v3'): Choose the flavor for identifying highly variable genes.

method leiden

leiden(resolution=0.5, random_state=42)

Performs Leiden clustering using rapids-singlecell.

Args:

  • resolution (float=0.5): A parameter value controlling the coarseness of the clustering. (called gamma in the modularity formula). Higher values lead to more clusters.
  • random_state (int=42): Random seed.

method neighbors

neighbors(n_neighbors=20, n_pcs=50, use_rep='X_pac_harmony', algorithm='cagra')

Compute a neighborhood graph of observations using rapids-singlecell.

Args:

  • n_neighbors (int=20): The size of local neighborhood (in terms of number of neighboring data points) used for manifold approximation.
  • n_pcs (int=50): Use this many PCs.
  • use_rep (str='X_pca_harmony'): Use the indicated representation.
  • algorithm (str='cagra'): The query algorithm to use.

method normalize_log1p

normalize_log1p(target_sum=10000.0)

Normalize counts per cell, then log1p.

Note:

If save_raw_counts or save_norm_counts is set, write adata_X to disk here automatically.

Args:

  • target_sum (int=1e4): If None, after normalization, each observation (cell) has a total count equal to the median of total counts for observations (cells) before normalization.

method normalize_log1p_pca

normalize_log1p_pca(
    target_sum=10000.0,
    n_components=50,
    hvg_var='highly_variable'
)

An alternative for calling normalize_log1p and pca together.

Note:

Used when preload_on_cpu is False.


method pca

pca(n_components=50, hvg_var='highly_variable')

Principal component analysis.

Computes PCA coordinates, loadings, and variance decomposition. Uses the implementation of scikit-learn.

Note:

Flip the directions according to the largest values in loadings. Results will match up with scanpy perfectly. Calculated PCA matrix is stored in adata.obsm['X_pca'].

Args:

  • n_components (int=50): Number of principal components to compute.
  • hvg_var (str='highly_variable'): Use highly variable genes only.

method save

save(data_name=None)

Save adata to disk.

Note:

Save to 'output_dir/data_name.h5ad'.

Args:

  • data_name (str): If None, set as data_dir.

method savex

savex(name, data_name=None)

Save adata to disk in chunks.

Note:

Each chunk will be saved individually in a subfolder under output_dir. Save to 'output_dir/name/data_name_i.h5ad'.

Args:

  • name (str): Subfolder name.
  • data_name (str): If None, set as data_dir.

method to_CPU

to_CPU()

Move all chunks to the CPU.


method to_GPU

to_GPU()

Move all chunks to the GPU.


method umap

umap(random_state=42)

Embed the neighborhood graph using rapids-singlecell.

Args:

  • random_state (int=42): Random seed.

class AnnDataBatchReader

Chunked dataloader for extremely large single-cell dataset. Return a data chunk each time for further processing.

method __init__

__init__(
    data_dir,
    preload_on_cpu=True,
    preload_on_gpu=False,
    gpus=None,
    max_cell_batch=100000,
    max_gpu_memory_usage=48.0,
    return_anndata=True
)

property shape


method batch_to_CPU

batch_to_CPU()

method batch_to_GPU

batch_to_GPU()

method batchify

batchify(axis='cell')

Return a data generator if preload_on_cpu is set as True.


method clear

clear()

method get_merged_adata_with_X

get_merged_adata_with_X()

method gpu_wrapper

gpu_wrapper(generator)

method read

read(fname)

method set_cells_filter

set_cells_filter(filter, update=True)

Update the cells filter and apply it to data chunks if update is set to True, otherwise, update the filter only.


method set_genes_filter

set_genes_filter(filter, update=True)

Update genes filter and apply on data chunks if update set to True, otherwise, update filter only.

Note:

Genes filter can be set sequentially; a new filter should always be compatible with the previously filtered data.


method update_by_cells_filter

update_by_cells_filter(filter)

method update_by_genes_filter

update_by_genes_filter(filter)

This file was automatically generated via lazydocs.

About

ScaleSC: A superfast and scalable single cell RNA-seq data analysis pipeline powered by GPU. https://doi.org/10.1101/2025.01.28.635256

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •  

Languages