This project provides a pipeline for single-cell data analysis using the Julia packages scLENS and scICE.
The scICE package, which stands for Single Cell Inconsistency Clustering Estimator, is specifically designed to perform multiple clustering runs and extract only the reliable labels that consistently appear across the runs.
The pipeline covers data preprocessing, embedding, clustering, and visualization of results. The code uses CUDA for GPU acceleration when available and runs on the CPU otherwise.
To run this project, you will need the following:
- Julia: Version 1.6 or higher recommended. Tested on Julia 1.11.4.
- Python: Required for certain dependencies. Tested with Python 3.12.3.
- Operating Systems: Tested on Windows 11 and Ubuntu 22.04.
- Key Julia Packages: CUDA, CSV, scLENS, CairoMakie (these are installed automatically via Pkg.instantiate() from the project environment).
- Hardware (for GPU acceleration): NVIDIA GPU with CUDA capability.
- Software: Appropriate NVIDIA drivers must be installed.
- CUDA Toolkit: Tested with CUDA Toolkit 12.2.
- Clone the repository:

      git clone https://github.com/Mathbiomed/scICE
      cd scICE

- Activate the project environment: In the Julia REPL, navigate to the project directory (scICE) and run:

      import Pkg
      Pkg.activate(".")
      Pkg.instantiate()

  Note: Installation, including package downloads and precompilation via Pkg.instantiate(), typically takes 5 to 15 minutes depending on your system and network connection.

The following steps describe how to use this project to analyze single-cell data using the example script example.jl.
- Configure Processing Cores: Set the number of CPU cores for processing:

      ENV["NUM_CORES"] = "12"

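Environment variables are strings, so the value has to be parsed before it can be used as a core count. How scICE itself reads NUM_CORES is not shown here; the following is only a minimal sketch of one defensive way a script could interpret the setting. The helper name resolve_num_cores, its default, and the clamping rule are assumptions made for illustration.

```julia
# Hypothetical helper (not part of scICE): parse NUM_CORES and keep it
# between 1 and the number of logical CPU threads on this machine.
function resolve_num_cores(env=ENV; default::Int=1, max_cores::Int=Sys.CPU_THREADS)
    raw = get(env, "NUM_CORES", string(default))
    n = tryparse(Int, raw)
    n === nothing && return default   # non-numeric value: fall back to the default
    return clamp(n, 1, max_cores)
end

# Usage with a plain Dict standing in for ENV:
resolve_num_cores(Dict("NUM_CORES" => "12"); max_cores=8)   # returns 8
```

Passing a Dict instead of ENV keeps the sketch testable without touching the real environment.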
- Set up the environment: Load the necessary packages and include the local scICE file:

      using CUDA, CSV, DataFrames, scLENS
      include("src/scICE.jl")
      using CairoMakie
      CairoMakie.activate!(type="png")

- Device Selection: The device (CPU or GPU) is automatically selected based on CUDA availability:

      cur_dev = if CUDA.has_cuda()
          "gpu"
      else
          "cpu"
      end

- Data Preprocessing: Load your single-cell data (the example uses a compressed CSV file) and preprocess it:

      ndf = scLENS.read_file(raw"data/Z8eq.csv.gz")
      pre_df = scLENS.preprocess(ndf)

- Embedding Creation: Create an embedding for the preprocessed data using scLENS:

      sclens_embedding = scLENS.sclens(pre_df, device_=cur_dev)
      CSV.write("out/pca.csv", sclens_embedding[:pca_n1])

- UMAP Transformation: Apply UMAP to the embedding and save the results:

      scLENS.apply_umap!(sclens_embedding)
      CSV.write("out/umap.csv", DataFrame(sclens_embedding[:umap], :auto))

- Visualization: Plot the UMAP distribution and save the output:

      panel_0 = scLENS.plot_embedding(sclens_embedding)
      save("out/umap_dist.png", panel_0)

- Applying scICE: Apply scICE clustering to the embedding:

      clustering!(sclens_embedding)

  By default, scICE explores cluster numbers ranging from 1 to 20 (this is the default value for the optional second argument r, as seen in the function signature clustering!(a_dict, r=[1,20]; ...)). If you wish to focus the analysis on a specific range of cluster numbers, for instance from 5 to 10 clusters, provide this range as the second argument:

      clustering!(sclens_embedding, [5,10])

  This enables you to find consistent cluster labels more efficiently within an anticipated range.

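Note that the second argument is a two-element [lo, hi] range, not a list of individual cluster numbers. The sketch below only illustrates that reading, assuming both endpoints are inclusive as the "from 5 to 10 clusters" wording suggests. The helper candidate_cluster_numbers is hypothetical and is not part of scICE.

```julia
# Hypothetical helper (not part of scICE): expand a two-element [lo, hi]
# range into the cluster numbers it covers, assuming inclusive endpoints.
function candidate_cluster_numbers(r::AbstractVector{<:Integer})
    length(r) == 2 || throw(ArgumentError("expected [lo, hi], got $(r)"))
    lo, hi = r
    1 <= lo <= hi || throw(ArgumentError("need 1 <= lo <= hi, got $(r)"))
    return collect(lo:hi)
end

candidate_cluster_numbers([5, 10])   # returns [5, 6, 7, 8, 9, 10]
```

Under this reading, [5,10] asks for six candidate cluster numbers rather than two.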
- Inconsistency Coefficient Visualization: Plot the inconsistency coefficient (IC) and save the figure:

      panel_1 = plot_ic(sclens_embedding)
      save("out/ic_plot.png", panel_1)

- Consistent Cluster Label Extraction: Extract consistent cluster labels using get_rlabel! and save them to a CSV file. This function filters labels based on an Inconsistency Coefficient (IC) threshold.

      label_out = get_rlabel!(sclens_embedding)
      CSV.write("out/consistent_labels.csv", label_out)

  The IC threshold parameter (th) defaults to 1.005. This value is passed as the optional second argument to get_rlabel! and can be adjusted if needed. For example, to change the threshold to 1.01, you would call the function like this:

      label_out = get_rlabel!(sclens_embedding, 1.01)

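To see what raising the threshold does, consider a toy example. The IC values below are made up for illustration, and the rule "keep cluster numbers whose IC does not exceed th" is an assumption about how the filter behaves; the exact comparison used inside get_rlabel! is not shown here.

```julia
# Hypothetical IC values per cluster number; real values come from clustering!.
ic_by_k = Dict(2 => 1.000, 3 => 1.002, 4 => 1.030, 5 => 1.004, 6 => 1.008)

# Assumed rule for illustration: keep cluster numbers whose IC is at most th.
consistent_ks(ic, th) = sort([k for (k, v) in ic if v <= th])

consistent_ks(ic_by_k, 1.005)   # returns [2, 3, 5]
consistent_ks(ic_by_k, 1.01)    # returns [2, 3, 5, 6]
```

In this toy case, loosening the threshold from 1.005 to 1.01 admits one more cluster number (6), while the clearly inconsistent one (4) stays excluded.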
- Cluster Visualization: Set the number of clusters and visualize them with labels:

      n_clusters = 9
      panel_2 = scLENS.plot_embedding(sclens_embedding, label_out[!, "l_$n_clusters"])
      save("out/umap_dist_with_label$n_clusters.png", panel_2)

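The label column is looked up by name ("l_9" for n_clusters = 9), so indexing fails if that cluster number is not among the columns of label_out. A minimal sketch of a check that could be run first follows. It assumes label_out only has columns for the cluster numbers that passed the IC filter; the column list and helper names are hypothetical, and in a real session the list would come from names(label_out).

```julia
# Hypothetical column names; in a real session use names(label_out).
available = ["l_2", "l_3", "l_5", "l_9"]

# Build the column name the same way the example script does.
label_column(n) = "l_$(n)"

# Check whether labels for n clusters are available before plotting.
has_labels(cols, n) = label_column(n) in cols

has_labels(available, 9)   # returns true
has_labels(available, 7)   # returns false
```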
- Save result as AnnData:

      scLENS.save_anndata("out/test.h5ad", sclens_embedding)

Running the example.jl script with the provided sample data (~10,000 cells):
- scLENS embedding (sclens function): Approximately 2-5 minutes.
- scICE clustering (clustering! function): Approximately 10-15 minutes.
Note: Runtimes can vary significantly based on your hardware (CPU/GPU specifics, RAM), the number of cores configured, and the size/complexity of the input data.
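Because these figures depend on the machine, it can be useful to time the individual steps yourself. The helper below is a hypothetical sketch, not part of scLENS or scICE. It wraps any zero-argument function and reports the elapsed wall-clock time, and the workload shown is only a stand-in.

```julia
# Hypothetical helper: run a zero-argument function, print and return its runtime.
function timed(f, label::AbstractString)
    t0 = time()
    result = f()
    secs = time() - t0
    println(label, ": ", round(secs; digits=2), " s")
    return result, secs
end

# Stand-in workload; in the pipeline this could instead be, for example,
#   () -> scLENS.sclens(pre_df, device_=cur_dev)
result, secs = timed(() -> sum(abs2, 1:1_000_000), "demo step")
```

The first call of a Julia function includes compilation time, so a second run of the same step is usually a fairer measure of steady-state speed.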
Running the example.jl script will generate the following files in the out/ directory:
- pca.csv: PCA results.
- umap.csv: UMAP coordinates.
- umap_dist.png: Visualization of the UMAP embedding.
- ic_plot.png: Plot of the inconsistency coefficient.
- consistent_labels.csv: Consistent cluster labels generated by scICE.
- umap_dist_with_label<n_clusters>.png: UMAP embedding colored by consistent cluster labels (e.g., umap_dist_with_label9.png).
- test.h5ad: Output saved in AnnData format for compatibility with Python tools.
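After a run, a quick check can confirm that every expected file was written. The sketch below uses only Julia's standard library, and the helper names are hypothetical. It also assumes the out/ directory may need to be created before the script writes into it (mkpath("out") does that and is harmless if the directory already exists).

```julia
# File names taken from the list above; the label plot depends on n_clusters.
expected_outputs(n_clusters) = [
    "pca.csv", "umap.csv", "umap_dist.png", "ic_plot.png",
    "consistent_labels.csv", "umap_dist_with_label$(n_clusters).png", "test.h5ad",
]

# Return the expected files that are not present in `dir`.
missing_outputs(dir, n_clusters) =
    [f for f in expected_outputs(n_clusters) if !isfile(joinpath(dir, f))]

# Demonstration in a temporary directory standing in for "out".
outdir = mktempdir()
touch(joinpath(outdir, "pca.csv"))
missing_outputs(outdir, 9)   # every expected file except "pca.csv"
```

Calling missing_outputs("out", n_clusters) after example.jl finishes should return an empty vector if the run completed.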