EnsembleKQC

An Unsupervised Ensemble Learning Method for Quality Control of Single Cell RNA-seq Sequencing Data

Requirements

Python 3: The base language.
Scikit-learn: For machine learning algorithms.
Numpy: For numerical operations.
Pandas: For data manipulation and analysis.
multiprocessing: For parallel processing (standard library).
argparse: For parsing command-line arguments (standard library).
itertools: For creating iterators for efficient looping (standard library).
time: For measuring execution time (standard library).

Preprocessing

EnsembleKQC uses the following five features to detect low-quality cells:

Actb TPM expression
Gapdh TPM expression
Metabolic process genes' TPM expression
The number of detected genes
Mapping rate

Users need to first extract these values and store them in a CSV file similar to those in the example_data directory before running EnsembleKQC.
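As a rough illustration of what some of these values look like, the sketch below computes three of the five features from a small Genes X Cells matrix with pandas. It is not the code in extractFeatures.py: the function name `sketch_features`, the output column names, and the per-million scaling used as a stand-in for TPM are all assumptions made for this example, and the gene labels must match the casing used in your own matrix (for example `ACTB` for human data).

```python
import pandas as pd


def sketch_features(expr: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only. expr is a Genes x Cells matrix
    (rows = gene names, columns = cell names)."""
    # Scale every cell (column) so that its values sum to one million.
    # Assumption: this per-million scaling stands in for TPM here.
    col_sums = expr.sum(axis=0)
    safe_sums = col_sums.where(col_sums > 0, 1)  # avoid division by zero
    scaled = expr.div(safe_sums, axis=1) * 1e6

    # Number of detected genes: genes with non-zero expression per cell.
    detected = (expr > 0).sum(axis=0)

    # Hypothetical column names; check example_data for the real layout.
    return pd.DataFrame({
        "Actb": scaled.loc["Actb"],
        "Gapdh": scaled.loc["Gapdh"],
        "Detected genes": detected,
    })


# Toy matrix with two cells and three genes.
toy = pd.DataFrame(
    {"cell1": [10, 0, 90], "cell2": [5, 5, 0]},
    index=["Actb", "Gapdh", "GeneX"],
)
toy_features = sketch_features(toy)
```

The result has one row per cell, which matches the idea of a per-cell feature table; the exact columns EnsembleKQC expects should be taken from the files in example_data.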

We also provide a simple script, extractFeatures.py, to extract these features. It takes parameters for the organism of interest and for whether to perform normalization (recommended; skip it only if your counts are already normalized).

# Basic usage with normalization (default)
$ python extractFeatures.py file_name=./example_data/expression_matrix.csv out_file_name=./output_data/features.csv organism=human

Here the expression matrix is a Genes X Cells FPKM or UMI matrix: row names are gene names and column names are cell sample names. Note that this script extracts only the first four features. If the mapping rate of your dataset is available, fill in this feature with the real values; otherwise, add a column called "Mapping rate" to the CSV file and set every value in it to 1.
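When no mapping rate is available, the placeholder column can be added with a few lines of pandas. This is a minimal sketch: the helper name `add_placeholder_mapping_rate` is hypothetical, and the commented file-handling lines assume that the first CSV column holds the cell names. Only the column name "Mapping rate" comes from this README.

```python
import pandas as pd


def add_placeholder_mapping_rate(features: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the feature table with a "Mapping rate" column.

    An existing "Mapping rate" column is left untouched so that real
    values are never overwritten by the placeholder.
    """
    out = features.copy()
    if "Mapping rate" not in out.columns:
        out["Mapping rate"] = 1
    return out


# Possible usage (the path and index_col=0 are assumptions):
# features = pd.read_csv("./output_data/features.csv", index_col=0)
# add_placeholder_mapping_rate(features).to_csv("./output_data/features.csv")

# Small in-memory check with a made-up feature column.
demo = pd.DataFrame({"Detected genes": [1200, 300]}, index=["cell1", "cell2"])
demo_with_rate = add_placeholder_mapping_rate(demo)
```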

Usage

Download all files and run the following command to display the help message:

$ python runEnsembleKQC.py --help

usage: runEnsembleKQC.py [-h] [--input_path INPUT_PATH]
                         [--lower_bound LOWER_BOUND]
                         [--upper_bound UPPER_BOUND] [--labeled LABELED]
                         [--output_path OUTPUT_PATH]

optional arguments:
  -h, --help            show this help message and exit
  --input_path INPUT_PATH
                        path of input data
  --lower_bound LOWER_BOUND
                        lower bound of estimated low-quality cells number
  --upper_bound UPPER_BOUND
                        upper bound of estimated low-quality cells number
  --labeled LABELED     whether the data has quality labels. If true,
                        evaluation information will be printed
  --output_path OUTPUT_PATH
                        path of output data

Example

To simply run EnsembleKQC without any prior knowledge:

# Basic usage with example data
$ python runEnsembleKQC.py --input_path=./example_data/Kolodziejczyk.csv --labeled=true --output_path=./output_data/results.csv

Users can also provide their own estimated range for the number of low-quality cells:

$ python runEnsembleKQC.py --input_path=./example_data/Kolodziejczyk.csv --lower_bound=10 --upper_bound=50 --labeled=true --output_path=./output_data/results.csv
