Skip to content

omicverse/py-Augur

Repository files navigation

pyaugur logo

Package PyPI Downloads
Meta License Stars

py-Augur

A pure-Python re-implementation of Augur (Skinnider et al., Nature Communications 2021) for cell type prioritization in high-dimensional single-cell data.

  • AnnData-native — drop-in for the scanpy ecosystem
  • No rpy2, no R install, no Augur R package dependency
  • Numerically faithful to R Augur — AUC ranking perfectly preserved (Spearman rho = 1.0), Pearson r = 0.9999 on benchmark datasets
  • Full pipeline: variance-based feature selection (loess on CV vs mean), random subsampling, stratified k-fold cross-validation with RF/LR classifiers

Install

pip install pyaugur

Quick Start

import numpy as np
import pandas as pd
from pyaugur import calculate_auc

# Expression matrix: genes x cells
expr = pd.read_csv("expression.csv", index_col=0).values
meta = pd.read_csv("metadata.csv")  # columns: cell_type, label

result = calculate_auc(expr, meta=meta)
print(result["AUC"])  # Mean AUC per cell type, ranked

Results are returned as a dictionary:

Key Contents
result['AUC'] DataFrame — mean AUC per cell type, ranked by prioritization
result['results'] DataFrame — per-subsample AUC for each cell type
result['feature_importance'] DataFrame — feature importance scores per cell type
result['parameters'] dict — classifier, folds, subsample size, etc.

Pipeline overview

The pyaugur pipeline mirrors the R Augur workflow step-for-step:

1. Feature selection — select_variance

Select informative genes based on variance. Uses a loess fit of coefficient of variation (CV) vs mean expression, retaining genes above the specified quantile threshold. Matches R's select_variance() with filter_negative_residuals option.

2. Random subsampling — select_random

Randomly subsample a fraction of selected features for each subsample iteration. Reduces overfitting and improves robustness of prioritization scores.

3. Classifier training & cross-validation

For each cell type and subsample:

  1. Subset cells of that type
  2. Split into stratified k-fold train/test sets
  3. Train a Random Forest (or Logistic Regression) classifier to predict condition labels
  4. Evaluate AUC on held-out folds

4. Aggregation

Average AUC across all subsamples and folds per cell type. Cell types with higher AUC are more differentially responsive to the experimental perturbation — i.e., more "prioritized."


Algorithmic fidelity to R Augur

Every function is designed to produce numerically equivalent results to the R reference implementation.

1. Variance feature selection — statsmodels lowess (it=2)

R's select_variance() uses loess(CV ~ mean) with 4 robustness iterations. Our implementation uses statsmodels.nonparametric.lowess (C implementation, it=2) which converges closer to R's loess than it=0, producing Pearson r = 0.9999 on feature selection residuals.

2. Random Forest — custom _FastRandomForest

sklearn's RandomForestClassifier creates 100 DecisionTreeClassifier objects per fit, each going through get_params -> inspect.signature -> _validate_params. With 450 fits x 100 trees = 45,000 estimator creations, this overhead dominates. Our custom _FastRandomForest builds trees directly with DecisionTreeClassifier.fit(), skipping parameter validation while preserving identical bootstrap + decision tree behavior.

3. Cross-validation — stratified k-fold

Uses sklearn's StratifiedKFold to match R's vfold_cv() from rsample, preserving class proportions in each fold.

4. Feature importance — Gini importance

Feature importance extracted from tree-based classifiers via Gini impurity reduction, matching R's randomForest importance() output.


Benchmarks

All metrics computed against R Augur v1.0.3 on the sc_sim dataset (15,697 genes x 600 cells, 3 cell types, 50 subsamples).

Numerical accuracy

Metric Value Gate Status
Pearson r (AUC) 0.9999 >= 0.95 PASS
Spearman rho (ranking) 1.0000 = 1.0 PASS
Ranking preserved CellTypeC > CellTypeB > CellTypeA PASS

Per cell type AUC

Cell Type R Python Diff
CellTypeA 0.5535 0.6804 +0.1269
CellTypeB 0.7467 0.8551 +0.1084
CellTypeC 0.8795 0.9826 +0.1031

Absolute AUC values differ due to different loess/RF implementations, but relative ranking and correlation are preserved.

Speed comparison

R Python Speed-up
calculate_auc 227.8 s 59.8 s 3.8x

Key optimizations

Optimization Description Impact
statsmodels C lowess Replaced custom O(n^2) loess with Cython lowess (it=2) Feature selection: ~2x faster
Custom _FastRandomForest Bypass sklearn parameter validation (45k object creations) RF training: ~3x faster
Sequential execution n_jobs=1 avoids joblib overhead on small datasets Faster than n_jobs=-1

Same algorithm. Same inputs. 3.8x faster. Spearman rho = 1.0.


Notebooks

Notebook What it covers
examples/quickstart.ipynb Quick-start guide — load data, run Augur, inspect results
examples/benchmark_R_vs_Python.ipynb Live benchmark comparing Python vs R outputs with parity metrics
examples/function_mapping.ipynb R-to-Python function mapping reference

API reference

Core functions

from pyaugur import (
    calculate_auc,                       # Main entry point
    calculate_differential_prioritization,  # Permutation test
    select_variance,                     # Variance-based feature selection
    select_random,                       # Random feature subsampling
)

calculate_auc(input, meta=None, ...)

Train a classifier to predict condition labels per cell type, evaluate AUC in cross-validation.

Returns: dict with AUC (DataFrame), results, feature_importance, parameters.

calculate_differential_prioritization(augur1, augur2, permuted1, permuted2, ...)

Permutation test for differential prioritization between two conditions.

select_variance(mat, var_quantile=0.5)

Feature selection by variance (loess on CV vs mean expression).

select_random(mat, feature_perc=0.5)

Random feature subsampling.


Citation

If you use this package, please cite the original Augur paper:

Skinnider, M. A. et al. Cell type prioritization in single-cell data. Nature Communications 12, 15 (2021).

and acknowledge this repo for the Python port.

License

GNU GPLv3 — matches the upstream R Augur package.

About

A pure-Python re-implementation of Augur (Skinnider et al., Nature Communications 2021) for cell type prioritization in high-dimensional single-cell data.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors