A pure-Python re-implementation of Augur (Skinnider et al., Nature Communications 2021) for cell type prioritization in high-dimensional single-cell data.
- AnnData-native — drop-in for the scanpy ecosystem
- No
rpy2, no R install, no Augur R package dependency - Numerically faithful to R Augur — AUC ranking perfectly preserved (Spearman rho = 1.0), Pearson r = 0.9999 on benchmark datasets
- Full pipeline: variance-based feature selection (loess on CV vs mean), random subsampling, stratified k-fold cross-validation with RF/LR classifiers
pip install pyaugurimport numpy as np
import pandas as pd
from pyaugur import calculate_auc
# Expression matrix: genes x cells
expr = pd.read_csv("expression.csv", index_col=0).values
meta = pd.read_csv("metadata.csv") # columns: cell_type, label
result = calculate_auc(expr, meta=meta)
print(result["AUC"]) # Mean AUC per cell type, rankedResults are returned as a dictionary:
| Key | Contents |
|---|---|
result['AUC'] |
DataFrame — mean AUC per cell type, ranked by prioritization |
result['results'] |
DataFrame — per-subsample AUC for each cell type |
result['feature_importance'] |
DataFrame — feature importance scores per cell type |
result['parameters'] |
dict — classifier, folds, subsample size, etc. |
The pyaugur pipeline mirrors the R Augur workflow step-for-step:
Select informative genes based on variance. Uses a loess fit of coefficient of variation (CV) vs mean expression, retaining genes above the specified quantile threshold. Matches R's select_variance() with filter_negative_residuals option.
Randomly subsample a fraction of selected features for each subsample iteration. Reduces overfitting and improves robustness of prioritization scores.
For each cell type and subsample:
- Subset cells of that type
- Split into stratified k-fold train/test sets
- Train a Random Forest (or Logistic Regression) classifier to predict condition labels
- Evaluate AUC on held-out folds
Average AUC across all subsamples and folds per cell type. Cell types with higher AUC are more differentially responsive to the experimental perturbation — i.e., more "prioritized."
Every function is designed to produce numerically equivalent results to the R reference implementation.
R's select_variance() uses loess(CV ~ mean) with 4 robustness iterations. Our implementation uses statsmodels.nonparametric.lowess (C implementation, it=2) which converges closer to R's loess than it=0, producing Pearson r = 0.9999 on feature selection residuals.
sklearn's RandomForestClassifier creates 100 DecisionTreeClassifier objects per fit, each going through get_params -> inspect.signature -> _validate_params. With 450 fits x 100 trees = 45,000 estimator creations, this overhead dominates. Our custom _FastRandomForest builds trees directly with DecisionTreeClassifier.fit(), skipping parameter validation while preserving identical bootstrap + decision tree behavior.
Uses sklearn's StratifiedKFold to match R's vfold_cv() from rsample, preserving class proportions in each fold.
Feature importance extracted from tree-based classifiers via Gini impurity reduction, matching R's randomForest importance() output.
All metrics computed against R Augur v1.0.3 on the sc_sim dataset (15,697 genes x 600 cells, 3 cell types, 50 subsamples).
| Metric | Value | Gate | Status |
|---|---|---|---|
| Pearson r (AUC) | 0.9999 | >= 0.95 | PASS |
| Spearman rho (ranking) | 1.0000 | = 1.0 | PASS |
| Ranking preserved | CellTypeC > CellTypeB > CellTypeA | — | PASS |
| Cell Type | R | Python | Diff |
|---|---|---|---|
| CellTypeA | 0.5535 | 0.6804 | +0.1269 |
| CellTypeB | 0.7467 | 0.8551 | +0.1084 |
| CellTypeC | 0.8795 | 0.9826 | +0.1031 |
Absolute AUC values differ due to different loess/RF implementations, but relative ranking and correlation are preserved.
| R | Python | Speed-up | |
|---|---|---|---|
calculate_auc |
227.8 s | 59.8 s | 3.8x |
| Optimization | Description | Impact |
|---|---|---|
| statsmodels C lowess | Replaced custom O(n^2) loess with Cython lowess (it=2) | Feature selection: ~2x faster |
| Custom _FastRandomForest | Bypass sklearn parameter validation (45k object creations) | RF training: ~3x faster |
| Sequential execution | n_jobs=1 avoids joblib overhead on small datasets | Faster than n_jobs=-1 |
Same algorithm. Same inputs. 3.8x faster. Spearman rho = 1.0.
| Notebook | What it covers |
|---|---|
examples/quickstart.ipynb |
Quick-start guide — load data, run Augur, inspect results |
examples/benchmark_R_vs_Python.ipynb |
Live benchmark comparing Python vs R outputs with parity metrics |
examples/function_mapping.ipynb |
R-to-Python function mapping reference |
from pyaugur import (
calculate_auc, # Main entry point
calculate_differential_prioritization, # Permutation test
select_variance, # Variance-based feature selection
select_random, # Random feature subsampling
)Train a classifier to predict condition labels per cell type, evaluate AUC in cross-validation.
Returns: dict with AUC (DataFrame), results, feature_importance, parameters.
Permutation test for differential prioritization between two conditions.
Feature selection by variance (loess on CV vs mean expression).
Random feature subsampling.
If you use this package, please cite the original Augur paper:
Skinnider, M. A. et al. Cell type prioritization in single-cell data. Nature Communications 12, 15 (2021).
and acknowledge this repo for the Python port.
GNU GPLv3 — matches the upstream R Augur package.
