
Aurora-GLM

Aurora-GLM is a modular, extensible, and high-performance Python framework for statistical modeling, focusing on Generalized Linear Models (GLM), Generalized Additive Models (GAM), and Generalized Additive Mixed Models (GAMM).

  • Up to 141× faster than NumPy with PyTorch CUDA on GPU
  • ✅ Validated against R and statsmodels (max coefficient diff < 1e-11 vs R)
  • 🎯 Modular design - easy to extend with custom distributions and links
  • 🚀 Multi-backend - NumPy, PyTorch, JAX with transparent API


Project Identity

  • Package name: aurora-glm
  • Python import: import aurora
  • Repository: github.com/Matcraft94/Aurora-GLM
  • Author: Lucy E. Arias (@Matcraft94)
  • Version: 0.7.0
  • Status: Phase 5 (85% complete) - Extended distributions, temporal covariance, GPU acceleration
  • Python: 3.10+
  • Tagline: Illuminating complex data with modern generalized linear modeling tools

Vision and Goals

Aurora-GLM aims to be:

  1. Scientifically rigorous: Correct implementations validated against R (mgcv, lme4) and statsmodels
  2. High performance: GPU-accelerated with multi-backend support (NumPy, PyTorch, JAX)
  3. Extensible: Users can add custom distributions, link functions, and algorithms
  4. Modular and functional: Clean design favoring composition over complex inheritance
  5. Production-ready: Comprehensive testing, documentation, and real-world validation

Use Cases

  • Academic research: Ecology, epidemiology, social sciences
  • Pharmaceutical industry: Clinical trials, longitudinal analysis
  • Financial analysis: Credit scoring, risk modeling, insurance claims
  • Machine learning: Statistical foundations with modern tools and GPU acceleration

Performance Highlights

GPU Acceleration

PyTorch CUDA provides exceptional speedups for larger problems:

Problem Size         NumPy CPU   PyTorch CUDA   Speedup
Gaussian n=1,000     42ms        5ms            8×
Gaussian n=5,000     206ms       5ms            39×
Gaussian n=50,000    2.1s        18ms           116×
Poisson n=50,000     4.3s        30ms           141×

Benchmarked on NVIDIA RTX 5070 Ti. See PERFORMANCE.md for details.

Accuracy Validation

Aurora-GLM achieves excellent numerical agreement with reference implementations:

Comparison                    Max Coefficient Difference
vs R glm()                    < 1e-11
vs statsmodels (Gaussian)     < 1e-11
vs statsmodels (Poisson)      < 1e-10
vs statsmodels (Binomial)     < 1e-9
PyTorch vs NumPy              < 4e-7
JAX vs NumPy                  < 1e-15

All backends produce consistent, validated results.

Sparse Matrix Performance

For large GAM/GAMM problems, sparse matrices provide significant benefits:

Problem Size      Dense   Sparse   Speedup   Memory Reduction
n=1,000, k=30     0.79s   0.15s    5.5×      6-8×
n=2,000, k=50     2.72s   0.37s    7.4×      6-8×
n=5,000, k=50     8.04s   1.51s    5.3×      6-8×

Key Features

Statistical Models

  • GLM: 10 distribution families (Gaussian, Poisson, Binomial, Gamma, Beta, Inverse Gaussian, Negative Binomial, Student-t, Tweedie, Quasi-families)
  • GAM: B-splines, natural cubic splines, thin plate splines with GCV/REML smoothing
  • GAMM: Random effects, temporal covariance (AR1, compound symmetry, Toeplitz), PQL for non-Gaussian

Performance Features

  • GPU acceleration: Up to 141× speedup with PyTorch CUDA
  • Multi-backend: NumPy, PyTorch, JAX with transparent API
  • Sparse matrices: 5-8× speedup and 6-8× memory reduction for large GAM/GAMM
  • Optimized algorithms: IRLS, Newton-Raphson, L-BFGS with numerical stability

Scientific Rigor

  • Validated: Against R (glm, mgcv, lme4) and statsmodels
  • Comprehensive tests: 494 tests with extensive coverage across all model families
  • R-style output: Summary tables, diagnostic plots, confidence intervals
  • Formula syntax: R/mgcv-compatible formulas like y ~ s(x1, k=10) + s(x2) + (1|group)

Extensibility

  • Custom distributions: Easy to add new distribution families
  • Custom link functions: Flexible link function framework
  • Custom algorithms: Pluggable optimization backends
  • Multi-backend pattern: Transparent support for NumPy/PyTorch/JAX

Installation

From Source (Development)

# Clone repository
git clone https://github.com/Matcraft94/Aurora-GLM.git
cd Aurora-GLM

# Create virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# Install in development mode
pip install -e .

# Optional: Install PyTorch for GPU acceleration
pip install torch

# Optional: Install JAX for additional backend
pip install jax jaxlib

Quick Start

Basic GLM Example

import numpy as np
from aurora.models.glm import fit_glm

# Generate Poisson count data
np.random.seed(42)
X = np.random.randn(200, 2)
y = np.random.poisson(np.exp(X[:, 0] * 0.5 - 0.3))

# Fit Poisson GLM with log link
result = fit_glm(X, y, family='poisson', link='log')

# Print R-style summary with coefficients, std errors, p-values
print(result.summary())

# Make predictions
X_new = np.random.randn(10, 2)
predictions = result.predict(X_new, type='response')

# Generate diagnostic plots
result.plot_diagnostics()

Output:

================================================================================
                    Generalized Linear Model Results
================================================================================
Family:                  Poisson           Link Function:         Log
No. Observations:            200           Df Residuals:          197
Df Model:                      2           Pseudo R-squared:    0.123
Converged:                   Yes           No. Iterations:          5
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
intercept       -0.2987     0.0712     -4.197      0.000     -0.4382     -0.1593  ***
X0               0.5124     0.0718      7.137      0.000      0.3717      0.6531  ***
X1              -0.0234     0.0706     -0.331      0.741     -0.1617      0.1150
================================================================================
Deviance:          212.54                  Null Deviance:       240.32
AIC:               465.43                  BIC:               475.52
================================================================================
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
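
Under the hood, fit_glm estimates coefficients with IRLS (Iteratively Reweighted Least Squares). For intuition, here is a minimal NumPy sketch of IRLS for the Poisson/log case above; it is illustrative only, not Aurora's internal implementation:

import numpy as np

def irls_poisson(X, y, max_iter=25, tol=1e-10):
    # Minimal IRLS for a Poisson GLM with log link (no intercept handling)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta                  # linear predictor
        mu = np.exp(eta)                # inverse link: mu = exp(eta)
        W = mu                          # IRLS weights; for Poisson/log, W = mu
        z = eta + (y - mu) / mu         # working response
        XtW = X.T * W                   # X'W (columns of X' scaled by weights)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)  # weighted least squares
        if np.max(np.abs(beta_new - beta)) < tol:
            break
        beta = beta_new
    return beta_new

Aurora's actual IRLS adds the numerical stability, multi-backend dispatch, and sparse-matrix support described elsewhere in this README on top of this core loop.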

Logistic Regression for Classification

# Generate binary classification data
X = np.random.randn(300, 2)
probabilities = 1 / (1 + np.exp(-(X @ np.array([1.2, -0.9]) + 0.3)))
y = np.random.binomial(1, probabilities)

# Fit logistic regression
result = fit_glm(X, y, family='binomial', link='logit')

# View summary
print(result.summary())

# Predict probabilities
y_prob = result.predict(X, type='response')

# Classification metrics
from aurora.validation.metrics import accuracy_score, concordance_index

y_pred = (y_prob >= 0.5).astype(int)
print(f"Accuracy: {accuracy_score(y, y_pred):.3f}")
print(f"C-index (AUROC): {concordance_index(y, y_prob):.3f}")

GPU Acceleration

import torch

# PyTorch tensors work transparently - automatically uses GPU if available
X_torch = torch.randn(100, 2).cuda()
y_torch = torch.poisson(torch.exp(X_torch[:, 0] * 0.5))  # already on the GPU

# Same API, GPU backend - up to 141× faster!
result = fit_glm(X_torch, y_torch, family='poisson')
print(result.summary())

Multi-Backend Support

# The same fit_glm() call works with NumPy, PyTorch, and JAX arrays
import jax
import jax.numpy as jnp

# JAX arrays work transparently too
key_x, key_y = jax.random.split(jax.random.PRNGKey(0))
X_jax = jax.random.normal(key_x, (100, 2))
y_jax = jax.random.poisson(key_y, jnp.exp(X_jax[:, 0] * 0.5))

# Same API, JAX backend
result = fit_glm(X_jax, y_jax, family='poisson')
print(result.summary())

Cross-Validation

from aurora.validation.cross_val import cross_val_score, KFold

# Evaluate the Poisson model with 5-fold cross-validation
# (reusing X, y from the Basic GLM Example above)
scores = cross_val_score(
    X, y,
    family='poisson',
    cv=KFold(n_splits=5),
    metrics=['deviance', 'pseudo_r2']
)

print(f"Mean deviance: {scores['deviance'].mean():.2f}")
print(f"Mean pseudo R²: {scores['pseudo_r2'].mean():.3f}")

Case Studies and Examples

Aurora-GLM includes 17 comprehensive case studies covering diverse domains and real-world applications:

Available Case Studies

Finance & Insurance (5 studies):

  • Insurance pricing (Gamma GLM)
  • Medical insurance costs prediction
  • Customer churn prediction (Binomial GLM)
  • Motor claims frequency (Poisson/Negative Binomial)
  • E-commerce conversion optimization (Binomial GAM)

Environmental & Ecological (4 studies):

  • Air quality forecasting (Gaussian GAM)
  • Species distribution modeling (Binomial GAM)
  • Bike sharing demand prediction (GAM)
  • Traffic accident severity (GAMM)

Healthcare & Medicine (5 studies):

  • Sleep study analysis (Random effects GAMM)
  • Clinical trials (Gaussian GLM)
  • Longitudinal clinical data (GAMM with AR1)
  • Psychometric measurement (GAMM with crossed effects)
  • Survival analysis (Cox models)

Education (1 study):

  • Multilevel educational data (Nested GAMM)

Business & Retail (2 studies):

  • Health inspection scoring (Beta GLM)

Each case study is a self-contained Jupyter notebook with:

  • Problem statement and research questions
  • Data exploration and visualization
  • Model selection justification
  • Complete model fitting workflow
  • Results interpretation and diagnostics
  • Key findings and business implications

Getting started with examples:

# Navigate to examples directory
cd examples/06_case_studies

# Start Jupyter and open any notebook
jupyter notebook

# Recommended learning path:
# 1. 01_insurance_pricing.ipynb (Basic GLM)
# 2. 02_air_quality_gam.ipynb (GAM with smooths)
# 3. 04_sleep_study_gamm.ipynb (Random effects)
# 4. 06_clinical_trial_longitudinal.ipynb (Temporal covariance)

See examples/README.md for complete case study guide.

Recent Improvements (v0.7.0)

Code Quality Enhancements

All 5 TODO items completed and fully tested:

  1. Formula Parser Validation - Prevents silent errors in R-style formulas

    • Validates parenthesized terms require | for random effects
    • Rejects empty parentheses
    • Clear error messages for invalid syntax
  2. Sparse EDF Computation - Improved GAM smoothing parameter selection

    • Hybrid exact/approximate approach
    • Exact for small basis (n_basis ≤ 100), fast approximation for large
    • More accurate GCV scores
  3. GAMM Log-Likelihood & Model Comparison - Enable AIC/BIC selection

    • Proper log-likelihood computation for Gaussian GAMM
    • AIC and BIC metrics for model comparison
    • Critical for statistical inference
  4. Memory-Mapped Streaming - Handle datasets larger than RAM (sketched after this list)

    • True streaming for .npy files via memory mapping
    • Only chunks loaded into memory
    • Maintains full backward compatibility
  5. Data Loading Efficiency - Production-ready ChunkedDataLoader

    • Shuffling works correctly with streaming
    • X and y remain aligned
    • Fallback support for other formats (.npz, .csv)

All implementations manually tested and verified for correctness.
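
As a rough illustration of the memory-mapped streaming idea from item 4 (a minimal sketch using NumPy's mmap_mode, not the actual ChunkedDataLoader API):

import numpy as np

def iter_chunks(path, chunk_size=10_000):
    # mmap_mode='r' maps the .npy file without reading it into RAM;
    # slicing then materializes only one chunk at a time
    arr = np.load(path, mmap_mode='r')
    for start in range(0, arr.shape[0], chunk_size):
        yield np.asarray(arr[start:start + chunk_size])

# for X_chunk in iter_chunks('X.npy'):
#     partial_fit(X_chunk)  # hypothetical downstream consumer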

GAM API (Phase 3 - AVAILABLE NOW!)

Basic Univariate GAM

from aurora.models.gam import fit_gam
import numpy as np

# Generate noisy data with non-linear relationship
np.random.seed(42)
x = np.linspace(0, 1, 100)
y_true = np.sin(2 * np.pi * x)
y = y_true + 0.1 * np.random.randn(100)

# Fit GAM with automatic smoothing parameter selection (GCV)
result = fit_gam(x, y, n_basis=12, basis_type='bspline')

# Print model summary
print(result.summary())

# Make predictions at new points
x_new = np.linspace(0, 1, 200)
y_pred = result.predict(x_new)

Output:

============================================================
Generalized Additive Model (GAM) - Fitted Summary
============================================================

Model Information:
  Basis type:          BSplineBasis
  Number of basis:     12
  Observations:        100

Smoothing:
  Lambda:              8.806605e-02
  Effective DoF:       9.69
  GCV score:           3.662963e-02

Fit Statistics:
  Residual sum sq:     2.9872
  R-squared:           0.9421
  Residual std:        0.1727

Residuals:
  Min:                 -0.7174
  Q1:                  -0.0823
  Median:              0.0123
  Q3:                  0.0957
  Max:                 0.6197
============================================================

Multivariate Additive GAM

from aurora.models.gam import fit_additive_gam, SmoothTerm, ParametricTerm
import numpy as np

# Generate data with multiple nonlinear relationships
np.random.seed(42)
n = 200
X = np.random.randn(n, 3)
y = (np.sin(2 * X[:, 0]) +         # Nonlinear effect of x1
     np.cos(X[:, 1]) +              # Nonlinear effect of x2
     2 * X[:, 2] +                  # Linear effect of x3
     0.1 * np.random.randn(n))

# Fit additive GAM: y = intercept + f1(x1) + f2(x2) + β*x3
result = fit_additive_gam(
    X, y,
    smooth_terms=[
        SmoothTerm(variable=0, n_basis=12, basis_type='bspline'),
        SmoothTerm(variable=1, n_basis=12, basis_type='cubic')
    ],
    parametric_terms=[
        ParametricTerm(variable=2)  # Linear term
    ]
)

# Print comprehensive summary
print(result.summary())

# Make predictions
X_new = np.random.randn(50, 3)
y_pred = result.predict(X_new)

Formula-Based API (R-style)

from aurora.models.gam import fit_gam_formula
import numpy as np
import pandas as pd

# Build an example data frame with predictors x1, x2, x3 and response y
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(size=(200, 3)), columns=['x1', 'x2', 'x3'])
df['y'] = np.sin(2 * np.pi * df['x1']) + df['x2'] ** 2 + 0.5 * df['x3'] + rng.normal(0, 0.1, size=200)

# R-style formula with smooth terms
result = fit_gam_formula(
    formula="y ~ s(x1, k=10) + s(x2, bs='cubic') + x3",
    data=df,
    method='REML'  # or 'GCV'
)

# Tensor product interactions
result = fit_gam_formula(
    formula="y ~ te(x1, x2) + s(x3)",
    data=df
)

# Print comprehensive summary
print(result.summary())

# Visualize all smooth terms
from aurora.models.gam import plot_all_smooths
plot_all_smooths(result)

Visualizing Smooth Terms

from aurora.models.gam import plot_smooth, plot_all_smooths

# Plot single smooth term with confidence bands
fig = plot_smooth(
    result,
    term='s(0)',                    # or term=0 for first variable
    confidence_level=0.95,
    partial_residuals=True
)

# Plot all smooth terms in a grid
fig = plot_all_smooths(
    result,
    ncols=2,
    confidence_level=0.95
)

GAMM API (Phase 4 - AVAILABLE NOW!)

Basic Random Intercept Model

from aurora.models.gamm import fit_gamm, RandomEffect
import numpy as np

# Generate longitudinal data
np.random.seed(42)
n_subjects, n_per_subject = 10, 15
n = n_subjects * n_per_subject
subject_id = np.repeat(np.arange(n_subjects), n_per_subject)
time = np.tile(np.arange(n_per_subject), n_subjects)

# Design matrix
X = np.column_stack([np.ones(n), time])

# Random intercepts (subject-specific baseline)
b_subj = np.random.randn(n_subjects) * 0.8
y = 2.0 + 0.5 * time + b_subj[subject_id] + np.random.randn(n) * 0.3

# Fit GAMM with random intercept
re = RandomEffect(grouping='subject')
result = fit_gamm(
    y=y,
    X=X,
    random_effects=[re],
    groups_data={'subject': subject_id},
    covariance='identity'
)

# View results
print(f"Fixed effects (β): {result.beta_parametric}")
print(f"Variance components (Ψ): {result.variance_components}")
print(f"Residual variance (σ²): {result.residual_variance}")
print(f"AIC: {result.aic:.2f}, BIC: {result.bic:.2f}")

Random Intercept + Slope Model

# Random intercept + slope on 'time' (variable index 1)
re_slope = RandomEffect(
    grouping='subject',
    variables=(1,),  # Random slope for 'time'
    include_intercept=True
)

# Fit model with unstructured covariance (allows correlation)
result = fit_gamm(
    y=y,
    X=X,
    random_effects=[re_slope],
    groups_data={'subject': subject_id},
    covariance='unstructured'
)

# Extract variance-covariance matrix
psi = result.variance_components
print(f"Var(intercept): {psi[0, 0]:.3f}")
print(f"Cov(intercept, slope): {psi[0, 1]:.3f}")
print(f"Var(slope): {psi[1, 1]:.3f}")

# Correlation between random intercept and slope
corr = psi[0, 1] / np.sqrt(psi[0, 0] * psi[1, 1])
print(f"Correlation: {corr:.3f}")

Temporal Correlation with AR1

For longitudinal data with temporal autocorrelation:

from aurora.models.gamm import fit_gamm, RandomEffect

# AR1 covariance for temporally correlated observations
re_ar1 = RandomEffect(grouping='subject', covariance='ar1')

result = fit_gamm(
    y=y,
    X=X,
    random_effects=[re_ar1],
    groups_data={'subject': subject_id}
)

# Extract AR1 parameters
params = result.covariance_params[0]
sigma2 = np.exp(params[0])  # Variance
rho = np.tanh(params[1])    # Autocorrelation
print(f"AR1: σ² = {sigma2:.3f}, ρ = {rho:.3f}")

# Interpretation: observations at lag k have correlation ρ^k
# Example: ρ = 0.7 means adjacent observations have correlation 0.7,
# observations 2 time units apart have correlation 0.49, etc.
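
For intuition about what covariance='ar1' implies, the within-subject correlation matrix can be built directly (a small NumPy sketch, independent of Aurora's internals):

import numpy as np

def ar1_corr(m, rho):
    # Entry (i, j) is rho^|i-j|: correlation decays geometrically with lag
    lags = np.abs(np.subtract.outer(np.arange(m), np.arange(m)))
    return rho ** lags

print(ar1_corr(4, 0.7))
# [[1.    0.7   0.49  0.343]
#  [0.7   1.    0.7   0.49 ]
#  [0.49  0.7   1.    0.7  ]
#  [0.343 0.49  0.7   1.   ]]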

Compound Symmetry (Exchangeable Correlation)

For clustered data with equal within-cluster correlation:

# Compound symmetry: all pairs equally correlated
re_cs = RandomEffect(grouping='cluster', covariance='compound_symmetry')

result = fit_gamm(
    y=y,
    X=X,
    random_effects=[re_cs],
    groups_data={'cluster': cluster_id}
)

# Extract variance parameter
params = result.covariance_params[0]
sigma2 = np.exp(params[0])

# For compound symmetry, all within-cluster pairs have the same correlation
psi = result.variance_components[0]
print(f"Estimated variance: σ² = {sigma2:.3f}")
print("Use compound symmetry when cluster members are exchangeable")
print("(e.g., students in same school, patients in same hospital)")

Sparse Matrix Support for Large-Scale Models

For large datasets or models with many basis functions:

# Sparse GAMM (10-100× faster for large problems)
result = fit_gamm(
    formula="y ~ s(x, k=50) + (1 | subject)",
    data={"y": y, "x": x, "subject": subject_id},
    use_sparse=True  # Enable sparse matrices
)

# Benefits of sparse matrices:
# - Memory: 6-8× reduction for typical problems
# - Speed: 10-100× faster for n > 1000
# - Enables models that don't fit in memory with dense matrices

# When to use sparse:
# - Large datasets (n > 500, k > 20)
# - B-spline basis functions (naturally sparse)
# - Memory-constrained environments
# - Multiple smooth terms

Non-Gaussian GAMM (Poisson/Binomial with PQL)

from aurora.models.gamm import fit_pql_smooth

# Poisson GAMM for count data
result = fit_pql_smooth(
    y=counts,
    X=X,
    groups=subject_id,
    family='poisson',
    smooth_terms=[SmoothTerm(variable=0, n_basis=10)],
    max_iter=50
)

# Binomial GAMM for binary outcomes
result = fit_pql_smooth(
    y=binary_outcome,
    X=X,
    groups=subject_id,
    family='binomial',
    smooth_terms=[SmoothTerm(variable=0, n_basis=8)],
    max_iter=100
)

print(f"Fixed effects: {result.beta}")
print(f"Random effects variance: {result.variance_components}")

Current Implementation Status

Phase 1: Core Numerical Foundation - COMPLETED ✅

Backend Infrastructure:

  • ✅ Backend abstraction layer (JAX, PyTorch)
  • ✅ Type system with comprehensive Protocols
  • ✅ Array namespace utilities for transparent NumPy/PyTorch/JAX compatibility

Optimization Algorithms:

  • ✅ Newton-Raphson with automatic Hessian
  • ✅ IRLS (Iteratively Reweighted Least Squares) with sparse matrix support
  • ✅ L-BFGS with strong Wolfe line search
  • ✅ Autodiff module (gradient, hessian, jacobian) for all backends

Distribution Families (10/10):

  • ✅ Gaussian, Poisson, Binomial, Gamma
  • ✅ Beta, Inverse Gaussian, Negative Binomial
  • ✅ Student-t, Tweedie, Quasi-families

Link Functions (6/6):

  • ✅ Identity, Log, Logit, Inverse, CLogLog, Probit

Phase 2: Basic GLM - COMPLETED ✅

  • ✅ IRLS-based fit_glm() with multi-backend support
  • ✅ R-style GLMResult.summary() with p-values, significance codes
  • ✅ Diagnostic plots (residuals, Q-Q, scale-location, leverage)
  • ✅ Confidence intervals, hypothesis tests, influence measures
  • ✅ Cross-validation and comprehensive metrics
  • ✅ Validation against statsmodels and R glm()

Phase 3: GAM (Splines and Smoothing) - COMPLETED ✅

  • ✅ B-spline, natural cubic spline, thin plate spline bases
  • ✅ GCV and REML smoothing parameter selection
  • ✅ R-style formula parser (y ~ s(x1) + s(x2) + x3)
  • ✅ Tensor product smooths (te(x1, x2))
  • ✅ Visualization with confidence bands
  • ✅ Sparse matrix support for large problems

Phase 4: GAMM (Random Effects) - COMPLETED ✅

  • ✅ Random intercepts and slopes
  • ✅ Nested and crossed random effects
  • ✅ Multiple covariance structures (identity, unstructured, diagonal)
  • ✅ AR1 and compound symmetry for temporal data
  • ✅ Toeplitz covariance structure
  • ✅ PQL estimation for non-Gaussian families
  • ✅ Laplace approximation
  • ✅ Sparse matrix optimization

Phase 5: Extended Features - IN PROGRESS 🚧 (85%)

Implemented:

  • ✅ Extended distributions (Beta, Inverse Gaussian, Negative Binomial, Student-t, Tweedie)
  • ✅ Temporal covariance structures (AR1, compound symmetry, Toeplitz)
  • ✅ Sparse matrix support (6-8× memory reduction, 10-100× speedup)
  • ✅ GPU acceleration benchmarks (up to 141× speedup)
  • ✅ Multi-backend accuracy validation
  • ✅ Comprehensive benchmark suite

Remaining:

  • 📋 PyPI package publication
  • 📋 Documentation website
  • 📋 Additional spatial covariance structures
  • 📋 Performance optimizations for massive datasets

Complete Feature Set

Distribution Families (10/10)

  • Gaussian (Normal) - continuous data
  • Poisson - count data
  • Binomial - binary/proportions
  • Gamma - positive continuous, right-skewed
  • Beta - proportions in (0,1)
  • Inverse Gaussian - positive durations (Wald distribution)
  • Negative Binomial - overdispersed counts (NB2 parameterization)
  • Student-t - heavy-tailed, robust regression
  • Tweedie - compound Poisson-Gamma (insurance/actuarial)
  • Quasi-families - Quasi-Poisson, Quasi-Binomial
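
The families above differ chiefly in their variance functions V(μ), which determine the IRLS weights; a compact reference (standard GLM theory, not Aurora-specific code):

import numpy as np

# V(mu) for the main families; theta is the Negative Binomial size
# parameter (NB2), p the Tweedie power (1 < p < 2 gives compound
# Poisson-Gamma); Beta and quasi-families add dispersion terms.
variance_functions = {
    'gaussian':         lambda mu: np.ones_like(mu),
    'poisson':          lambda mu: mu,
    'binomial':         lambda mu: mu * (1 - mu),
    'gamma':            lambda mu: mu ** 2,
    'inverse_gaussian': lambda mu: mu ** 3,
    'neg_binomial':     lambda mu, theta=1.0: mu + mu ** 2 / theta,
    'tweedie':          lambda mu, p=1.5: mu ** p,
}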

Link Functions (6/6)

  • Identity: g(μ) = μ
  • Log: g(μ) = log(μ)
  • Logit: g(μ) = log(μ/(1-μ))
  • Probit: g(μ) = Φ⁻¹(μ)
  • Inverse: g(μ) = 1/μ
  • CLogLog: g(μ) = log(-log(1-μ))
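
As a quick numerical check of the link/inverse pairs above (plain NumPy/SciPy, independent of Aurora's API):

import numpy as np
from scipy.stats import norm

mu = np.array([0.2, 0.5, 0.8])

# Logit: eta = log(mu/(1-mu)); inverse is the logistic function
eta = np.log(mu / (1 - mu))
assert np.allclose(1 / (1 + np.exp(-eta)), mu)

# Probit: eta = Phi^{-1}(mu); inverse is the standard normal CDF
assert np.allclose(norm.cdf(norm.ppf(mu)), mu)

# CLogLog: eta = log(-log(1-mu)); inverse is 1 - exp(-exp(eta))
eta = np.log(-np.log(1 - mu))
assert np.allclose(1 - np.exp(-np.exp(eta)), mu)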

GAM Features

  • B-spline basis: Cox-de Boor recursion, local support, partition of unity
  • Natural cubic splines: Truncated power basis with analytical penalties
  • Thin plate splines: Multidimensional smoothing with radial basis functions
  • Tensor products: te(x1, x2) for multidimensional interactions
  • GCV smoothing: Generalized Cross-Validation for automatic λ selection (see the sketch after this list)
  • REML smoothing: Restricted Maximum Likelihood for better multi-term selection
  • Sparse matrices: 5-8× speedup and 6-8× memory reduction
  • Formula parser: R/mgcv-compatible syntax
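
For reference, GCV selects λ by minimizing GCV(λ) = n·RSS(λ) / (n − tr(H_λ))², where H_λ is the hat matrix of the penalized fit. A minimal NumPy sketch of the criterion (standard formula, not Aurora's internal code):

import numpy as np

def gcv_score(y, B, P, lam):
    # Penalized least squares: minimize ||y - B c||^2 + lam * c' P c
    # Hat matrix: H = B (B'B + lam P)^{-1} B'
    n = len(y)
    H = B @ np.linalg.solve(B.T @ B + lam * P, B.T)
    rss = np.sum((y - H @ y) ** 2)
    edf = np.trace(H)  # effective degrees of freedom (EDF)
    return n * rss / (n - edf) ** 2

# Select lambda by minimizing GCV over a log-spaced grid:
# lam_best = min(np.logspace(-4, 2, 50), key=lambda l: gcv_score(y, B, P, l))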

GAMM Features

  • Random effects: Intercepts and slopes with flexible specification
  • Nested/crossed effects: Complex hierarchies fully supported
  • Covariance structures:
    • Identity (diagonal)
    • Unstructured (full variance-covariance)
    • Diagonal (heterogeneous variances)
    • AR1 (temporal autocorrelation with exponential decay)
    • Compound symmetry (exchangeable correlation)
    • Toeplitz (banded temporal correlations)
  • PQL estimation: Penalized Quasi-Likelihood for non-Gaussian families
  • Laplace approximation: Alternative estimation method
  • Sparse matrices: Memory-efficient for large-scale models
  • Formula syntax: lme4-style (1 + x | group) supported

Inference & Diagnostics

  • Standard errors: Wald approximation with delta method
  • P-values: Z-tests and likelihood ratio tests
  • Confidence intervals: For coefficients and predictions (Wald form sketched after this list)
  • Residuals: Response, Pearson, deviance, working, studentized
  • Influence measures: Leverage, Cook's distance, DFBETAs
  • Hypothesis tests: Wald tests for single and multi-constraint hypotheses
  • Model comparison: AIC, BIC, pseudo R², deviance
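
For reference, the confidence intervals in the coefficient tables are Wald intervals, β̂ ± z₁₋α/₂ · SE(β̂); a generic sketch (NumPy/SciPy, not Aurora-specific API):

import numpy as np
from scipy.stats import norm

def wald_ci(beta_hat, se, level=0.95):
    z = norm.ppf(0.5 + level / 2)  # e.g. 1.96 for a 95% interval
    return beta_hat - z * se, beta_hat + z * se

# Reproduces the [0.025, 0.975] columns of the GLM summary above:
lo, hi = wald_ci(0.5124, 0.0718)
print(f"{lo:.4f}, {hi:.4f}")  # 0.3717, 0.6531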

Validation

  • Cross-validation: KFold, StratifiedKFold with aggregated results
  • Metrics: MSE, MAE, RMSE, accuracy, log-loss, Brier score, C-index
  • Diagnostic plots: 4-panel residual plots (residuals, Q-Q, scale-location, leverage)
  • Q-Q plots: Normal probability plots for residuals
  • Caterpillar plots: Random effects visualization with confidence intervals

Benchmarking and Validation

Run comprehensive benchmarks comparing Aurora with statsmodels and R:

# Comprehensive benchmarks (accuracy + GPU performance + R comparison)
cd /tmp && PYTHONPATH=/path/to/Aurora-GLM python /path/to/Aurora-GLM/benchmarks/comprehensive_benchmarks.py

# Quick benchmarks (CI-friendly)
cd /tmp && PYTHONPATH=/path/to/Aurora-GLM python /path/to/Aurora-GLM/benchmarks/comprehensive_benchmarks.py --quick

# Performance benchmarks only
PYTHONPATH=. python benchmarks/performance_benchmarks.py

# GLM validation against statsmodels
PYTHONPATH=. python benchmarks/run_glm_checks.py --replicates 3

Note: Run from /tmp or another directory without an renv project to avoid R library path conflicts when using rpy2.

Results saved to benchmarks/results/. See PERFORMANCE.md for detailed analysis.

External Validation

Latest validation results:

  • vs R glm(): max |Δcoef| < 1e-11
  • vs statsmodels: max |Δcoef| < 2e-6
  • Multi-backend: JAX vs NumPy < 1e-15
  • PyTorch vs NumPy < 4e-7

All backends produce consistent, validated results suitable for scientific research.

Development

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=aurora --cov-report=html

# Run specific module
pytest tests/test_models/test_glm_fitting.py

# Run specific test
pytest tests/test_distributions/test_links.py::test_identity_link_roundtrip

# Verbose output
pytest -v

Test Status: 494 tests collected (comprehensive coverage across GLM, GAM, GAMM, Bayesian inference, count models, and smoothing methods)

Code Quality

# Format code
ruff format aurora/ tests/

# Lint code
ruff check aurora/ tests/

# Type checking
mypy aurora/

# Run all quality checks before committing
ruff format aurora/ tests/ && ruff check aurora/ tests/ && mypy aurora/

Development Workflow

# Create feature branch
git checkout -b feature/implement-new-distribution

# Make changes and test
pytest
pytest --cov=aurora --cov-report=html

# Format and lint
ruff format aurora/ tests/
ruff check aurora/ tests/
mypy aurora/

# Commit with descriptive message
git commit -m "feat(distributions): add Exponential family"

# Push and create PR
git push origin feature/implement-new-distribution

Design Principles

Array Namespace Pattern

Aurora-GLM uses a namespace abstraction to support multiple array libraries transparently:

from aurora.distributions._utils import namespace, as_namespace_array

def my_function(x, y):
    # Automatically detect NumPy, PyTorch, or JAX
    xp = namespace(x, y)

    # Convert to appropriate array type
    x_arr = as_namespace_array(x, xp, like=y)

    # Use namespace-specific operations
    return xp.sum(x_arr * xp.exp(y))

This pattern ensures:

  • Zero code changes between backends
  • Automatic GPU support with PyTorch/JAX
  • Consistent results across all backends (JAX vs NumPy validated < 1e-15; PyTorch vs NumPy < 4e-7)
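
For example, my_function above then runs unchanged on either backend (a usage sketch, assuming namespace() dispatches as described):

import numpy as np
import torch

# NumPy inputs -> NumPy scalar result
print(my_function(np.ones(3), np.zeros(3)))        # 3.0

# PyTorch inputs -> PyTorch result (on GPU if the tensors are)
print(my_function(torch.ones(3), torch.zeros(3)))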

Extensibility

Users can create custom distributions and link functions:

from aurora.distributions.base import Family, LinkFunction
from aurora.distributions._utils import namespace, ensure_positive

class MyDistribution(Family):
    def log_likelihood(self, y, mu, **params):
        xp = namespace(y, mu)
        mu = ensure_positive(mu, xp)
        # Implementation using namespace pattern
        # ...

    def deviance(self, y, mu, **params):
        # ...

    def variance(self, mu, **params):
        # V(μ) = variance function
        # ...

    def initialize(self, y):
        # Return initial μ estimates
        # ...

    @property
    def default_link(self):
        return MyLink()

class MyLink(LinkFunction):
    def link(self, mu):
        # η = g(μ)
        # ...

    def inverse(self, eta):
        # μ = g⁻¹(η)
        # ...

    def derivative(self, mu):
        # dη/dμ = g'(μ)
        # ...
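
As a concrete, hypothetical instance of the template above, a log link might look like this (assuming namespace() also accepts a single array):

from aurora.distributions.base import LinkFunction
from aurora.distributions._utils import namespace

class MyLogLink(LinkFunction):
    def link(self, mu):
        xp = namespace(mu)
        return xp.log(mu)        # eta = g(mu) = log(mu)

    def inverse(self, eta):
        xp = namespace(eta)
        return xp.exp(eta)       # mu = g^{-1}(eta) = exp(eta)

    def derivative(self, mu):
        return 1.0 / mu          # g'(mu) = 1/mu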

When to Use Aurora-GLM

Choose Aurora-GLM when you need:

  1. GPU acceleration - Up to 141× speedup for large datasets
  2. GAM/GAMM capabilities - statsmodels has only limited GAM support and no GAMM
  3. Multi-backend flexibility - PyTorch/JAX for GPU or autodiff
  4. Advanced covariance structures - AR1, compound symmetry, Toeplitz for temporal/spatial data
  5. Sparse matrix optimization - Handle models that don't fit in memory
  6. Consistent API - GLM → GAM → GAMM with same interface
  7. Research/teaching - Readable Python implementation with R-style output
  8. Extensibility - Easy to add custom distributions and link functions

Choose statsmodels when you need:

  1. Pure CPU GLM with maximum single-threaded speed
  2. Extensive diagnostics and inference tools
  3. Time series models (ARIMA, VAR, state space models)
  4. Mature ecosystem with extensive documentation

Project Structure

aurora/
├── core/
│   ├── backends/         # Multi-backend support (JAX, PyTorch)
│   │   ├── _protocol.py  # Backend Protocol interface (internal)
│   │   ├── _registry.py  # Backend registration (internal)
│   │   └── operations.py # Backend-agnostic numerical operations
│   ├── autodiff/         # Automatic differentiation (gradient, hessian, jacobian)
│   ├── optimization/     # IRLS, Newton-Raphson, L-BFGS
│   └── linalg/           # Linear algebra primitives
├── distributions/
│   ├── families/         # 10 distribution families
│   ├── links/            # 6 link functions
│   └── _utils.py         # namespace(), as_namespace_array(), ensure_positive()
├── models/
│   ├── glm/              # fit_glm(), predictions, diagnostics
│   ├── gam/              # fit_gam(), formula parser, smooths
│   └── gamm/             # fit_gamm(), random effects, PQL, covariance structures
├── smoothing/
│   ├── splines/          # B-splines, cubic, thin plate
│   ├── penalties/        # Difference penalties, ridge
│   └── selection/        # GCV, REML smoothing selection
├── inference/
│   ├── hypothesis/       # Wald tests, likelihood ratio tests
│   ├── intervals/        # Confidence intervals
│   └── diagnostics/      # Residuals, influence measures
└── validation/
    ├── metrics/          # MSE, MAE, accuracy, C-index
    └── cross_val/        # KFold, cross_val_score

Roadmap

Completed ✅

  • Phase 1: Core infrastructure (backends, types, optimization)
  • Phase 2: Full GLM implementation with IRLS, diagnostics, inference
  • Phase 3: GAM with B-splines, natural cubic, thin plate, GCV/REML
  • Phase 4: GAMM with random effects, PQL for non-Gaussian families
  • Phase 5 (current): Extended distributions, temporal covariance, sparse matrices
  • 494 tests with comprehensive coverage (5,731 new lines of test code)
  • Validation against statsmodels and R
  • GPU acceleration benchmarks (up to 141× speedup)
  • Multi-backend accuracy validation

In Progress 🚧 (Phase 5 - 85% complete)

  • PyPI package publication
  • Documentation with interactive examples
  • Research paper publication
  • Additional spatial covariance structures (exponential, Matérn)

Recently Completed 🎉

  • Comprehensive benchmark suite (1,591 lines): Multi-backend accuracy validation and GPU performance measurement
  • Bayesian GLM inference tests (1,060 lines): Prior specification, posterior sampling, convergence diagnostics
  • Zero-inflated count model tests (956 lines): ZIP and ZINB with excess zero validation
  • Hurdle count model tests (640 lines): Two-stage modeling of structural zeros
  • Smoothing method tests (1,083 lines): LOESS and P-spline implementation validation
  • Integration tests (401 lines): End-to-end workflows for count models

Total test coverage expansion: 5,731 lines across benchmarking, Bayesian inference, count models, and smoothing methods.

References and Mathematical Foundations

Aurora-GLM is built on rigorous statistical foundations with comprehensive mathematical documentation. For detailed mathematical formulations, proofs, and derivations, see REFERENCES.md.

Core Statistical Theory

Generalized Linear Models (GLM):

  • McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Chapman and Hall/CRC.
  • Nelder, J. A., & Wedderburn, R. W. M. (1972). "Generalized linear models." JRSS: Series A, 135(3), 370-384.

Generalized Additive Models (GAM):

  • Hastie, T., & Tibshirani, R. (1990). Generalized Additive Models. Chapman and Hall/CRC.
  • Wood, S. N. (2017). Generalized Additive Models: An Introduction with R (2nd ed.). CRC Press.
  • Wood, S. N. (2011). "Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models." JRSS: Series B, 73(1), 3-36.

Generalized Additive Mixed Models (GAMM):

  • Breslow, N. E., & Clayton, D. G. (1993). "Approximate inference in generalized linear mixed models." JASA, 88(421), 9-25.
  • Lin, X., & Breslow, N. E. (1996). "Bias correction in generalized linear mixed models with multiple components of dispersion." JASA, 91(435), 1007-1016.
  • Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). "Fitting linear mixed-effects models using lme4." Journal of Statistical Software, 67(1), 1-48.

Smoothing and Splines:

  • de Boor, C. (2001). A Practical Guide to Splines (Revised ed.). Springer.
  • Eilers, P. H. C., & Marx, B. D. (1996). "Flexible smoothing with B-splines and penalties." Statistical Science, 11(2), 89-121.
  • Craven, P., & Wahba, G. (1978). "Smoothing noisy data with spline functions." Numerische Mathematik, 31(4), 377-403.

Numerical Methods

Optimization:

  • Green, P. J. (1984). "Iteratively reweighted least squares for maximum likelihood estimation." JRSS: Series B, 46(2), 149-192.
  • Liu, D. C., & Nocedal, J. (1989). "On the limited memory BFGS method for large scale optimization." Mathematical Programming, 45(1-3), 503-528.

Numerical Stability:

  • Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press.
  • Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.). SIAM.

Reference Implementations (Validation)

  • R glm(): Base stats package (GLM reference)
  • R mgcv: GAM/GAMM by Simon Wood (gold standard for smoothing)
  • R lme4: Mixed models by Bates et al. (GLMM reference)
  • statsmodels: Python statistical modeling library
  • scikit-learn: API design patterns and conventions

Technical Resources

For a comprehensive list of mathematical foundations, algorithms, and validation references, including detailed equations and derivations, please see REFERENCES.md.

Contributing

Contributions are welcome! This project is in active development with many opportunities to help:

Areas for Contribution

  • Additional distributions - Zero-inflated, hurdle models
  • Performance optimizations - Further GPU optimizations, distributed computing
  • Documentation improvements - Tutorials, case studies, API reference
  • Real-world case studies - Applications in ecology, epidemiology, finance
  • Bug reports and fixes - Help improve stability and reliability

Contribution Guidelines

Please ensure:

  • Type hints on all public functions
  • Tests covering both NumPy and PyTorch backends
  • Docstrings in NumPy/Google format with examples
  • Code formatting with ruff format
  • Multi-backend support using the namespace pattern

Citation

If you use Aurora-GLM in your research, please cite it using the information in our CITATION.cff file, or use the following BibTeX entry:

@software{aurora_glm2025,
  title = {Aurora-GLM: Generalized Linear and Additive Models},
  author = {Arias, Lucy Eduardo},
  year = {2025},
  version = {0.7.0},
  url = {https://github.com/Matcraft94/Aurora-GLM},
  license = {MIT}
}

GitHub users can use the "Cite this repository" button in the right sidebar to automatically generate citations in various formats.

License

MIT License - see LICENSE file

Acknowledgments

Aurora-GLM draws inspiration from:

  • R's mgcv package by Simon Wood
  • R's lme4 package by Bates et al.
  • Python's statsmodels library
  • The JAX ecosystem for modern array programming

Special thanks to the open-source community for providing excellent tools and libraries.


Status: Phase 5 (85% complete) - Extended features, temporal covariance, GPU acceleration
Version: 0.7.0
Tests: 494 tests collected (5,731 new lines added: benchmarks, Bayesian, count models, smoothing)
Python: 3.10+
GPU: Up to 141× speedup with PyTorch CUDA
Accuracy: Validated against R and statsmodels (< 1e-11)
Maintained by: Lucy E. Arias (@Matcraft94)

Illuminating complex data with modern generalized linear and additive modeling tools.
