Offline policy evaluation for service-time minimization using Doubly Robust (DR) and Stabilized Doubly Robust (SNDR) estimators with time-aware splits and calibration. Now with pairwise evaluation and autoscaling support.
skdr-eval is a Python package for offline policy evaluation in service-time optimization scenarios. It implements state-of-the-art Doubly Robust (DR) and Stabilized Doubly Robust (SNDR) estimators with time-aware cross-validation and calibration. The package is designed for evaluating machine learning models that make decisions about service allocation, with special support for pairwise (client-operator) evaluation and autoscaling strategies.
- Features
- Installation
- Quick Start
- API Reference
- Theory
- Implementation Details
- Bootstrap Confidence Intervals
- Examples
- Development
- Citation
- 🎯 Doubly Robust Estimation: Implements both DR and Stabilized DR (SNDR) estimators
- ⏰ Time-Aware Evaluation: Uses time-series splits and calibrated propensity scores
- 🔧 Sklearn Integration: Easy integration with scikit-learn models
- 📊 Comprehensive Diagnostics: ESS, match rates, propensity score analysis
- 🚀 Production Ready: Type-hinted, tested, and documented
- 📈 Bootstrap Confidence Intervals: Moving-block bootstrap for time-series data
- 🤝 Pairwise Evaluation: Client-operator pairwise evaluation with autoscaling strategies
- 🎛️ Autoscaling: Direct, stream, and stream_topk strategies with policy induction
- 🧮 Choice Models: Conditional logit models for propensity estimation
Install from PyPI:

```bash
pip install skdr-eval
```

For choice models (conditional logit):

```bash
pip install skdr-eval[choice]
```

For speed optimizations (PyArrow, Polars):

```bash
pip install skdr-eval[speed]
```

For development:

```bash
git clone https://github.com/dandrsantos/skdr-eval.git
cd skdr-eval
pip install -e .[dev]
```

Quick start:

```python
import skdr_eval
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
# 1. Generate synthetic service logs
logs, ops_all, true_q = skdr_eval.make_synth_logs(n=5000, n_ops=5, seed=42)
# 2. Define candidate models
models = {
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=42),
    "HistGradientBoosting": HistGradientBoostingRegressor(random_state=42),
}
# 3. Evaluate models using DR and SNDR
report, detailed_results = skdr_eval.evaluate_sklearn_models(
    logs=logs,
    models=models,
    fit_models=True,
    n_splits=3,
    random_state=42,
)
# 4. View results
print(report[['model', 'estimator', 'V_hat', 'ESS', 'match_rate']])
```

Pairwise (client-operator) evaluation with autoscaling:

```python
import skdr_eval
from sklearn.ensemble import HistGradientBoostingRegressor, HistGradientBoostingClassifier
# 1. Generate synthetic pairwise data (client-operator pairs)
logs_df, op_daily_df = skdr_eval.make_pairwise_synth(
    n_days=5,
    n_clients_day=1000,
    n_ops=20,
    seed=42
)
# 2. Define models for different tasks
models = {
    "ServiceTime": HistGradientBoostingRegressor(random_state=42),
    "Binary": HistGradientBoostingClassifier(random_state=42),
}
# 3. Run pairwise evaluation with autoscaling
results = skdr_eval.evaluate_pairwise_models(
    logs_df=logs_df,
    op_daily_df=op_daily_df,
    models=models,
    autoscale_strategies=["direct", "stream", "stream_topk"],
    n_splits=3,
    random_state=42
)
# 4. View autoscaling results
for strategy, result in results.items():
    print(f"{strategy}: V_hat = {result['V_hat']:.4f}, ESS = {result['ESS']:.1f}")
```

`skdr_eval.make_synth_logs` - Generate synthetic service logs for evaluation.
Returns:

- `logs`: DataFrame with service logs
- `ops_all`: Index of all operator names
- `true_q`: Ground truth service times
Build design matrices from logs.

Returns:

- `Design`: dataclass with feature matrices and metadata
`skdr_eval.evaluate_sklearn_models` - Evaluate sklearn models using DR and SNDR estimators.

Parameters:

- `logs`: Service log DataFrame
- `models`: Dict of model name to sklearn estimator
- `fit_models`: Whether to fit models (default: True)
- `n_splits`: Number of time-series splits (default: 3)
- `random_state`: Random seed for reproducibility
`skdr_eval.evaluate_pairwise_models` - Evaluate models using pairwise (client-operator) evaluation with autoscaling.

Parameters:

- `logs_df`: Pairwise decision log DataFrame
- `op_daily_df`: Daily operator availability DataFrame
- `models`: Dict of model name to sklearn estimator
- `autoscale_strategies`: List of strategies ("direct", "stream", "stream_topk")
- `n_splits`: Number of time-series splits (default: 3)
- `random_state`: Random seed for reproducibility

Returns:

- Dict mapping strategy names to evaluation results
`skdr_eval.make_pairwise_synth` - Generate synthetic pairwise (client-operator) data for evaluation.

Parameters:

- `n_days`: Number of days to simulate
- `n_clients_day`: Number of clients per day
- `n_ops`: Number of operators
- `seed`: Random seed for reproducibility
- `binary`: Whether to generate binary outcomes (default: False)

Returns:

- `logs_df`: DataFrame with pairwise decisions
- `op_daily_df`: DataFrame with daily operator data
Fit propensity model with time-aware cross-validation and isotonic calibration.
Fit outcome model with cross-fitting. Supports 'hgb', 'ridge', 'rf', or custom estimators.
Compute DR and SNDR values with automatic clipping threshold selection.
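The exact threshold-selection rule is internal to the package; as a rough illustration of the idea (not the actual implementation), one simple heuristic is to scan a grid of clipping caps and keep the largest cap whose effective sample size still clears a floor:

```python
import numpy as np

def choose_clip_threshold(weights, grid=(50.0, 20.0, 10.0, 5.0, 2.0), min_ess_frac=0.1):
    """Hypothetical helper: pick the largest clip cap whose ESS clears a floor.

    A larger cap clips fewer weights (less bias) but leaves heavier tails;
    the ESS floor guards against the variance blowing up.
    """
    n = len(weights)
    for cap in grid:  # scan from least to most aggressive clipping
        w = np.clip(weights, 0.0, cap)
        ess = w.sum() ** 2 / (w ** 2).sum()
        if ess >= min_ess_frac * n:
            return cap
    return grid[-1]  # fall back to the most aggressive cap
```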
Compute confidence intervals using moving-block bootstrap for time-series data.
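A moving-block bootstrap resamples contiguous blocks of time-ordered observations, so short-range temporal dependence is preserved in each resample. The sketch below shows the mechanics for a generic per-sample statistic; the block length, statistic, and function name are illustrative placeholders, not the package's defaults:

```python
import numpy as np

def moving_block_bootstrap_ci(values, stat=np.mean, block_len=50, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for `stat` via a moving-block bootstrap on a time-ordered array."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    n = len(values)
    block_len = min(block_len, n)
    n_blocks = int(np.ceil(n / block_len))
    starts = np.arange(n - block_len + 1)  # every admissible block start
    boot_stats = np.empty(n_boot)
    for b in range(n_boot):
        chosen = rng.choice(starts, size=n_blocks, replace=True)
        resample = np.concatenate([values[s:s + block_len] for s in chosen])[:n]
        boot_stats[b] = stat(resample)
    lo, hi = np.quantile(boot_stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```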
Doubly Robust (DR) estimation provides unbiased policy evaluation when either the propensity model OR the outcome model is correctly specified. The estimator is:
V̂_DR = (1/n) Σ [q̂_π(x_i) + w_i * (y_i - q̂(x_i, a_i))]
Stabilized DR (SNDR) reduces variance by normalizing importance weights:
V̂_SNDR = (1/n) Σ q̂_π(x_i) + [Σ w_i * (y_i - q̂(x_i, a_i))] / [Σ w_i]
Where:

- `q̂_π(x_i)` = expected outcome under evaluation policy π
- `q̂(x_i, a_i)` = outcome model prediction
- `w_i = π(a_i|x_i) / e(a_i|x_i)` = importance weight (clipped)
- `e(a_i|x_i)` = propensity score (calibrated)
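Written out in NumPy, both estimators are only a few lines. The sketch below is illustrative rather than the package's internal code; the arrays `y`, `q_pi`, `q_logged`, `pi_target`, and `propensity` are assumed to come from your own outcome, policy, and propensity models:

```python
import numpy as np

def dr_and_sndr(y, q_pi, q_logged, pi_target, propensity, clip=10.0):
    """Illustrative DR / SNDR estimates from pre-computed model outputs.

    y          : observed outcomes for the logged actions
    q_pi       : q̂_π(x_i), outcome-model estimate of the target policy's expected outcome
    q_logged   : q̂(x_i, a_i), outcome-model prediction for the logged action
    pi_target  : π(a_i | x_i), target-policy probability of the logged action
    propensity : e(a_i | x_i), calibrated logging propensity
    """
    w = np.clip(pi_target / propensity, 0.0, clip)              # clipped importance weights
    residual = y - q_logged
    v_dr = np.mean(q_pi + w * residual)                          # Doubly Robust
    v_sndr = np.mean(q_pi) + np.sum(w * residual) / np.sum(w)    # Stabilized (self-normalized) DR
    ess = np.sum(w) ** 2 / np.sum(w ** 2)                        # effective sample size of the weights
    return v_dr, v_sndr, ess
```

For the pairwise setting, the package supports three autoscaling strategies: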
- Direct: Uses the logging policy directly without modification
- Stream: Induces a policy from sklearn models and applies it to streaming decisions
- Stream TopK: Similar to stream but restricts choices to top-K operators based on predicted service times
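As a concrete (and simplified) illustration of the stream_topk idea, the sketch below processes clients in arrival order, restricts each choice to the K operators with the lowest predicted service time, and respects a simple per-operator capacity. The function and its inputs are placeholders, not part of the package API:

```python
import numpy as np

def stream_topk_assign(pred_service_time, capacity, k=3):
    """Streaming top-K assignment sketch.

    pred_service_time : (n_clients, n_ops) predicted service times, clients in arrival order
    capacity          : (n_ops,) remaining per-operator capacity for the day
    k                 : number of best-predicted operators each client may be routed to
    """
    capacity = np.asarray(capacity, dtype=int).copy()
    assignments = []
    for row in np.asarray(pred_service_time):          # clients arrive in time order
        topk = np.argsort(row)[:k]                     # K operators with the lowest predicted time
        available = [op for op in topk if capacity[op] > 0]
        op = available[0] if available else int(np.argmin(row))  # fall back if the top K are saturated
        capacity[op] -= 1
        assignments.append(int(op))
    return np.asarray(assignments)
```

Key implementation details of the evaluation pipeline: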
- Time-Series Aware: Uses `TimeSeriesSplit` for all cross-validation with temporal ordering
- Calibrated Propensities: Per-fold isotonic calibration via `CalibratedClassifierCV` (see the sketch after this list)
- Automatic Clipping: Smart threshold selection to minimize variance while maintaining ESS
- Comprehensive Diagnostics: ESS, match rates, propensity quantiles, and tail mass analysis
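A rough sketch of how per-fold calibrated propensities can be obtained with scikit-learn. The feature matrix `X`, action labels `a`, the base classifier, and the helper name are assumptions made for illustration; the package's internal routine may differ:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import TimeSeriesSplit

def fit_calibrated_propensities(X, a, n_splits=3):
    """Out-of-fold propensities e(a_i | x_i) with isotonic calibration per temporal fold."""
    propensity = np.full(len(a), np.nan)  # rows before the first test fold stay NaN
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        clf = CalibratedClassifierCV(
            HistGradientBoostingClassifier(), method="isotonic", cv=3
        )
        clf.fit(X[train_idx], a[train_idx])
        proba = clf.predict_proba(X[test_idx])
        # Column of the probability matrix matching each row's logged action
        # (assumes every logged action appears in the training fold)
        cols = np.searchsorted(clf.classes_, a[test_idx])
        propensity[test_idx] = proba[np.arange(len(test_idx)), cols]
    return propensity
```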
For time-series data, use moving-block bootstrap:
```python
# Enable bootstrap CIs
report, _ = skdr_eval.evaluate_sklearn_models(
    logs=logs,
    models=models,
    ci_bootstrap=True,
    alpha=0.05,  # 95% confidence
)

print(report[['model', 'estimator', 'V_hat', 'ci_lower', 'ci_upper']])
```

See examples/quickstart.py for a complete example, or run:
```bash
python examples/quickstart.py
```

To set up a development environment:

```bash
git clone https://github.com/dandrsantos/skdr-eval.git
cd skdr-eval
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -e .[dev]
```

Run the tests:

```bash
pytest -v
```

Lint, format, and type-check:

```bash
ruff check src/ tests/ examples/
ruff format src/ tests/ examples/
mypy src/skdr_eval/
```

Set up pre-commit hooks:

```bash
pre-commit install
pre-commit run --all-files
```

Build the package:

```bash
python -m build
```

This package uses Trusted Publishing (PEP 740) for secure PyPI releases.
To publish a release:

- Create a GitHub release with a version tag (e.g., `v0.1.0`)
- The `release.yml` workflow will automatically build and publish

If Trusted Publishing is not configured:

- Set up a PyPI API token: https://pypi.org/manage/account/token/
- Build the package: `python -m build`
- Upload: `twine upload dist/*`

To configure Trusted Publishing:

- Go to https://pypi.org/manage/project/skdr-eval/settings/publishing/
- Add the GitHub repository as a trusted publisher:
  - Repository: `dandrsantos/skdr-eval`
  - Workflow: `release.yml`
  - Environment: `release`
If you use this software in your research, please cite:
```bibtex
@software{santos2024skdr,
  title = {skdr-eval: Offline Policy Evaluation for Service-Time Minimization},
  author = {Santos, Diogo},
  year = {2024},
  url = {https://github.com/dandrsantos/skdr-eval},
  version = {0.1.0}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Built with scikit-learn for machine learning
- Uses pandas for data manipulation
- Follows PEP 621 for project metadata