A production-grade statistical arbitrage (stat-arb) trading system that identifies market mispricings through quantitative factor analysis, portfolio optimization, and systematic execution. The system processes historical market data, generates alpha signals from multiple strategies, optimizes portfolio positions considering transaction costs and risk, and backtests trading strategies through multiple simulation engines.
This system implements a complete workflow for statistical arbitrage trading:
- Data Loading & Preprocessing: Loads and processes market data from multiple sources
- Alpha Generation: Calculates predictive signals from 20+ trading strategies
- Factor Analysis: Decomposes returns using PCA and Barra risk models
- Portfolio Optimization: Maximizes risk-adjusted returns with realistic constraints
- Backtesting: Simulates execution across multiple engines with transaction cost modeling
The system is designed for daily rebalancing across ~1,400 US equities with sophisticated risk management and execution cost modeling.
- Features
- Architecture
- Installation
- Quick Start
- Data Requirements
- Usage
- Strategies
- Simulation Engines
- Configuration
- Project Structure
- Salamander Module
- Performance Metrics
- Multi-Source Data Integration: Daily/intraday prices, Barra factors, analyst estimates, short locates
- 20+ Alpha Strategies: PCA decomposition, analyst signals, momentum, mean reversion, order flow
- Advanced Optimization: NLP solver with factor risk, transaction costs, and participation constraints
- Multiple Simulation Engines: Daily (BSIM), order-level (OSIM), intraday (QSIM), full system (SSIM)
- Risk Management: Factor exposure limits, position sizing, sector neutrality
- Realistic Execution Modeling: Market impact, slippage, borrow costs, VWAP vs. close fills
- HDF5 Caching: Fast data loading with compressed storage
- Vectorized Operations: Efficient pandas/numpy operations for large datasets
- Rolling Window Analysis: Adaptive factor models with 30-60 day windows
- Winsorization: Robust outlier handling at 5-sigma levels
- Corporate Action Handling: Automatic adjustment for splits and dividends
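To make the winsorization feature concrete, here is a minimal Python 3 sketch. It assumes "5-sigma" means clipping each value to the cross-sectional mean plus or minus five standard deviations; the helper name `winsorize_sigma` is hypothetical, and the routine in calc.py may differ in detail.

```python
import statistics

def winsorize_sigma(values, n_sigma=5.0):
    """Clip each value to mean +/- n_sigma population standard deviations."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    lower, upper = mean - n_sigma * std, mean + n_sigma * std
    return [min(max(v, lower), upper) for v in values]

# 99 ordinary daily returns plus one extreme outlier (illustrative data)
returns = [0.01] * 50 + [-0.01] * 49 + [5.0]
clipped = winsorize_sigma(returns)
print(max(returns), round(max(clipped), 3))
```

Only the outlier is pulled in; the ordinary returns sit well inside the band and pass through unchanged.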
Raw Market Data (CSV/SQL)
↓
Load & Merge (loaddata.py)
↓
Calculate Returns & Features (calc.py)
↓
Filter Tradable Universe
↓
Generate Alpha Signals (strategy files)
↓
Fit Regression Coefficients (regress.py)
↓
PCA Decomposition (pca.py) [optional]
↓
Portfolio Optimization (opt.py)
↓
Simulation Engines (bsim/osim/qsim/ssim)
↓
Performance Analysis & Reporting
| Component | File | Description |
|---|---|---|
| Data Loading | `loaddata.py` | Load market data, fundamentals, analyst estimates |
| Calculations | `calc.py` | Forward returns, volume profiles, winsorization |
| Regression | `regress.py` | Fit alpha factors to forward returns (WLS) |
| PCA | `pca.py` | Principal component decomposition |
| Optimization | `opt.py` | Portfolio optimization with OpenOpt NLP |
| Big Sim | `bsim.py` | Daily rebalancing backtest |
| Order Sim | `osim.py` | Order-level execution backtest |
| Quote Sim | `qsim.py` | Intraday 30-min bar backtest |
| System Sim | `ssim.py` | Full lifecycle position tracking |
| Utilities | `util.py` | Helper functions for data merging |
CRITICAL: This codebase has two separate Python environments:

Main codebase (Python 2.7):
- All core modules: loaddata, calc, regress, opt, util
- All simulation engines: bsim, osim, qsim, ssim
- All alpha strategies: hl, bd, analyst, eps, etc.
- All production modules: prod_sal, prod_eps, prod_rtg, prod_tgt
- Reason: OpenOpt dependency (not Python 3 compatible)

Salamander module (Python 3):
- Location: `salamander/` directory
- Purpose: Simplified, standalone, Python 3 compatible version
- Use Case: Modern deployments, easier development
- Separate Dependencies: `salamander/requirements.txt`
Migration Note: The salamander module provides a migration path to Python 3, with simplified data pipelines and compatible optimization. Consider using salamander for new development.
Python 2.7.x
numpy==1.16.0
pandas==0.23.4
OpenOpt==0.5628
FuncDesigner==0.5628
statsmodels
scikit-learn
matplotlib
scipy
lmfit
tables (PyTables for HDF5)
mysql-connector-python (optional, for SQL data sources)
Python 3.6+
numpy>=1.19.0
pandas>=1.1.0
scipy>=1.5.0
scikit-learn>=0.23.0
matplotlib>=3.3.0
tables>=3.6.0
lmfit>=1.0.0
Main codebase (Python 2.7):

```bash
# Clone the repository
git clone https://github.com/yourusername/statarb.git
cd statarb
# Create Python 2.7 virtual environment (if using virtualenv)
virtualenv -p python2.7 venv27
source venv27/bin/activate
# Install dependencies
pip install -r requirements.txt
# Install OpenOpt (may require manual installation)
pip install openopt funcdesigner
# Optional: Build Cython optimization module
python setup.py build_ext --inplace
# Verify installation
python -c "from loaddata import *; print('Success')"
```

Salamander module (Python 3):

```bash
cd salamander
# Create Python 3 virtual environment
python3 -m venv venv3
source venv3/bin/activate
# Install dependencies
pip install -r requirements.txt
# Verify installation
python3 -c "from loaddata import *; print('Success')"
```

- OpenOpt Installation:
  - OpenOpt is no longer actively maintained
  - May require the `--no-deps` flag: `pip install --no-deps openopt`
  - Alternative: Install from source
- NumPy/Pandas Compatibility:
  - Python 2.7 requires specific versions (NumPy 1.16, Pandas 0.23)
  - Newer versions break Python 2.7 compatibility
- PyTables (HDF5):
  - Required for HDF5 file operations
  - `pip install tables` may require HDF5 C libraries
  - Ubuntu: `sudo apt-get install libhdf5-dev`
  - macOS: `brew install hdf5`
- MySQL Connector (optional):
  - Only needed if using SQL data sources
  - `pip install mysql-connector-python`
Set the base directories in `loaddata.py`:

```python
UNIV_BASE_DIR = "/path/to/universe/"
PRICE_BASE_DIR = "/path/to/prices/"
BARRA_BASE_DIR = "/path/to/barra/"
BAR_BASE_DIR = "/path/to/bars/"
EARNINGS_BASE_DIR = "/path/to/earnings/"
LOCATES_BASE_DIR = "/path/to/locates/"
ESTIMATES_BASE_DIR = "/path/to/estimates/"
```

```bash
# Run BSIM with a single alpha signal
python bsim.py --start=20130101 --end=20130630 \
  --fcast=hl:1:1 \
  --kappa=2e-8 \
  --maxnot=200e6
```

```bash
# Combine high-low and beta-adjusted signals
python bsim.py --start=20130101 --end=20130630 \
  --fcast=hl:1:0.6,bd:0.8:0.4 \
  --kappa=2e-8
```

Edit the following constants in `loaddata.py` to point to your data directories:

```python
UNIV_BASE_DIR = "/path/to/universe/"        # Stock universe files
PRICE_BASE_DIR = "/path/to/prices/"         # Daily OHLCV data
BARRA_BASE_DIR = "/path/to/barra/"          # Barra risk model factors
BAR_BASE_DIR = "/path/to/bars/"             # Intraday 30-min bars
EARNINGS_BASE_DIR = "/path/to/earnings/"    # Earnings announcement dates
LOCATES_BASE_DIR = "/path/to/locates/"      # Short borrow availability
ESTIMATES_BASE_DIR = "/path/to/estimates/"  # Analyst estimates (IBES)
```

Purpose: Define tradable stock universe for each date

Required Columns:
- `sid` (int): Security identifier
- `ticker_root` (str): Stock ticker symbol
- `status` (str): Trading status (e.g., 'ACTIVE')
- `country` (str): Country code (filter: 'USA')
- `currency` (str): Currency code (filter: 'USD')

File Format: CSV with daily snapshots
File Naming: YYYYMMDD.csv (e.g., 20130115.csv)
Directory Structure: UNIV_BASE_DIR/YYYY/YYYYMMDD.csv
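The naming convention above can be turned into a path with a few lines of Python. This is only a sketch: `daily_file_path` is a hypothetical helper, not a function from loaddata.py, and the base directory is a placeholder.

```python
import os

def daily_file_path(base_dir, date_str, ext="csv"):
    """Build <base_dir>/YYYY/YYYYMMDD.<ext> from a YYYYMMDD date string."""
    if len(date_str) != 8 or not date_str.isdigit():
        raise ValueError("expected a date in YYYYMMDD form, got %r" % date_str)
    year = date_str[:4]
    return os.path.join(base_dir, year, "%s.%s" % (date_str, ext))

# Placeholder base directory, mirroring the UNIV_BASE_DIR layout
path = daily_file_path("/path/to/universe", "20130115")
print(path)
```

Validating the date string up front gives a clear error for malformed dates instead of a confusing "file not found" later.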
Purpose: Daily OHLCV market data

Required Columns:
- `sid` (int): Security identifier
- `ticker` (str): Full ticker symbol
- `open`, `high`, `low`, `close` (float): Daily prices
- `volume` (int): Share volume
- `mkt_cap` (float): Market capitalization ($)
- `advp` (float): Average dollar volume (calculated)

File Format: CSV with daily snapshots
Data Quality: Handle splits, dividends, delistings
Lookback: Minimum 60 days for rolling calculations
Purpose: Multi-factor risk model exposures

Required Barra Factors (13):
- `country` - Country factor
- `growth` - Growth factor
- `size` - Market cap factor
- `sizenl` - Non-linear size
- `divyild` - Dividend yield
- `btop` - Book-to-price
- `earnyild` - Earnings yield
- `beta` - Market beta
- `resvol` - Residual volatility
- `betanl` - Non-linear beta
- `momentum` - Price momentum
- `leverage` - Financial leverage
- `liquidty` - Trading liquidity

Industry Classifications: 58 GICS industries (ind1-ind58)

Additional Columns:
- `barraResidRet` - Idiosyncratic returns
- `barraSpecRisk` - Stock-specific risk
- Factor covariance matrix (separate file or embedded)

File Format: CSV with standardized Barra format
Update Frequency: Daily
Lag: 1-day lag to prevent look-ahead bias
Purpose: Intraday 30-minute bars for QSIM and intraday strategies

Required Columns:
- `iclose` (float): Interval close price
- `ivwap` (float): Interval VWAP
- `ivol` (int): Interval volume
- `dhigh`, `dlow` (float): Daily high/low (broadcast to all intervals)

File Format: HDF5 with MultiIndex (timestamp, sid)
Timestamps: 30-minute intervals (13 per day: 09:30-16:00)
Compression: LZF or gzip for storage efficiency

Example Structure:

```python
import pandas as pd

df = pd.read_hdf('20130115.h5', 'bars')
# MultiIndex: [(timestamp1, sid1), (timestamp1, sid2), ...]
# Columns: iclose, ivwap, ivol, dhigh, dlow
```

Purpose: Earnings announcement dates for event avoidance/exploitation

Required Columns:
- `sid` (int): Security identifier
- `date` (datetime): Announcement date
- `eps_actual` (float): Reported EPS
- `eps_estimate` (float): Consensus estimate
- `surprise` (float): Actual - estimate

File Format: CSV with all historical earnings
Coverage: Quarterly earnings for all universe stocks
Purpose: Short borrow availability and fee rates

Required Columns:
- `sid` or `SEDOL` (int/str): Security identifier
- `shares` (int): Available shares to borrow
- `fee` (float): Annual borrow fee rate (%)
- `symbol` (str): Stock ticker

File Format: Pipe-delimited CSV
Update Frequency: Daily or weekly
Usage: Constrain short positions to available borrows
Purpose: Analyst estimates and revisions (IBES database)

Required Columns:
- `sid` (int): Security identifier
- `mean` (float): Consensus EPS estimate
- `median` (float): Median estimate
- `std` (float): Estimate standard deviation
- `num_estimates` (int): Number of analysts
File Format: CSV with date snapshots
Data Source: I/B/E/S or equivalent analyst database
Strategies Using: analyst.py, ebs.py, prod_sal.py
Tradable universe:
- Price Range: $2.00 - $500.00
- Min ADV: $1,000,000 (average dollar volume)
- Country: USA
- Currency: USD
- Market Cap: Top 1,400 stocks (configurable via `uni_size`)

Expandable universe:
- Price Range: $2.25 - $500.00
- Min ADV: $5,000,000
- Purpose: Wider universe for alpha generation, narrower for execution

Additional filters:
- Sector Exclusions: PHARMA industry can be excluded
- Earnings Avoidance: Optional N-day window around earnings
- Locate Requirements: Short positions require borrow availability
- Data Quality: Require non-null prices, volume, Barra factors
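The tradable-universe rules above can be sketched as a simple filter. This is an illustration rather than the loaddata.py implementation: the function name `filter_tradable`, the dictionary-based records, the inclusive price bounds, and applying the market-cap cut after the other filters are all assumptions.

```python
# Thresholds copied from the tradable-universe settings quoted in this README.
T_LOW_PRICE = 2.0
T_HIGH_PRICE = 500.0
T_MIN_ADVP = 1000000.0
UNI_SIZE = 1400

def filter_tradable(stocks, uni_size=UNI_SIZE):
    """Return sids passing country, currency, price and ADV filters,
    capped to the top `uni_size` names by market cap."""
    eligible = [
        s for s in stocks
        if s["country"] == "USA"
        and s["currency"] == "USD"
        and T_LOW_PRICE <= s["close"] <= T_HIGH_PRICE
        and s["advp"] >= T_MIN_ADVP
    ]
    eligible.sort(key=lambda s: s["mkt_cap"], reverse=True)
    return [s["sid"] for s in eligible[:uni_size]]

stocks = [
    {"sid": 1, "country": "USA", "currency": "USD", "close": 50.0, "advp": 5e6, "mkt_cap": 2e9},
    {"sid": 2, "country": "USA", "currency": "USD", "close": 1.5, "advp": 9e6, "mkt_cap": 5e9},    # price too low
    {"sid": 3, "country": "USA", "currency": "USD", "close": 120.0, "advp": 4e5, "mkt_cap": 8e9},  # ADV too low
    {"sid": 4, "country": "USA", "currency": "USD", "close": 30.0, "advp": 2e6, "mkt_cap": 9e9},
]
print(filter_tradable(stocks))  # [4, 1]
```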
1. Organize Directory Structure:
   ```
   /data/
   ├── universe/YYYY/YYYYMMDD.csv
   ├── prices/YYYY/YYYYMMDD.csv
   ├── barra/YYYY/YYYYMMDD.csv
   ├── bars/YYYY/YYYYMMDD.h5
   ├── earnings/earnings.csv
   ├── locates/borrow.csv
   └── estimates/sal_YYYY/YYYYMMDD.csv
   ```
2. Configure Paths: Edit constants in `loaddata.py`
3. Validate Data:
   ```bash
   python readcsv.py  # Check data integrity
   ```
4. Generate HDF5 Cache (optional):
   ```bash
   # HDF5 cache created automatically on first load
   # Significantly speeds up repeated backtests
   ```
5. Test Data Loading:
   ```python
   from loaddata import *
   df = load_prices('20130101', '20130131', lookback=30)
   df.info()  # info() prints its summary directly and returns None
   ```
The optimization module (opt.py) maximizes:
Utility = Alpha - κ(Specific Risk + Factor Risk) - Slippage - Execution Costs
Key Parameters:
- `kappa`: Risk aversion (2e-8 to 4.3e-5)
- `max_sumnot`: Max total notional ($50M default)
- `max_posnot`: Max position size (0.48% of capital)
- `slip_nu`: Market impact coefficient (0.14-0.18)
Constraints:
- Position limits: ±$40k-$1M per stock
- Capital limits: $4-50M aggregate notional
- Participation: Max 1.5% of ADV
- Factor exposure: Limited Barra factor bets
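The utility expression above can be evaluated for a toy portfolio as follows. This is a sketch under stated assumptions, not the objective coded in opt.py: positions are dollar amounts, specific risk is the sum of squared positions times specific variance, factor risk is the usual quadratic form in portfolio factor exposures, and slippage and execution costs are passed in as precomputed dollar amounts because their functional form is not spelled out here. The function name `utility` and all inputs are hypothetical.

```python
def utility(positions, alphas, spec_var, exposures, factor_cov,
            kappa, slippage=0.0, exec_cost=0.0):
    """Alpha - kappa * (specific risk + factor risk) - slippage - exec cost."""
    alpha_term = sum(x * a for x, a in zip(positions, alphas))
    specific_risk = sum(x * x * v for x, v in zip(positions, spec_var))
    # Portfolio exposure to each factor k: sum_i x_i * B[i][k]
    n_factors = len(factor_cov)
    port_exp = [sum(x * row[k] for x, row in zip(positions, exposures))
                for k in range(n_factors)]
    factor_risk = sum(port_exp[j] * factor_cov[j][k] * port_exp[k]
                      for j in range(n_factors) for k in range(n_factors))
    return alpha_term - kappa * (specific_risk + factor_risk) - slippage - exec_cost

# Two-stock, dollar-neutral example with a single (beta) factor
positions = [100000.0, -100000.0]   # dollars long / short
alphas = [0.001, -0.0005]           # expected returns
spec_var = [0.0004, 0.0004]         # specific variance per stock
exposures = [[1.0], [1.0]]          # both stocks have beta 1
factor_cov = [[0.0001]]
exec_cost = 0.00015 * sum(abs(x) for x in positions)  # 1.5 bps of traded notional

u = utility(positions, alphas, spec_var, exposures, factor_cov,
            kappa=2e-8, exec_cost=exec_cost)
print(round(u, 2))
```

Because the long and short legs have equal beta, the portfolio factor exposure nets to zero and only specific risk is penalized in this example.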
Most comprehensive daily backtest with optimized positions:

```bash
python bsim.py \
  --start=20130101 \
  --end=20130630 \
  --fcast=hl:1:0.5,bd:0.8:0.3,pca:1.2:0.2 \
  --horizon=3 \
  --kappa=2e-8 \
  --maxnot=200e6 \
  --locates=True \
  --vwap=False
```

Arguments:
- `--start`/`--end`: Date range (YYYYMMDD)
- `--fcast`: Alpha signals (format: `name:multiplier:weight`)
- `--horizon`: Forecast horizon in days
- `--kappa`: Risk aversion parameter
- `--maxnot`: Maximum notional
- `--vwap`: Use VWAP execution (default: close)

Order-level backtest with fill strategy analysis:

```bash
python osim.py \
  --start=20130101 \
  --end=20130630 \
  --fill=vwap \
  --slipbps=0.0001 \
  --fcast=alpha_files
```

30-minute bar simulation for intraday strategies:

```bash
python qsim.py \
  --start=20130101 \
  --end=20130630 \
  --fcast=qhl_intra \
  --horizon=3 \
  --mult=1000 \
  --slipbps=0.0001
```

Full lifecycle with position and cash tracking:

```bash
python ssim.py \
  --start=20130101 \
  --end=20131231 \
  --fcast=combined_alpha
```

The system implements 35 alpha strategies across 7 strategy families, each with multiple variants optimized for different market conditions and time horizons.
Core Concept: Exploits mean reversion of prices relative to daily high-low geometric midpoint.
Signal Formula: hl0 = close / sqrt(high * low)
Variants:
- `hl.py` - Base strategy with daily + intraday signals, industry-demeaned
- `hl_intra.py` - Intraday-only for high-frequency trading
- `qhl_intra.py` - Quote-level intraday with hourly coefficients
- `qhl_multi.py` - Multi-period daily signals (1-5 day lags)
- `qhl_both.py` - Combined daily + intraday optimization
- `qhl_both_i.py` - Industry-specific variant with sector models
Characteristics:
- Holding period: 1-3 days
- Turnover: High (daily rebalancing)
- Industry neutral via demeaning
- Negative coefficients (mean reversion)
Usage: --fcast=hl:1:1 or --fcast=qhl_intra:1.2:0.5
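A minimal sketch of the high-low signal and its industry demeaning, assuming plain Python dictionaries keyed by sid. The helper names `hl0` and `industry_demean` are illustrative; hl.py works on pandas DataFrames and may apply further transformations.

```python
import math
from collections import defaultdict

def hl0(close, high, low):
    """Close relative to the geometric midpoint of the daily high and low."""
    return close / math.sqrt(high * low)

def industry_demean(signals, industries):
    """Subtract each industry's mean signal from its member stocks."""
    groups = defaultdict(list)
    for sid, value in signals.items():
        groups[industries[sid]].append(value)
    means = {ind: sum(vals) / len(vals) for ind, vals in groups.items()}
    return {sid: value - means[industries[sid]] for sid, value in signals.items()}

# Illustrative inputs: two stocks in industry "A", one in industry "B"
signals = {
    1: hl0(close=10.0, high=16.0, low=4.0),   # 10 / 8 = 1.25
    2: hl0(close=6.0, high=16.0, low=4.0),    # 6 / 8 = 0.75
    3: hl0(close=100.0, high=100.0, low=100.0),
}
industries = {1: "A", 2: "A", 3: "B"}
demeaned = industry_demean(signals, industries)
print(demeaned)  # {1: 0.25, 2: -0.25, 3: 0.0}
```

A value above 1 means the close sits above the high-low midpoint; with the negative fitted coefficients noted above, such stocks receive a negative forecast.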
Core Concept: Two distinct approaches - order flow imbalance and beta-adjusted returns.
Order Flow Signal: bd = (bidqty - askqty) / (bidqty + askqty) / beta
Return Signal: badj = return / beta
Variants:
- `bd.py` - Base order flow strategy with volume weighting
- `bd1.py` - Simplified daily order flow
- `bd_intra.py` - Intraday order flow (6 hourly periods)
- `badj_multi.py` - Multi-lag beta-adjusted returns
- `badj_intra.py` - Intraday return signals
- `badj_both.py` - Combined daily + intraday returns
- `badj_dow_multi.py` - Day-of-week specific models
- `badj2_multi.py` - Alternative beta adjustment methodology
- `badj2_intra.py` - Alternative intraday implementation
Characteristics:
- Exploits market microstructure inefficiencies
- Removes systematic market component via beta
- Separate Energy sector models
- Multiple time horizon variants
Usage: --fcast=bd:0.8:0.4 or --fcast=badj_multi:1:0.3
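The two signal formulas above translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and returning a neutral 0.0 when the denominator is zero is an assumption, not documented behavior of bd.py.

```python
def order_flow_signal(bidqty, askqty, beta):
    """bd = (bidqty - askqty) / (bidqty + askqty) / beta."""
    total = bidqty + askqty
    if total == 0 or beta == 0:
        return 0.0  # assumption: treat an undefined signal as neutral
    return (bidqty - askqty) / float(total) / beta

def beta_adjusted_return(ret, beta):
    """badj = return / beta."""
    if beta == 0:
        return 0.0  # assumption: treat an undefined signal as neutral
    return ret / beta

# 600 shares bid vs 400 offered on a stock with beta 1.25
bd = order_flow_signal(600, 400, 1.25)
badj = beta_adjusted_return(0.02, 1.25)
print(round(bd, 4), round(badj, 4))
```

The imbalance term is bounded between -1 and 1 before the beta scaling, so high-beta names are scaled down relative to low-beta names with the same order imbalance.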
Core Concept: Fundamental signals from analyst ratings, estimates, and price targets.
Data Source: IBES (I/B/E/S) analyst database
Variants:
- `analyst.py` - Base analyst rating/estimate changes
- `analyst_badj.py` - Beta-adjusted analyst signals
- `rating_diff.py` - Rating change momentum with cubed amplification
- `rating_diff_updn.py` - Separate up/down revision models
Characteristics:
- Low frequency updates (weekly/monthly)
- Fundamental value capture
- Consensus change indicators
- Asymmetric up/down responses
Usage: --fcast=analyst:1.5:0.15
Core Concept: Event-driven strategies based on earnings surprises and analyst targets.
Variants:
- `eps.py` - Post-earnings announcement drift (PEAD)
- `target.py` - Price target deviation signals
- `prod_tgt.py` - Production target strategy with filtering
Characteristics:
- Event-driven (quarterly earnings)
- Exploits analyst forecast errors
- Price target revisions
- Earnings surprise magnitude
Usage: --fcast=eps:1:0.1
Core Concept: Volume-return interaction patterns with liquidity-aware position sizing.
Signal Formula: vadj = (volume/median_volume - 1) * beta_adj_return
Variants:
- `vadj.py` - Full model with daily + intraday signals
- `vadj_multi.py` - Daily-only multi-period version
- `vadj_intra.py` - Intraday-only for short-term trading
- `vadj_pos.py` - Position sizing emphasis with sign-based signals
- `vadj_old.py` - Legacy implementation (deprecated)
Characteristics:
- Market-wide volume adjustment
- Industry neutralization
- Hourly coefficient fitting for intraday
- Execution quality focus
Usage: --fcast=vadj:1:0.2
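The volume-adjusted formula can be sketched as below. The function name `vadj_signal` and the five-day volume history are assumptions for illustration; the lookback used for the median in vadj.py is not stated here.

```python
import statistics

def vadj_signal(volume, volume_history, beta_adj_return):
    """vadj = (volume / median_volume - 1) * beta_adj_return."""
    median_volume = statistics.median(volume_history)
    return (volume / float(median_volume) - 1.0) * beta_adj_return

# Hypothetical trailing daily volumes; today's volume is twice the median
history = [1000000, 1200000, 800000, 1100000, 900000]
signal = vadj_signal(2000000, history, 0.01)
quiet = vadj_signal(1000000, history, 0.01)
print(round(signal, 4), round(quiet, 4))
```

A day with median volume produces a zero signal regardless of the return, while a return on unusually heavy volume is amplified.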
Core Concept: Statistical decomposition to isolate stock-specific noise for mean reversion.
Variants:
- `pca_generator.py` - Intraday PCA residual extraction (4 components)
- `pca_generator_daily.py` - Daily PCA with exp-weighted correlation
- `rrb.py` - Barra residual return betting (idiosyncratic returns)
Characteristics:
- Market-neutral by construction
- Rolling correlation matrices (10-period)
- Excludes Energy sector
- Multi-day lag combinations
Usage: --fcast=pca:1.2:0.2 or --fcast=rrb:1:0.15
Variants:
- `c2o.py` - Close-to-open gap trading with intraday timing
- `mom_year.py` - 232-day lagged momentum (annual reversal)
- `ebs.py` - Analyst estimate revision signals (not equity borrow!)
- `htb.py` - Hard-to-borrow fee rate strategy (short squeeze detection)
- `badj_rating.py` - Beta-adjusted rating strategy
Usage: --fcast=c2o:1:0.1,mom:1:0.05
Combine multiple alphas with optimized weights:

```bash
# Multi-strategy portfolio
python bsim.py \
  --start=20130101 \
  --end=20130630 \
  --fcast=pca:1.0:0.3,hl:1.2:0.25,bd:0.8:0.2,analyst:1.5:0.15,vadj:1:0.1
```

Forecast Format: `name:multiplier:weight`
- `name`: Strategy file name (without .py)
- `multiplier`: Scaling factor applied to alpha signal
- `weight`: Portfolio weight (should sum to 1.0)
Weight Optimization: Use bsim_weights.py for grid search over weight combinations
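To illustrate the `name:multiplier:weight` format, here is a small parser sketch. It is not the argument handling in bsim.py, which is not shown here; `parse_fcast` and the `Forecast` tuple are hypothetical names, and the sketch only covers the three-field form.

```python
from collections import namedtuple

Forecast = namedtuple("Forecast", ["name", "multiplier", "weight"])

def parse_fcast(spec):
    """Split 'name:mult:weight,name:mult:weight,...' into Forecast tuples."""
    forecasts = []
    for item in spec.split(","):
        parts = item.strip().split(":")
        if len(parts) != 3:
            raise ValueError("expected name:multiplier:weight, got %r" % item)
        name, multiplier, weight = parts
        forecasts.append(Forecast(name, float(multiplier), float(weight)))
    return forecasts

forecasts = parse_fcast("hl:1:0.6,bd:0.8:0.4")
total_weight = sum(f.weight for f in forecasts)
for f in forecasts:
    print(f.name, f.multiplier, f.weight)
print("weights sum to 1.0:", abs(total_weight - 1.0) < 1e-9)
```

Checking the weight total after parsing is a cheap guard against typos in long multi-strategy specifications.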
1. Develop Alpha: Create new strategy file with alpha calculation
   - Load data using `loaddata.py` functions
   - Calculate signal (e.g., price ratios, volume patterns, fundamental ratios)
   - Apply transformations (winsorization, industry demeaning)
2. Fit Coefficients: Use `regress.py` to fit on in-sample data
   - Weighted least squares regression to forward returns
   - Separate in-sector / ex-sector fits if needed
   - Rolling window or expanding window
3. Generate Forecasts: Apply coefficients to out-of-sample period
   - Multiply signals by fitted coefficients
   - Combine multiple lags/timeframes
   - Output to HDF5 or CSV
4. Optimize: Run through `opt.py` to get target positions
   - Maximize utility function (alpha - risk - costs)
   - Apply position limits and constraints
   - Factor risk management via Barra model
5. Backtest: Simulate with appropriate engine
   - BSIM: Daily strategies with optimization
   - OSIM: Fill strategy comparison
   - QSIM: Intraday strategies on 30-min bars
   - SSIM: Full lifecycle with cash tracking
6. Analyze: Evaluate performance metrics
   - Sharpe ratio and information ratio
   - Maximum drawdown
   - Factor exposures (13 Barra factors)
   - Turnover and execution costs
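The coefficient-fitting and forecasting steps in this workflow can be illustrated with a single-signal weighted least squares fit in plain Python. This is a sketch only: regress.py is described as using statsmodels WLS, whereas `wls_fit` below is a hypothetical closed-form helper, and the signals, forward returns and weights are made-up numbers.

```python
def wls_fit(x, y, w):
    """Weighted least squares fit of y = intercept + slope * x."""
    sw = float(sum(w))
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# In-sample: signal values, forward returns, and per-observation weights
signal_in = [-1.0, 0.0, 1.0, 2.0]
fwd_ret = [0.003, 0.001, -0.001, -0.003]   # exactly 0.001 - 0.002 * signal
weights = [1.0, 2.0, 2.0, 1.0]
slope, intercept = wls_fit(signal_in, fwd_ret, weights)

# Out-of-sample: multiply new signals by the fitted coefficient
signal_out = [0.5, -0.5]
forecast = [slope * s for s in signal_out]
print(round(slope, 6), [round(f, 6) for f in forecast])
```

The negative slope here mirrors a mean-reversion signal: a high signal value predicts a lower forward return.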
| Engine | Use Case | Granularity | Execution Model |
|---|---|---|---|
| BSIM | Daily strategies | Daily | Optimized positions |
| OSIM | Fill analysis | Order-level | VWAP/mid/close fills |
| QSIM | Intraday strategies | 30-min bars | Time-of-day analysis |
| SSIM | Full system | Daily + intraday | Complete lifecycle |
All engines provide:
- P&L: Daily and cumulative
- Sharpe Ratio: Risk-adjusted returns
- Drawdown: Maximum peak-to-trough decline
- Turnover: Average daily traded notional
- Factor Exposures: Barra factor bets over time
- Execution Quality: Realized vs. estimated costs
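Two of these metrics are easy to sketch in plain Python. The annualization factor of 252 trading days, a zero risk-free rate, and measuring drawdown in P&L units rather than percent are assumptions of this example; the simulation engines may define them differently.

```python
import math
import statistics

def sharpe_ratio(daily_returns, periods_per_year=252):
    """Annualized mean/std of daily returns (risk-free rate assumed zero)."""
    mean = statistics.mean(daily_returns)
    std = statistics.stdev(daily_returns)
    return mean / std * math.sqrt(periods_per_year)

def max_drawdown(cumulative_pnl):
    """Largest peak-to-trough decline of a cumulative P&L series."""
    peak = cumulative_pnl[0]
    worst = 0.0
    for value in cumulative_pnl:
        peak = max(peak, value)
        worst = max(worst, peak - value)
    return worst

cum_pnl = [0.0, 10.0, 5.0, 12.0, 3.0, 8.0]
print(max_drawdown(cum_pnl))  # 9.0 (peak of 12 down to 3)

daily_returns = [0.01, -0.005, 0.02, 0.0, 0.005]
print(round(sharpe_ratio(daily_returns), 2))
```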
Understanding the dependency graph helps navigate the codebase and debug issues.
loaddata.py (no dependencies)
↓
calc.py (imports: loaddata, util)
↓
regress.py (imports: loaddata, calc, util)
↓
Alpha Strategy Files (imports: loaddata, calc, regress, util)
↓
opt.py (imports: loaddata, calc, util)
↓
Simulation Engines (imports: loaddata, calc, opt, util, regress)
- `loaddata.py` - Pure data loading, only external libraries
- `util.py` - Helper functions, minimal dependencies
- `calc.py` - Imports: loaddata, util
  - Functions: `calc_vol`, `calc_forward_rets`, `calc_factors`, `calc_intra_factors`
  - Used by: Nearly all modules
- `regress.py` - Imports: loaddata, calc, util
  - Functions: `regress_alpha`, `regress_factors`, `regress_daily_multi`
  - Used by: All alpha strategies
- `pca.py` - Imports: loaddata, calc, util
  - Functions: `calc_pca_daily`, `calc_pca_intra`
  - Used by: pca_generator strategies
All strategy files import the core stack:

```python
from loaddata import *
from calc import *
from regress import *
from util import *
```

Strategy Groups:
- High-Low: `hl.py`, `hl_intra.py`, `qhl_*.py`
- Beta-Adjusted: `bd.py`, `bd1.py`, `bd_intra.py`, `badj_*.py`
- Analyst: `analyst.py`, `analyst_badj.py`, `rating_diff*.py`
- Earnings: `eps.py`, `target.py`
- Volume: `vadj*.py`
- Other: `c2o.py`, `mom_year.py`, `ebs.py`, `htb.py`, `rrb.py`
- `opt.py` - Imports: loaddata, calc, util
  - Functions: `optimize_cplex`, `optimize_alpha`
  - Used by: All simulation engines
  - External: OpenOpt, FuncDesigner
- `bsim_weights.py` - Imports: loaddata, opt, util
  - Functions: Grid search weight optimization
  - Uses: BSIM engine internally
- `bsim.py` - Imports: loaddata, calc, regress, opt, util
  - Main simulation orchestrator
  - Loads alpha forecasts from strategy outputs
- `osim.py` - Imports: loaddata, calc, opt, util
  - Order-level execution analysis
- `qsim.py` - Imports: loaddata, calc, opt, util, regress
  - Intraday bar simulation
- `ssim.py` - Imports: loaddata, calc, regress, opt, util
  - Full lifecycle tracking
- `prod_sal.py` - Analyst estimate production pipeline
- `prod_eps.py` - Earnings signal production pipeline
- `prod_rtg.py` - Rating signal production pipeline
- `prod_tgt.py` - Target signal production pipeline
- Circular Dependencies: None identified in current codebase
  - Clean hierarchical structure prevents cycles
- Wildcard Imports: Common pattern `from module import *`
  - Used throughout for convenience
  - Be aware of namespace pollution
  - Key functions documented in each module
- Module Loading Order:
  ```python
  # Correct order for manual imports
  import loaddata
  import util
  import calc
  import regress
  import opt
  # Then strategy or simulation modules
  ```
- External Dependencies:
  - OpenOpt/FuncDesigner: Only in `opt.py` and `bsim_weights.py`
  - scikit-learn: Used in `calc.py` for PCA
  - statsmodels: Used in `regress.py` for WLS
  - MySQL: Only in `loaddata.py` if using SQL data sources
Simplified Structure (no cross-dependencies with main codebase):
salamander/loaddata.py (standalone)
↓
salamander/calc.py
↓
salamander/regress.py
↓
salamander/opt.py
↓
salamander/simulation.py
↓
salamander/bsim.py, osim.py, qsim.py, ssim.py
Key Difference: Salamander uses simulation.py as shared library instead of duplicating simulation code.
Edit in `loaddata.py`:

```python
# Tradable universe
t_low_price = 2.0
t_high_price = 500.0
t_min_advp = 1000000.0  # $1M min ADV
# Expandable universe
e_low_price = 2.25
e_high_price = 500.0
e_min_advp = 5000000.0  # $5M min ADV
# Universe size
uni_size = 1400  # Top N by market cap
```

Edit in `opt.py`:

```python
max_sumnot = 50.0e6  # $50M max notional
max_posnot = 0.0048  # 0.48% max per position
kappa = 4.3e-5       # Risk aversion
# Slippage model
slip_alpha = 1.0     # Base cost
slip_beta = 0.6      # Participation power
slip_delta = 0.25    # Participation coefficient
slip_nu = 0.14       # Market impact
execFee = 0.00015    # 1.5 bps execution fee
```

Edit in `calc.py`:

```python
BARRA_FACTORS = ['country', 'growth', 'size', 'sizenl',
                 'divyild', 'btop', 'earnyild', 'beta',
                 'resvol', 'betanl', 'momentum', 'leverage',
                 'liquidty']
PROP_FACTORS = ['srisk_pct_z', 'rating_mean_z']
```

The codebase contains 88 Python files (~35,000 lines) organized into the following categories:
Core Infrastructure (4 files):
- `loaddata.py` (1,135 lines) - Data loading from CSV/SQL, universe filtering, HDF5 caching
- `calc.py` (1,401 lines) - Forward returns, volume profiles, winsorization, Barra factor calculations
- `regress.py` (489 lines) - Weighted least squares regression for alpha factor fitting
- `util.py` (585 lines) - Data merging, filtering, and I/O helper functions

Simulation Engines (4 files):
- `bsim.py` (730 lines) - Daily rebalancing backtest with portfolio optimization
- `osim.py` (640 lines) - Order-level execution simulator with fill strategy analysis
- `qsim.py` (531 lines) - Intraday 30-minute bar simulation for high-frequency strategies
- `ssim.py` (585 lines) - Full lifecycle simulator with position and cash tracking

Portfolio Optimization (3 files):
- `opt.py` (707 lines) - OpenOpt NLP solver with factor risk and transaction costs
- `bsim_weights.py` (246 lines) - Multi-alpha weight optimization using grid search
- `pca.py` (307 lines) - Principal component decomposition for market-neutral returns
High-Low Mean Reversion (6 files):
- `hl.py` (372 lines) - Base high-low strategy with daily + intraday signals
- `hl_intra.py` (183 lines) - Intraday-only variant
- `qhl_intra.py` (183 lines) - Quote-level intraday variant
- `qhl_multi.py` (161 lines) - Multi-period daily signals
- `qhl_both.py` (181 lines) - Combined daily + intraday
- `qhl_both_i.py` (182 lines) - Industry-specific variant

Beta-Adjusted Order Flow (9 files):
- `bd.py` (734 lines) - Base beta-adjusted strategy with order imbalance
- `bd1.py` (166 lines) - Simplified daily variant
- `bd_intra.py` (215 lines) - Intraday order flow signals
- `badj_multi.py` (165 lines) - Multi-lag return-based variant
- `badj_intra.py` (139 lines) - Intraday return variant
- `badj_both.py` (180 lines) - Combined daily + intraday returns
- `badj_dow_multi.py` (165 lines) - Day-of-week specific model
- `badj2_multi.py` (165 lines) - Alternative beta adjustment
- `badj2_intra.py` (139 lines) - Alternative intraday variant

Analyst Signals (4 files):
- `analyst.py` (313 lines) - Base analyst rating/estimate strategy
- `analyst_badj.py` (306 lines) - Beta-adjusted analyst signals
- `rating_diff.py` (222 lines) - Rating change momentum
- `rating_diff_updn.py` (189 lines) - Separate up/down revisions

Earnings & Valuation (3 files):
- `eps.py` (146 lines) - Post-earnings announcement drift (PEAD)
- `target.py` (218 lines) - Analyst price target deviations
- `prod_tgt.py` (252 lines) - Production target strategy

Volume-Adjusted (5 files):
- `vadj.py` (226 lines) - Base volume-return interaction strategy
- `vadj_multi.py` (204 lines) - Daily-only multi-period variant
- `vadj_intra.py` (146 lines) - Intraday volume signals
- `vadj_pos.py` (197 lines) - Position sizing emphasis
- `vadj_old.py` (155 lines) - Legacy implementation (deprecated)

PCA & Residuals (3 files):
- `pca_generator.py` (80 lines) - Intraday PCA residual extraction
- `pca_generator_daily.py` (81 lines) - Daily PCA with exponential weighting
- `rrb.py` (157 lines) - Barra residual return betting

Other Specialized (5 files):
- `c2o.py` (217 lines) - Close-to-open gap trading
- `mom_year.py` (92 lines) - 232-day momentum strategy
- `ebs.py` (221 lines) - Analyst estimate revision signals
- `htb.py` (116 lines) - Hard-to-borrow fee rate strategy
- `badj_rating.py` (line count unknown) - Beta-adjusted rating strategy
Production (4 files):
- `prod_sal.py` (297 lines) - Production estimate signal generator
- `prod_eps.py` (336 lines) - Production earnings signal generator
- `prod_rtg.py` (311 lines) - Production rating signal generator
- `prod_tgt.py` (252 lines) - Production target signal generator

Testing & Utilities:
- `bigsim_test.py` - Testing framework for bsim
- `osim_simple.py` - Simplified order simulator
- `osim2.py` - Alternative order simulator
- `readcsv.py` - CSV data validation
- `dumpall.py` - Bulk data export utility
- `factors.py` - Factor analysis utilities
- `slip.py` - Slippage model testing
- `setup.py` - Cython build configuration
- Additional utilities: `new1.py`, `other.py`, `other2.py`, `rev.py`, `bsz.py`, `bsz1.py`, `load_data_live.py`
Salamander module (`salamander/`):

Core Infrastructure (5 files):
- `loaddata.py` (287 lines) - CSV-based data loading
- `loaddata_sql.py` (317 lines) - SQL database integration
- `calc.py` (473 lines) - Factor calculations (simplified)
- `regress.py` (165 lines) - Regression fitting
- `util.py` (260 lines) - 18 utility functions

Simulation Engines (5 files):
- `bsim.py` (260 lines) - Standalone daily simulator
- `osim.py` (224 lines) - Standalone order simulator
- `qsim.py` (409 lines) - Standalone intraday simulator
- `ssim.py` (271 lines) - Standalone lifecycle simulator
- `simulation.py` (382 lines) - Core simulation library

Optimization (1 file):
- `opt.py` (379 lines) - Simplified portfolio optimization

Workflow Generators (3 files):
- `gen_dir.py` (17 lines) - Directory structure generator
- `gen_hl.py` (23 lines) - HL signal generator
- `gen_alpha.py` (32 lines) - Alpha file extractor

Strategy Implementations (2 files):
- `hl.py` (124 lines) - High-low prototype
- `hl_csv.py` (153 lines) - Production HL with CSV data

Utilities & Validation (9 files):
- `change_hl.py` (12 lines) - HDF5 date format converter
- `check_hl.py` (25 lines) - HL signal validator
- `check_all.py` (19 lines) - HDF5 dataset inspector
- `change_raw.py` (118 lines) - Raw data augmentation
- `mktcalendar.py` (20 lines) - US trading calendar
- `get_borrow.py` (18 lines) - Borrow rate aggregator
- `show_borrow.py` (9 lines) - Borrow data inspector
- `show_raw.py` (25 lines) - Raw data inspector
- `README.md` (451 lines) - Comprehensive module documentation
statarb/
├── README.md # Project overview and guide
├── CLAUDE.md # AI assistant instructions
├── LOG.md # Documentation changelog
├── requirements.txt # Python 2.7 dependencies
├── setup.py # Cython optimization build
│
├── Core Infrastructure/
│ ├── loaddata.py # Data loading & universe filtering
│ ├── calc.py # Returns & factor calculations
│ ├── regress.py # Alpha coefficient fitting
│ ├── util.py # Helper functions
│
├── Simulation Engines/
│ ├── bsim.py # Daily rebalancing backtest
│ ├── osim.py # Order-level execution
│ ├── qsim.py # Intraday 30-min bars
│ ├── ssim.py # Full lifecycle tracking
│
├── Portfolio Optimization/
│ ├── opt.py # NLP optimizer
│ ├── bsim_weights.py # Weight optimization
│ ├── pca.py # PCA decomposition
│
├── Alpha Strategies/
│ ├── High-Low/ # 6 mean reversion variants
│ ├── Beta-Adjusted/ # 9 order flow variants
│ ├── Analyst/ # 4 fundamental signal variants
│ ├── Earnings/ # 3 event-driven variants
│ ├── Volume/ # 5 liquidity-aware variants
│ ├── PCA/ # 3 residual variants
│ └── Other/ # 5 specialized strategies
│
├── Production/
│ ├── prod_sal.py # Estimate signal production
│ ├── prod_eps.py # Earnings signal production
│ ├── prod_rtg.py # Rating signal production
│ └── prod_tgt.py # Target signal production
│
├── Testing & Utilities/ # ~13 testing/validation scripts
│
├── plan/ # Documentation plans (11 files)
│
└── salamander/ # Python 3 standalone module
├── README.md # Module documentation
├── requirements.txt # Python 3 dependencies
├── Core/ # 6 infrastructure files
├── Simulation/ # 5 engine files
├── Generators/ # 3 workflow files
├── Strategies/ # 2 HL implementations
└── Utilities/ # 9 validation scripts
The salamander/ directory contains a standalone, simplified version of the system for easier deployment and development.
- Modular directory structure
- Simplified alpha generation pipeline
- Standalone backtest engine
- Documented workflow in `instructions.txt`
```bash
# 1. Create directory structure
python3 salamander/gen_dir.py --dir=/path/to/data
# 2. Generate alpha signals from raw data
python3 salamander/gen_hl.py \
  --start=20100630 \
  --end=20130630 \
  --dir=/path/to/data
# 3. Create alpha signal files
python3 salamander/gen_alpha.py \
  --start=20100630 \
  --end=20130630 \
  --dir=/path/to/data
# 4. Run backtest
python3 salamander/bsim.py \
  --start=20130101 \
  --end=20130630 \
  --dir=/path/to/data \
  --fcast=hl:1:1
```

```
data/
├── all/          # Alpha signal files
├── hl/           # High-low strategy files
├── locates/      # Short borrow data (borrow.csv)
├── opt/          # Optimization outputs
├── blotter/      # Trade records
├── raw/          # Raw market data
└── all_graphs/   # Visualization outputs
```
Four dedicated production modules generate alpha signals for live trading:

- `prod_sal.py` - Analyst Estimate Signals
  - Data: I/B/E/S analyst estimates
  - Signal: Estimate revisions and dispersion
  - Frequency: Daily updates
  - Output: `sal` forecast column
- `prod_eps.py` - Earnings Signals
  - Data: Earnings announcements and surprises
  - Signal: Post-earnings announcement drift (PEAD)
  - Frequency: Quarterly (event-driven)
  - Output: `eps` forecast column
- `prod_rtg.py` - Rating Signals
  - Data: Analyst rating changes
  - Signal: Rating revisions with cubed amplification
  - Frequency: Event-driven (rating changes)
  - Output: `rtg` forecast column
- `prod_tgt.py` - Price Target Signals
  - Data: Analyst price targets
  - Signal: Target deviations from current price
  - Frequency: Daily updates
  - Output: `tgt` forecast column
1. Data Ingestion
├── Download IBES estimates → ESTIMATES_BASE_DIR
├── Download earnings data → EARNINGS_BASE_DIR
├── Download price/volume → PRICE_BASE_DIR
└── Download Barra factors → BARRA_BASE_DIR
2. Signal Generation (Daily)
├── prod_sal.py --start=TODAY --end=TODAY → all/sal.h5
├── prod_eps.py --start=TODAY --end=TODAY → all/eps.h5
├── prod_rtg.py --start=TODAY --end=TODAY → all/rtg.h5
└── prod_tgt.py --start=TODAY --end=TODAY → all/tgt.h5
3. Portfolio Optimization
└── opt.py with combined forecasts → target positions
4. Order Generation
└── Compare target vs current positions → order list
5. Execution
└── Send orders to broker/execution system
6. Monitoring
├── Track fill prices vs estimates
├── Monitor factor exposures
└── Calculate realized P&L
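The order-generation step of this workflow (comparing target against current positions) can be sketched as follows. The function name `generate_orders`, the dictionary inputs keyed by sid, and the optional `min_trade` threshold are assumptions for illustration, not part of the production modules.

```python
def generate_orders(target, current, min_trade=0.0):
    """Return {sid: signed trade size} needed to move current to target.

    Positive values are buys and negative values are sells; trades whose
    absolute size does not exceed `min_trade` are skipped.
    """
    orders = {}
    for sid in sorted(set(target) | set(current)):
        delta = target.get(sid, 0.0) - current.get(sid, 0.0)
        if abs(delta) > min_trade:
            orders[sid] = delta
    return orders

# Hypothetical positions in shares, keyed by sid
target = {1: 100.0, 2: -50.0, 3: 0.0}
current = {1: 60.0, 2: -50.0, 4: 25.0}
orders = generate_orders(target, current)
print(orders)  # {1: 40.0, 4: -25.0}
```

Names that appear only in the current book (sid 4 here) are closed out, and unchanged positions generate no order.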
# Production universe (more conservative)
t_low_price = 5.0 # Higher min price
t_high_price = 500.0
t_min_advp = 2000000.0 # Higher liquidity requirement
uni_size = 1000 # Smaller universe (top 1000)
# Expandable universe
e_min_advp = 10000000.0 # Much higher for signal generation
# Production risk controls
kappa = 4.3e-5 # Conservative risk aversion
max_sumnot = 50.0e6 # $50M capital
max_posnot = 0.0048 # Max 0.48% per position
max_participation = 0.015 # Max 1.5% of ADV
# Realistic transaction costs
slip_nu = 0.18 # Higher market impact
execFee = 0.00015 # 1.5 bps execution fee
# Alert thresholds
MAX_DRAWDOWN = 0.05 # 5% max drawdown
MAX_FACTOR_EXPOSURE = 0.5 # Max factor bet
MAX_INDUSTRY_EXPOSURE = 0.1 # Max 10% in one industry
MIN_SHARPE = 1.5 # Minimum acceptable Sharpe
- Verify data freshness (prices, Barra, estimates)
- Run production signal generators (4 modules)
- Check signal distributions (no extreme outliers)
- Run portfolio optimization
- Review target positions vs. current
- Validate factor exposures (neutral to Barra factors)
- Generate order list
- Review slippage estimates
- Execute orders
- Monitor fills and update positions
- Calculate EOD P&L
- Archive results and logs
- Analyze Sharpe ratio trend
- Review factor exposures over time
- Check alpha decay (are signals still predictive?)
- Validate transaction cost estimates vs. realized
- Review largest winners/losers
- Update universe (corporate actions, delistings)
- Refit regression coefficients (out-of-sample drift)
- Backtest recent period (validation)
- Review alpha combination weights
- Analyze strategy attribution (which alphas contributing?)
- Update risk model (recalculate factor covariances)
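The checklist item "Check signal distributions (no extreme outliers)" could be automated along the following lines. This is a sketch using only the standard library; the function name `check_signal` and the z-score threshold of 5 are assumptions, not values taken from the codebase.

```python
import statistics


def check_signal(values, max_abs_z=5.0):
    """Summarize a cross-section of forecasts and flag extreme z-scores.

    Returns summary statistics plus the indices of values whose absolute
    z-score exceeds max_abs_z (an assumed threshold).
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    outliers = []
    if std > 0:
        outliers = [
            i for i, v in enumerate(values) if abs((v - mean) / std) > max_abs_z
        ]
    return {
        "mean": mean,
        "std": std,
        "min": min(values),
        "max": max(values),
        "outliers": outliers,
    }


# Fifty well-behaved forecasts followed by one obviously bad value.
values = [0.01, -0.02, 0.0, 0.015, -0.01] * 10 + [5.0]
report = check_signal(values)
```

A report with a non-empty `outliers` list would be a reason to stop before optimization and inspect the input data.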
- Data Quality:
  - Validate all data sources before optimization
  - Check for missing values, outliers, stale data
  - Maintain audit trail of data versions
- Signal Validation:
  - Monitor signal distributions (mean, std, extremes)
  - Check for regime changes or structural breaks
  - Compare current vs. historical signal characteristics
- Risk Management:
  - Hard position limits in optimizer (cannot be exceeded)
  - Real-time factor exposure monitoring
  - Circuit breakers for extreme market conditions
  - Diversification across strategy families
- Execution Quality:
  - Compare realized vs. estimated slippage
  - Track implementation shortfall
  - Monitor adverse selection in fills
  - Analyze execution timing (VWAP vs. close)
- System Reliability:
  - Automated data pipeline with fallbacks
  - Redundant optimization runs (validate consistency)
  - Alert system for failures or anomalies
  - Manual review gate before order submission
- Performance Attribution:
  - Decompose P&L by strategy family
  - Track alpha vs. risk vs. costs
  - Identify alpha decay patterns
  - Adjust weights based on recent performance
- Data Loss:
  - Maintain backups of all historical data
  - Cache critical files (Barra, universe, prices)
  - Document data provider contact info
- System Failure:
  - Manual override process documented
  - Backup optimization environment
  - Position reconciliation procedures
- Market Events:
  - Halt trading triggers (volatility spike, flash crash)
  - Emergency liquidation protocol
  - Risk override procedures
The system evaluates strategies using:
- Sharpe Ratio: Risk-adjusted returns (annualized)
- Information Ratio: Active return relative to tracking error against a benchmark
- Maximum Drawdown: Largest peak-to-trough decline
- Turnover: Average daily trading as % of capital
- Hit Rate: Percentage of profitable days
- Factor Exposures: Bets on Barra risk factors
- Participation Rate: Trading volume vs. ADV
- Factor Neutrality: Limits on Barra factor exposures
- Sector Limits: Industry concentration constraints
- Position Sizing: Market cap and liquidity-based limits
- Participation Constraints: Max 1.5% of ADV to minimize impact
- Correlation Monitoring: Rolling 30-day cross-security correlations
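Three of the evaluation metrics listed above (Sharpe ratio, maximum drawdown, hit rate) can be computed from a series of daily returns as sketched below. The 252-day annualization factor, the compounding convention for drawdown, and the function names are assumptions; the simulation engines may use different conventions.

```python
import math
import statistics

TRADING_DAYS = 252  # assumed annualization factor


def sharpe_ratio(daily_returns):
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    mean = statistics.fmean(daily_returns)
    std = statistics.stdev(daily_returns)
    return mean / std * math.sqrt(TRADING_DAYS)


def max_drawdown(daily_returns):
    """Largest peak-to-trough decline of the compounded equity curve."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def hit_rate(daily_returns):
    """Fraction of days with a strictly positive return."""
    return sum(r > 0 for r in daily_returns) / len(daily_returns)


# Hypothetical daily returns, for illustration only.
returns = [0.01, -0.02, 0.005, 0.015, -0.005]
```

With these sample returns the deepest drawdown is the 2% loss on the second day, and three of the five days are profitable.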
To create a new alpha signal:
- Create a new Python file (e.g., my_alpha.py)
- Load data using loaddata.py functions
- Calculate your alpha signal
- Use regress.py to fit coefficients on training data
- Generate out-of-sample forecasts
- Save to HDF5 or CSV for simulation engines
Example structure:
from loaddata import *
from calc import *
from regress import *
# Load data
daily_df = load_prices(start, end, lookback)
barra_df = load_barra(start, end, lookback)
# Calculate alpha
daily_df['my_alpha'] = calculate_my_signal(daily_df)
# Fit regression
fits_df = regress_alpha(daily_df, 'my_alpha', horizon=3)
# Generate forecast
forecast_df = apply_coefficients(daily_df, fits_df)
# Save results
dump_alpha(forecast_df, 'my_alpha')
Combine multiple alphas with optimized weights:
python bsim.py \
--start=20130101 \
--end=20130630 \
--fcast=pca:1.0:0.3,hl:1.2:0.25,bd:0.8:0.2,analyst:1.5:0.15,mom:1.0:0.1
Weights should sum to 1.0 for proper risk attribution.
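A quick way to verify that a --fcast specification satisfies the sum-to-one guideline is to parse it before launching a backtest. The sketch below assumes each entry has the form name:scale:weight, with the third field being the combination weight; that reading of the second field, and the helper names, are assumptions rather than documented behavior of bsim.py.

```python
def parse_fcast(spec):
    """Parse 'name:scale:weight,...' into {name: (scale, weight)}.

    The meaning of the middle field is assumed to be a scaling multiplier.
    """
    forecasts = {}
    for item in spec.split(","):
        name, scale, weight = item.split(":")
        forecasts[name] = (float(scale), float(weight))
    return forecasts


def weights_sum_to_one(forecasts, tol=1e-9):
    """True when the combination weights add up to 1.0 within tol."""
    total = sum(weight for _, weight in forecasts.values())
    return abs(total - 1.0) <= tol


spec = "pca:1.0:0.3,hl:1.2:0.25,bd:0.8:0.2,analyst:1.5:0.15,mom:1.0:0.1"
forecasts = parse_fcast(spec)
```

Comparing with a small tolerance instead of `== 1.0` avoids false alarms from floating-point rounding.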
The system models realistic costs:
- Execution Fees: 1.5 bps fixed
- Slippage: Nonlinear function of participation rate
- Market Impact: Based on order size vs. ADV
- Borrow Costs: For short positions
- Opportunity Cost: From delayed fills
Analyze realized vs. estimated costs using OSIM engine.
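To make the cost components concrete, the sketch below combines the fixed 1.5 bps execution fee with a nonlinear slippage term. The square-root dependence on participation, the volatility scaling, and the reuse of 0.18 as an impact coefficient are assumptions chosen for illustration; the functional form used by the optimizer and OSIM is not specified here and may differ.

```python
EXEC_FEE = 0.00015      # 1.5 bps of traded notional, as listed above
IMPACT_COEF = 0.18      # assumed impact coefficient (illustrative)
IMPACT_EXPONENT = 0.5   # assumed square-root dependence on participation


def estimate_cost(notional, adv_notional, daily_vol):
    """Estimated dollar cost of one trade: fixed fee plus nonlinear slippage.

    notional      signed dollar size of the trade
    adv_notional  average daily dollar volume of the stock
    daily_vol     daily return volatility (e.g. 0.02 for 2%)
    """
    traded = abs(notional)
    participation = traded / adv_notional
    fee = EXEC_FEE * traded
    slippage = IMPACT_COEF * daily_vol * participation ** IMPACT_EXPONENT * traded
    return fee + slippage


# Hypothetical trade: $100k in a stock trading $10M per day with 2% daily vol.
cost = estimate_cost(notional=100_000, adv_notional=10_000_000, daily_vol=0.02)
```

Because the slippage term grows faster than linearly in trade size, doubling the order more than doubles the estimated cost, which is why participation limits matter.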
- Python 2.7 End of Life:
  - Main codebase stuck on Python 2.7 due to OpenOpt dependency
  - OpenOpt no longer maintained (last update: 2014)
  - Mitigation: Salamander module provides Python 3 path
  - Long-term: Migrate to cvxpy, scipy.optimize, or commercial solver
- Incomplete Implementations (Now Resolved):
  - ✅ pca_generator.py - Residual calculation fixed (2026-02-05)
  - ✅ pca_generator_daily.py - Residual extraction enabled (2026-02-05)
  - Status: PCA residual strategies now fully functional
- Code Bugs (Now Resolved):
  - ✅ Fixed 7 bugs in beta-adjusted strategies (2026-02-05)
    - Variable naming errors in bd1.py
    - Syntax errors in badj_intra.py
    - Undefined variable references in badj2_multi.py, badj2_intra.py
  - ✅ Fixed 2 bugs in hl_intra.py (2026-02-05)
    - Empty DataFrame overwrite causing KeyError
    - Undefined variable 'lag' causing NameError
  - Status: All documented runtime bugs fixed
- Misleading Filenames:
  - ebs.py - Actually analyst estimates, not equity borrow
  - Clarification: "SAL" = Analyst estimates, not short availability
  - Action: Consider renaming to sal.py for clarity
- Wildcard Imports:
  - Extensive use of from module import *
  - Makes dependency tracking difficult
  - Potential namespace pollution
  - Best Practice: Use explicit imports in new code
- Limited Error Handling:
  - Many functions lack try/except blocks
  - Data validation minimal in some modules
  - Can fail silently on bad data
  - Improvement: Add data quality checks and error logging
- Documentation Gaps (Now Resolved):
  - ✅ 78 files documented with comprehensive docstrings
  - ✅ All strategy families documented
  - ✅ All simulation engines documented
  - Remaining: Some utility/test files have minimal docs
- Hard-Coded Paths:
  - Some utility scripts have hard-coded file paths
  - Examples: salamander/change_hl.py, salamander/show_borrow.py
  - Fix: Convert to command-line arguments
- HDF5 Caching:
  - First load of data slow (CSV parsing)
  - HDF5 cache significantly speeds up subsequent loads
  - Cache invalidation manual (delete .h5 files)
  - Improvement: Automatic cache invalidation based on source file mtime
- Large Memory Footprint:
  - Loading full universe (1,400 stocks) for date range can exceed 8GB RAM
  - Bar data especially memory-intensive
  - Mitigation: Process in smaller date chunks
- Optimization Speed:
  - OpenOpt solver can be slow (1,500 iterations)
  - Daily optimization takes 5-30 seconds depending on universe size
  - Alternative: Commercial solvers (Gurobi, CPLEX) much faster
- Barra Factor Availability:
  - Requires proprietary Barra data subscription
  - No public alternative readily available
  - Workaround: Can use Fama-French factors or custom factor models
- IBES Database:
  - Analyst data requires expensive IBES subscription
  - Analyst strategies non-functional without it
  - Alternative: Use free alternatives (Yahoo Finance estimates, limited)
- Corporate Actions:
  - Split handling implemented but limited testing
  - Dividend adjustments may have edge cases
  - Testing Needed: Comprehensive corporate action test suite
- Single-Threaded:
  - Most modules single-threaded (Python GIL)
  - Parallel processing limited
  - Improvement: Multiprocessing for cross-sectional calculations
- Universe Size:
  - Optimized for ~1,400 stocks
  - Larger universes (3,000+) may hit memory/speed limits
  - Scaling: Batch processing or distributed computing
- Backtest Length:
  - Multi-year backtests can take hours
  - Output files can exceed 1GB
  - Optimization: Incremental backtests, parallel date ranges
- No Authentication:
  - No user authentication system
  - No access controls on data directories
  - Production: Implement role-based access control
- No Audit Trail:
  - Limited logging of operations
  - No trade audit trail
  - Production: Comprehensive logging and audit system required
- No Backtesting Safeguards:
  - Easy to introduce look-ahead bias
  - Data leakage possible in alpha development
  - Best Practice: Strict in-sample/out-of-sample discipline
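Two simple habits help guard against the look-ahead bias mentioned above: lag every signal before it is used to trade, and split dates into in-sample and out-of-sample periods before fitting anything. The sketch below shows both with plain dictionaries keyed by YYYYMMDD strings; the helper names and data layout are assumptions, not part of the codebase.

```python
def lag_signal(signal_by_date, lag=1):
    """Shift a signal so each date only sees values from `lag` dates earlier.

    Dates without enough history map to None.
    """
    dates = sorted(signal_by_date)
    lagged = {}
    for i, date in enumerate(dates):
        lagged[date] = signal_by_date[dates[i - lag]] if i >= lag else None
    return lagged


def split_in_out_of_sample(dates, split_date):
    """Split dates into (in-sample, out-of-sample) at split_date."""
    ordered = sorted(dates)
    in_sample = [d for d in ordered if d < split_date]
    out_of_sample = [d for d in ordered if d >= split_date]
    return in_sample, out_of_sample


# Hypothetical signal values for three consecutive trading days.
signal = {"20130102": 0.4, "20130103": -0.1, "20130104": 0.2}
lagged = lag_signal(signal)
in_sample, out_of_sample = split_in_out_of_sample(signal, "20130104")
```

YYYYMMDD strings sort chronologically, so no date parsing is needed for the ordering to be correct.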
Several files have unclear purpose and need investigation:
- new1.py - Unknown purpose
- other.py, other2.py - Unclear functionality
- rev.py - Likely reversal strategy, undocumented
- bsz.py, bsz1.py - Unknown (batch size related?)
- osim2.py - Alternative osim implementation?
- badj_rating.py - Beta-adjusted ratings (incomplete?)
- factors.py - Factor analysis utilities?
- slip.py - Slippage testing?
Action Required: Code review and documentation or removal if obsolete.
This is a research codebase under active documentation. Key areas for improvement:
- Python 3 Migration: Replace OpenOpt with modern solver (cvxpy, scipy)
- Testing Suite: Add unit tests and integration tests
- Data Pipeline: Modernize data ingestion (APIs instead of CSV)
- Performance: Parallelize cross-sectional calculations
- Documentation: Complete remaining utility scripts
- Additional Alpha Strategies: Machine learning alpha generation
- Enhanced Optimization: Multi-period optimization, convex relaxations
- Real-Time Data: Streaming market data integration
- Execution: Improved execution modeling (order book dynamics)
- Monitoring: Real-time dashboards and alerting
- Web Interface: Dashboard for backtest visualization
- Cloud Deployment: Containerization (Docker) and cloud infrastructure
- Alternative Data: Incorporate sentiment, satellite, web scraping
- Risk Models: Support for alternative factor models (Fama-French, custom)
- Follow PEP 8 for new code (legacy code may not comply)
- Add comprehensive docstrings (NumPy style)
- Include usage examples in docstrings
- Use explicit imports instead of wildcards
- Add type hints in Python 3 code (salamander module)
- Write unit tests for new functions
- Document data requirements and assumptions
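As an illustration of these guidelines, the function below uses explicit imports, type hints, a NumPy-style docstring with a usage example, and input validation. The `winsorize` helper itself is hypothetical and not part of the codebase.

```python
from typing import List, Sequence


def winsorize(values: Sequence[float], lower: float, upper: float) -> List[float]:
    """Clip values to the closed interval [lower, upper].

    Parameters
    ----------
    values : Sequence[float]
        Raw signal values.
    lower : float
        Smallest value allowed in the output.
    upper : float
        Largest value allowed in the output.

    Returns
    -------
    List[float]
        Values with everything outside the interval pulled to its edge.

    Raises
    ------
    ValueError
        If lower is greater than upper.

    Examples
    --------
    >>> winsorize([-3.0, 0.5, 4.0], -1.0, 1.0)
    [-1.0, 0.5, 1.0]
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    return [min(max(v, lower), upper) for v in values]
```

The docstring example doubles as a doctest, so it can be checked with `python -m doctest` alongside regular unit tests.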
Apache 2.0
For questions and support, please open an issue on GitHub.
Disclaimer: This system is for research and educational purposes. Use at your own risk. Past performance does not guarantee future results. Trading involves substantial risk of loss.