Alpha Engine is a reinforcement learning system that automatically discovers interpretable financial alpha expressions. Rather than training a black-box predictive model, the agent learns to compose symbolic formulas — human-readable mathematical expressions — that serve as trading signals (alpha factors).
The core loop works as follows:
- The RL agent observes the current state (stack contents, market regime features).
- It selects a token from a Domain-Specific Language (DSL): an operand (`Close`, `Volume`), a constant (`2`, `0.5`), an operator (`sub`, `ts_mean`), or a pre-built alpha primitive (`alpha_mom`, `alpha_rsi`).
- Tokens are applied to a stack in Reverse Polish Notation (RPN) style, progressively building an expression tree.
- When the stack contains exactly one complete expression, the episode ends and the expression is evaluated on historical OHLCV data.
- The reward is the annualized Sharpe ratio of the strategy implied by the expression, averaged across multiple tickers.
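The stack mechanics of the loop above can be illustrated with a toy builder. This is a minimal sketch rather than the project's environment: the `ARITY` table and `build_rpn` function are hypothetical names, windows are pushed as plain constants instead of being chosen through separate window-selection actions, and expressions are built as strings instead of tree nodes.

```python
# Hypothetical arity table: how many stack items each token consumes.
ARITY = {
    "Close": 0, "Volume": 0, "5": 0, "20": 0,  # operands and constants
    "neg": 1,                                   # unary operator
    "sub": 2, "ts_mean": 2,                     # binary op; ts op takes (series, window)
}


def build_rpn(tokens):
    """Apply tokens to a stack in RPN order and return the final stack."""
    stack = []
    for tok in tokens:
        n = ARITY[tok]
        if len(stack) < n:
            raise ValueError(f"not enough operands on the stack for {tok!r}")
        if n == 0:
            stack.append(tok)          # leaves are pushed as-is
        else:
            args = stack[-n:]          # the n most recent sub-expressions
            del stack[-n:]
            stack.append(f"{tok}({', '.join(args)})")
    return stack


tokens = ["Close", "5", "ts_mean", "Close", "20", "ts_mean", "sub"]
stack = build_rpn(tokens)
print(stack)
```

An episode would end here because the stack holds exactly one complete expression.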
The result is a library of interpretable formulas like:
sub(ts_mean(Close, 5), ts_mean(Close, 20)) → short-term vs long-term momentum
alpha_mean_rev(10) → 10-day mean reversion signal
mul(alpha_mom(20), neg(alpha_vol_surge(10))) → momentum filtered by volume
These expressions can be evaluated directly on new data, inspected by a human, and composed into trading strategies.
PIPELINE_CLEAN/
│
├── DSL/ # Domain-Specific Language
│ ├── utils.py # Token types, operators, math functions, alpha primitives
│ ├── node.py # ExpressionNode — tree node with recursive evaluation
│ ├── expression.py # Expression — wrapper around root node
│ └── dsl.py # DSL — registry of all available tokens
│
├── agent/
│ └── agentPPO_enhanced.py # PPO agent with GAE, entropy regularization, multi-episode batching
│
├── environment/
│ └── multi_env_v2.py # Multi-ticker market-aware environment (production-aligned)
│
├── train/
│ └── train_PPO_enh_multi_v2.py # Training script for V2 environment
│
├── alpha_trading/ # Post-training alpha management
│ ├── alpha_record.py # AlphaRecord dataclass (expression + performance metadata)
│ ├── alpha_library.py # AlphaLibrary — store, load, rank, evaluate alphas
│ ├── generate_signals.py # Evaluate alphas on new data → trading signals
│ └── save_alphas.py # Persist best alphas from training history
│
├── simulations/
│ └── animate_episode.py # Animate agent inference and stack building
│
├── data_pipeline/ # Data acquisition utilities
│ ├── download_sp500_history.py # Download S&P 500 historical components from GitHub
│ ├── sp500_tickers.py # Extract S&P 500 constituents at a historical date
│ └── download_ohlcv.py # Download OHLCV data via yfinance
│
├── intra_tests/ # Unit/integration tests
│ ├── dsl_test.py # Tests for DSL tokens and operators
│ ├── expr_test.py # Tests for expression building and evaluation
│ ├── env_test.py # Tests for environments
│ ├── agentPPO_test.py # Tests for PPO agent
│ └── agentREINF_test.py # Tests for REINFORCE agent
│
├── data/
│ ├── ohlcv_data.csv # Historical OHLCV data (extracted with download script)
│ ├── data_utils_info&components/ # Info and utilities used for data preparation
│ ├── test/
│ │ └── TEST_ohlcv_data.csv # Test OHLCV data for inference
│ └── train/
│   └── ohlcv_data.csv # Training OHLCV data
│
├── alphas/
│ └── batch_v2_alpha_library.json # Discovered alpha expressions from training
│
├── models/ # Saved agent checkpoints and training histories
│ ├── agent_batch_v2_primitives.pt
│ └── batch_v2_training_history.pkl
│
├── figures/ # Output directory for visualization plots
└── results/ # Output directory for analysis results
The DSL defines all the building blocks the agent can use to construct expressions.
Defines the fundamental types:
| Token Type | Description | Examples |
|---|---|---|
| `OPERAND` | Raw OHLCV columns | `Close`, `Open`, `High`, `Low`, `Volume` |
| `CONSTANT` | Numeric literals | `-1`, `0`, `0.5`, `1`, `2` |
| `UNARY_OP` | Single-input transformations | `neg(x)`, `abs(x)`, `log(x)`, `sign(x)` |
| `BINARY_OP` | Two-input combinations | `add(x,y)`, `sub(x,y)`, `mul(x,y)`, `div(x,y)`, `max(x,y)`, `min(x,y)` |
| `TS_OP` | Time-series operators with a window | `ts_mean(x,d)`, `ts_std(x,d)`, `delay(x,d)`, `delta(x,d)`, `ts_rank(x,d)`, `ts_zscore(x,d)`, `returns(x,d)` |
Additionally, known-good alpha primitives are defined as standalone functions:
- Momentum family: `momentum`, `volatility_adjusted_momentum`, `price_acceleration`
- Mean reversion family: `mean_reversion`, `relative_strength`, `bollinger_position`
- Volume family: `volume_price_trend`, `volume_surge`, `price_volume_divergence`
- Price action family: `overnight_gap`, `intraday_range`, `close_location`
All math functions include safety guards (safe_div returns 0 instead of dividing by zero, safe_log handles non-positive inputs).
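A scalar sketch of such guards is shown below. The zero-denominator behavior of `safe_div` follows the description above; the way `safe_log` treats non-positive inputs is an assumption, since the text only says they are handled. The project's versions presumably operate on pandas Series rather than single floats.

```python
import math


def safe_div(x, y):
    """Return x / y, or 0.0 when the denominator is zero (as described above)."""
    return 0.0 if y == 0 else x / y


def safe_log(x, eps=1e-8):
    """Logarithm that tolerates non-positive inputs.

    ASSUMPTION: log(|x| + eps) is one common guard; the project may
    instead clip, mask, or return 0 for non-positive values.
    """
    return math.log(abs(x) + eps)


print(safe_div(1.0, 0.0), safe_div(6.0, 3.0), safe_log(0.0))
```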
ExpressionNode is a recursive tree node. Each node holds:
- A `Token` (what operation or data it represents)
- A list of `children` (sub-expressions, empty for leaf nodes)
- An optional `window` parameter (for time-series operators)
Key methods:
- `evaluate(data)`: Recursively evaluates the subtree on a DataFrame, dispatching by token type.
- `to_string()`: Produces a human-readable formula like `ts_mean(sub(Close, Open), 5)`.
- `depth()` / `size()`: Measure expression complexity.
Alpha primitives are treated as special leaf nodes: they have no children in the tree but internally pull multiple columns from the DataFrame and apply their own logic.
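The recursive structure can be sketched with a toy node class. `Node` below is a hypothetical stand-in for `ExpressionNode`: it supports only raw columns, `sub`, and `ts_mean`, takes a dict of lists instead of a DataFrame, and averages partial windows at the start of the series (an assumption; the real operator may return NaN there).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Toy expression-tree node; data is a dict of column name -> list of floats."""
    token: str
    children: List["Node"] = field(default_factory=list)
    window: Optional[int] = None

    def evaluate(self, data: Dict[str, List[float]]) -> List[float]:
        if not self.children:                       # leaf: raw column
            return list(data[self.token])
        vals = [c.evaluate(data) for c in self.children]
        if self.token == "sub":
            return [a - b for a, b in zip(vals[0], vals[1])]
        if self.token == "ts_mean":                 # trailing mean over `window` rows
            x, out = vals[0], []
            for i in range(len(x)):
                win = x[max(0, i - self.window + 1): i + 1]
                out.append(sum(win) / len(win))
            return out
        raise ValueError(f"unknown token {self.token!r}")

    def to_string(self) -> str:
        if not self.children:
            return self.token
        args = [c.to_string() for c in self.children]
        if self.window is not None:
            args.append(str(self.window))
        return f"{self.token}({', '.join(args)})"

    def depth(self) -> int:
        return 1 + max((c.depth() for c in self.children), default=0)

    def size(self) -> int:
        return 1 + sum(c.size() for c in self.children)


expr = Node("ts_mean", [Node("sub", [Node("Close"), Node("Open")])], window=2)
data = {"Close": [10.0, 12.0, 11.0], "Open": [9.0, 10.0, 12.0]}  # made-up values
print(expr.to_string(), expr.evaluate(data), expr.depth(), expr.size())
```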
Expression wraps a root ExpressionNode and provides a clean API:
- `evaluate(data) → pd.Series`: Evaluates the full tree, replaces infinities and NaNs with 0.
- `to_string()`: Formula string.
- `complexity()`: Node count (can be used as a regularization penalty).
A helper function build_expression(dsl, token_name, children, window) simplifies manual construction.
The DSL class is the central registry that holds all available tokens. On initialization, it creates:
- 5 operands, 5 constants, 4 unary ops, 6 binary ops, 9 time-series ops
- 12 alpha primitives via `CompoundAlphaOperator`: these are pre-built factors encapsulating known-good alpha logic (e.g., RSI, Bollinger position, volume-price trend). They accept a window parameter and internally extract the required OHLCV columns.
CompoundAlphaOperator extends Token with token_type = TS_OP so it fits into the existing action space, but it pulls data directly from the DataFrame rather than operating on child nodes.
Methods:
- `get_all_tokens()`: Returns all tokens grouped by category (including `alpha_primitives`).
- `get_token(name)`: Looks up a token by name.
- `get_action_space()`: Flat list of all token names for the RL agent.
- `get_token_info()`: Summary table of all tokens as a DataFrame.
The agent uses Proximal Policy Optimization (PPO) with several enhancements over vanilla implementations.
A shared-trunk architecture with two heads:
State (dim=26) → [Linear → LayerNorm → ReLU] × 2 → Shared Features (dim=H)
├─→ Actor Head → Action Probabilities (dim=A)
└─→ Critic Head → State Value (dim=1)
Both heads have an additional hidden layer (H/2 neurons) for increased representational capacity. An action mask is applied to the actor logits before softmax, setting invalid actions to -inf so they receive zero probability.
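The masking step can be sketched in plain Python. The agent itself presumably does this on tensors; `masked_softmax` is a hypothetical helper that only illustrates why a logit of -inf yields exactly zero probability.

```python
import math


def masked_softmax(logits, mask):
    """Softmax over logits; entries with mask[i] == False get probability 0."""
    masked = [l if ok else -math.inf for l, ok in zip(logits, mask)]
    m = max(masked)                       # subtract the max for numerical stability
    if m == -math.inf:
        raise ValueError("at least one action must be valid")
    exps = [math.exp(l - m) for l in masked]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# The last two actions are invalid, even though one has the largest raw logit.
probs = masked_softmax([2.0, 1.0, 0.5, 3.0], [True, True, False, False])
print(probs)
```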
| Feature | Description |
|---|---|
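| GAE (Generalized Advantage Estimation) | Computes advantages with the recursion $A_t = \delta_t + \gamma \lambda A_{t+1}$, where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD residual. $\lambda$ is `gae_lambda` (default 0.95) and controls the bias-variance trade-off. |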
| Entropy Regularization | An entropy bonus is added to the loss, encouraging the policy to maintain exploration and avoid premature convergence on a narrow set of formulas. |
| Multi-Episode Batching | The agent collects batch_episodes complete episodes before performing a gradient update, providing richer and more stable gradient signals. |
| Clipped Surrogate Objective | The standard PPO clipping with eps_clip (default 0.2), preventing the policy from changing too aggressively in a single update. |
| Gradient Clipping | `clip_grad_norm_` with `max_norm=0.5` prevents exploding gradients. |
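The GAE row can be made concrete with the standard backward recursion. This is a generic sketch with the defaults quoted in this document (`gamma=0.99`, `gae_lambda=0.95`), not the agent's actual code; `compute_gae` and its argument layout are assumptions, and the value after a terminal step is taken to be zero.

```python
def compute_gae(rewards, values, dones, gamma=0.99, gae_lambda=0.95):
    """Return (advantages, returns) for one buffer of transitions.

    values[t] is the critic's estimate V(s_t); dones[t] marks the last
    step of an episode, after which no value is bootstrapped.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * gae_lambda * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# One 3-step episode with a reward only at the end (made-up numbers).
adv, ret = compute_gae([0.0, 0.0, 1.2], [0.5, 0.6, 0.8], [False, False, True])
print(adv, ret)
```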
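The combined loss is:
$L = L_{\text{policy}} + c_v L_{\text{value}} - c_e H(\pi)$
where $L_{\text{policy}}$ is the clipped surrogate policy loss, $L_{\text{value}}$ is the critic's value loss, $H(\pi)$ is the policy entropy, $c_e$ is `entropy_coef`, and $c_v$ is the value-loss weight.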
Interface (compatible with a REINFORCE agent for easy swapping):
- `select_action(state, action_mask) → int`
- `store_reward(reward)` / `store_done(done)`
- `should_update() → bool`: Returns `True` when `batch_episodes` episodes have been collected.
- `update() → float`: Runs `k_epochs` update passes on the buffered data, returns average loss.
- `save(path)` / `load(path)`
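A training loop written against this interface might look like the sketch below. `RandomAgent` is a hypothetical stub that only mimics the method names listed above (it picks random valid actions and does no learning), and the inline "environment" is a placeholder with a fixed episode length and a terminal-only reward.

```python
import random


class RandomAgent:
    """Hypothetical stub exposing the interface above; not a real PPO agent."""

    def __init__(self, batch_episodes=4):
        self.batch_episodes = batch_episodes
        self.rewards, self.dones = [], []

    def select_action(self, state, action_mask):
        valid = [i for i, ok in enumerate(action_mask) if ok]
        return random.choice(valid)

    def store_reward(self, reward):
        self.rewards.append(reward)

    def store_done(self, done):
        self.dones.append(done)

    def should_update(self):
        # One True in `dones` per finished episode.
        return sum(self.dones) >= self.batch_episodes

    def update(self):
        # A real agent would run k_epochs of PPO here; the stub just clears its buffer.
        avg = sum(self.rewards) / len(self.rewards)
        self.rewards.clear()
        self.dones.clear()
        return avg


def run(agent, n_episodes=8, steps_per_episode=3, n_actions=5):
    """Drive the agent through placeholder episodes; return the number of updates."""
    n_updates = 0
    for _ in range(n_episodes):
        for step in range(steps_per_episode):
            state = [0.0] * 26                  # 26-dim state, as in this document
            mask = [True] * n_actions
            agent.select_action(state, mask)
            done = step == steps_per_episode - 1
            agent.store_reward(1.0 if done else 0.0)   # reward only at episode end
            agent.store_done(done)
        if agent.should_update():
            agent.update()
            n_updates += 1
    return n_updates


updates = run(RandomAgent(batch_episodes=4))
print(updates)
```

Because the loop only touches these methods, any agent exposing the same methods can be substituted without changing it.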
The production-aligned multi-ticker environment used by the main training script.
The agent builds an expression that is evaluated on multiple tickers simultaneously. A random date is sampled per episode, defining the market regime observed by the agent. The key insight: an alpha that works across ~19 diverse stocks is more likely to be real signal than one tuned to a single stock.
Key design features:
| Aspect | Description |
|---|---|
| Episode sampling | Random date → defines market state; reward uses rolling Sharpe over the full evaluation range |
| State | 16 market regime features + 10 stack features = 26 dimensions |
| Market awareness | Short (5d) / Medium (20d) / Long (60d) cross-sectional regime features |
| Alpha primitives | Supported as leaf-node actions |
| Window selection | Explicit window-selection actions: 2, 3, 5, 10, 20, 50, 120 days |
| Expression depth | Minimum 3 steps (MIN_EXPRESSION_DEPTH), maximum 20 steps |
State vector (26 dimensions):
SHORT-TERM (5-day, 6 features):
avg_momentum, momentum_dispersion, avg_volatility,
avg_mean_reversion, avg_volume_ratio, pct_positive_return
MEDIUM-TERM (20-day, 6 features):
avg_momentum, momentum_dispersion, avg_volatility,
avg_mean_reversion, avg_volume_ratio, pct_positive_return
LONG-TERM (60-day, 4 features):
avg_momentum, momentum_dispersion, avg_volatility,
avg_mean_reversion
STACK STATE (10 features):
stack_size, current_step, normalized_window, n_operands, n_constants,
n_unary, n_binary, n_ts, n_alpha_primitives, steps_remaining
All market features are computed cross-sectionally (averaged across all tickers at the sampled date), giving the agent a read on the overall market regime.
Episode lifecycle:
- Sample a random date from the valid range.
- Compute market features at that date (state observation).
- Agent builds an expression by selecting tokens + window actions step by step.
- When the stack has exactly one complete expression (and ≥3 steps taken), the episode terminates.
- The expression is evaluated via rolling Sharpe ratio across all tickers over the full evaluation range.
Reward computation — Rolling Sharpe:
For each ticker:
- Evaluate the expression on the ticker's full historical data → raw alpha signal.
- Apply 60-day rolling rank normalization: convert the signal to a percentile rank within a 60-day window, then scale to $[-1, 1]$.
- Compute strategy returns = normalized alpha × forward returns, where the forward return on day $T$ is $(Close_{T+1} - Open_{T+1}) / Open_{T+1}$.
- Compute annualized Sharpe = $\frac{\mu}{\sigma} \times \sqrt{252}$, clipped to $[-10, 10]$.
- Average Sharpe across all tickers, with a consistency adjustment: if >70% of tickers have positive Sharpe → reward ×1.10; if <30% → reward ×0.85. Final reward clipped to $[-5, 5]$.
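The reward steps above can be sketched in plain Python. All function names are hypothetical, and several details are assumptions rather than the project's behavior: the signal is set to zero during the warm-up window, ties in the rank count as "less than or equal", and the population standard deviation is used for the Sharpe ratio.

```python
import math
import statistics


def rolling_rank_signal(alpha, window=60):
    """Trailing-window percentile rank of each value, scaled to [-1, 1]."""
    out = []
    for i in range(len(alpha)):
        if i + 1 < window:
            out.append(0.0)               # ASSUMPTION: flat during warm-up
            continue
        win = alpha[i - window + 1: i + 1]          # past and current values only
        rank = sum(v <= alpha[i] for v in win) / window
        out.append(2.0 * rank - 1.0)
    return out


def forward_returns(open_, close):
    """Return for day T uses day T+1 prices: (Close - Open) / Open."""
    return [(close[t + 1] - open_[t + 1]) / open_[t + 1]
            for t in range(len(close) - 1)]


def annualized_sharpe(returns):
    sigma = statistics.pstdev(returns)
    if sigma == 0:
        return 0.0
    sharpe = statistics.fmean(returns) / sigma * math.sqrt(252)
    return max(-10.0, min(10.0, sharpe))            # clip to [-10, 10]


def ticker_sharpe(alpha, open_, close, window=60):
    signal = rolling_rank_signal(alpha, window)
    fwd = forward_returns(open_, close)
    strat = [s * r for s, r in zip(signal, fwd)]    # zip drops the last signal value
    return annualized_sharpe(strat)


def aggregate_reward(sharpes):
    """Average per-ticker Sharpes, apply the consistency adjustment, clip to [-5, 5]."""
    reward = statistics.fmean(sharpes)
    frac_pos = sum(s > 0 for s in sharpes) / len(sharpes)
    if frac_pos > 0.70:
        reward *= 1.10
    elif frac_pos < 0.30:
        reward *= 0.85
    return max(-5.0, min(5.0, reward))


print(aggregate_reward([1.0, 1.0, 1.0, 1.0]), aggregate_reward([1.0, -1.0]))
```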
Structural guard: Expressions that don't reference any OHLCV column or alpha primitive receive a heavy penalty of $-1.0$.
Performance optimizations:
- All per-ticker features (at 5/20/60-day timeframes) and forward returns are precomputed once at initialization.
- Common dates across tickers are pre-identified.
- Structural checks gate expensive Sharpe computation.
The main training loop for the V2 environment.
Default hyperparameters (as used in the __main__ block):
| Parameter | Value | Rationale |
|---|---|---|
| `n_episodes` | 150,000 | Large budget for exploring the combinatorial expression space |
| `max_steps` | 20 | Allows expressions of moderate depth |
| `hidden_size` | 256 | Larger network to handle the 26-dim state |
| `learning_rate` | 0.0001 | Low LR for stability with PPO |
| `batch_episodes` | 64 | Collects 64 episodes per PPO update for stable gradients |
| `entropy_coef` | 0.10 (decaying) | Starts high for exploration, decays to 0.02 |
| `gamma` | 0.99 | Standard discount factor |
| `gae_lambda` | 0.95 | Standard GAE parameter |
Training tickers (19 diversified stocks):
| Sector | Tickers |
|---|---|
| Technology | AAPL, MSFT, GOOGL, INTC, CSCO |
| Finance | JPM, BAC, GS |
| Healthcare | JNJ, PFE, UNH |
| Consumer | WMT, KO, PG |
| Industrial / Aerospace | GE, CAT, BA |
| Energy | XOM, CVX |
Training features:
- Entropy decay: `entropy_coef` linearly decays over training from its initial value toward 0.02, shifting from exploration to exploitation.
- Expression deduplication: After 10 repeats of the same expression, a mild penalty (scaling down to ×0.5) is applied to encourage diversity.
- Comprehensive logging: Tracks rewards, expressions, per-ticker Sharpes, episode dates, and market features.
- Alpha library: Top 30 expressions are saved to a JSON-based `AlphaLibrary`.
- Market regime analysis: Post-training, episodes are grouped by market regime (trending / volatile / calm) to see if different conditions produce different alpha families.
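The entropy schedule in the list above amounts to a linear interpolation. The sketch below assumes the coefficient reaches exactly 0.02 on the final episode; `entropy_coef_at` is a hypothetical helper, and the training script may update the coefficient on a different cadence (for example once per PPO update).

```python
def entropy_coef_at(episode, n_episodes, start=0.10, end=0.02):
    """Linearly interpolate the entropy coefficient from start to end."""
    frac = episode / max(n_episodes - 1, 1)
    frac = min(max(frac, 0.0), 1.0)       # clamp so the value never leaves [end, start]
    return start + (end - start) * frac


for ep in (0, 75_000, 149_999):
    print(ep, round(entropy_coef_at(ep, 150_000), 4))
```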
Output:
- Saved agent checkpoint: `models/agent_batch_v2_primitives.pt`
- Full training history: `models/batch_v2_training_history.pkl`
- Alpha library: `alphas/batch_v2_alpha_library.json`
After training, discovered alphas are stored, evaluated, and converted to trading signals.
- `alpha_record.py`: `AlphaRecord` dataclass holding expression string, Sharpe ratio, mean return, standard deviation, max drawdown, turnover, market correlation, and metadata (creation date, training ticker, period, complexity).
- `alpha_library.py`: `AlphaLibrary` class that manages a collection of `AlphaRecord` objects. Supports adding alphas from training history, saving/loading to JSON, filtering by performance metrics, evaluating alphas on new data, and combining multiple alphas into an ensemble.
- `generate_signals.py`: Evaluates discovered alphas on new OHLCV data and generates trading signals (long / short / neutral). Supports equal-weight and Sharpe-weight ensemble combination methods.
- `save_alphas.py`: Utility to persist the best alphas from a training history dictionary into the alpha library.
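The two ensemble methods mentioned above can be sketched as follows. This is an illustration under assumptions, not the code in `generate_signals.py`: `combine_signals` and `to_position` are hypothetical names, negative Sharpe ratios are given zero weight, and the ±0.2 threshold for long/short/neutral is arbitrary.

```python
def combine_signals(signals, sharpes=None):
    """Weighted sum of equal-length signal lists.

    With sharpes=None every alpha gets weight 1/n (equal-weight);
    otherwise weights are proportional to each alpha's positive Sharpe.
    """
    n = len(signals)
    if sharpes is None:
        weights = [1.0 / n] * n
    else:
        pos = [max(s, 0.0) for s in sharpes]        # ASSUMPTION: ignore negative Sharpe
        total = sum(pos)
        weights = [p / total for p in pos] if total > 0 else [1.0 / n] * n
    length = len(signals[0])
    return [sum(w * sig[t] for w, sig in zip(weights, signals))
            for t in range(length)]


def to_position(value, threshold=0.2):
    """Map a combined signal value to a discrete trading signal."""
    if value > threshold:
        return "long"
    if value < -threshold:
        return "short"
    return "neutral"


signals = [[1.0, -1.0], [0.0, 1.0]]                 # two alphas, two days (made up)
equal = combine_signals(signals)
weighted = combine_signals(signals, sharpes=[3.0, 1.0])
print(equal, weighted, [to_position(v) for v in weighted])
```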
Utilities for acquiring and preparing the OHLCV dataset:
- `download_sp500_history.py`: Downloads the S&P 500 historical components CSV from GitHub (used to avoid survivorship bias).
- `sp500_tickers.py`: Extracts S&P 500 constituents at a specific historical date (default: `2016-02-09`), ensuring the training set reflects the index composition at the time of the data.
- `download_ohlcv.py`: Downloads daily OHLCV data for all tickers via `yfinance`. Default date range: `2016-02-09` to `2022-02-09`.
- `visualize_inference.py`: Generates presentation-ready plots showing the agent's step-by-step inference process: action probability distributions, expression tree growth, stack state evolution, and final evaluation. Outputs to the `figures/` directory.
Unit and integration tests for each component:
- `dsl_test.py`: Tests DSL token creation and operator behavior.
- `expr_test.py`: Tests expression tree building and evaluation.
- `env_test.py`: Tests environment reset, step, action masking, and reward computation.
- `agentPPO_test.py`: Tests PPO agent action selection, update, and save/load.
- `agentREINF_test.py`: Tests the REINFORCE agent baseline.
The reward function is carefully designed to avoid common pitfalls in quantitative finance:
- No future leakage: Rolling rank normalization uses only past data within a 60-day window. Forward returns use next-period open-to-close (not close-to-close), modeling the realistic scenario of observing at close of day $T$, entering at open of day $T+1$, and exiting at close of day $T+1$.
- Rank-based normalization: Instead of z-score normalization (which gives trending signals like raw `Close` a trivially profitable trend-following signal), the alpha is converted to a 60-day rolling percentile rank. A monotonic series gets rank ≈ 0.5 → near-zero signal after centering.
- Cross-sectional evaluation: Evaluating on 19 diversified tickers across multiple sectors prevents overfitting to a single stock's idiosyncrasies.
- Rolling evaluation: The Sharpe ratio is computed over the entire valid evaluation range (not just a small forward window), providing a robust performance estimate.
- Consistency adjustment: Cross-ticker agreement (>70% positive → bonus, <30% → penalty) favors expressions that generalize rather than exploit noise in a few tickers.
- Structural guards: Expressions that don't reference any OHLCV data or alpha primitives receive a heavy penalty ($-1.0$), preventing the agent from gaming the reward with constant expressions.
The system expects a CSV file at data/ohlcv_data.csv with the following columns:
| Column | Type | Description |
|---|---|---|
| `Date` | string/datetime | Trading date |
| `Ticker` | string | Stock symbol (e.g., AAPL) |
| `Open` | float | Opening price |
| `High` | float | Highest price |
| `Low` | float | Lowest price |
| `Close` | float | Closing price |
| `Volume` | float/int | Trading volume |
Multiple tickers should be stacked vertically (long format). Each ticker should have at least ~320 rows of data (120 warmup + 200 minimum evaluation days).
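A quick pre-flight check of a CSV against these requirements could look like the sketch below. `check_ohlcv` is a hypothetical helper built on the standard library, not part of the project; the two sample rows use placeholder values purely to exercise the function.

```python
import csv
import io
from collections import Counter

REQUIRED = ["Date", "Ticker", "Open", "High", "Low", "Close", "Volume"]
MIN_ROWS = 320   # 120 warm-up + 200 evaluation days, per the text above


def check_ohlcv(handle, min_rows=MIN_ROWS):
    """Return {ticker: row_count} for tickers with too few rows.

    Raises ValueError if any required column is missing from the header.
    """
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    counts = Counter(row["Ticker"] for row in reader)
    return {t: n for t, n in counts.items() if n < min_rows}


# Placeholder rows in the expected long format (values are made up).
sample = io.StringIO(
    "Date,Ticker,Open,High,Low,Close,Volume\n"
    "2016-02-09,AAPL,100.0,101.0,99.0,100.5,1000\n"
    "2016-02-10,AAPL,100.5,102.0,100.0,101.5,1200\n"
)
short = check_ohlcv(sample)
print(short)
```

For a real file, the same function can be called on `open("data/ohlcv_data.csv", newline="")`.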
numpy>=1.21
pandas>=1.3
torch>=1.10
matplotlib>=3.5
yfinance>=0.2 # only for data_pipeline/ download scripts

No GPU is required: training runs on CPU by default. For large-scale experiments, set `device='cuda'` in the agent constructor.
# Clone the repository
git clone <repository-url>
cd alpha_engine
# Create a virtual environment (recommended)
python -m venv venv
venv\Scripts\activate # Windows
# source venv/bin/activate # Linux/Mac
# Install dependencies
pip install numpy pandas torch matplotlib
# Optional: only needed for downloading data
pip install yfinance

Option A: download from S&P 500 (recommended):
cd PIPELINE_CLEAN
# Step 1: Download S&P 500 historical components
python data_pipeline/download_sp500_history.py
# Step 2: Extract tickers at a historical date (avoids survivorship bias)
python data_pipeline/sp500_tickers.py
# Step 3: Download OHLCV data for those tickers
python data_pipeline/download_ohlcv.py

Option B: provide your own data:
Place your OHLCV CSV at PIPELINE_CLEAN/data/ohlcv_data.csv. Ensure it contains columns: Date, Ticker, Open, High, Low, Close, Volume, with multiple tickers stacked vertically.
cd PIPELINE_CLEAN
python train/train_PPO_enh_multi_v2.py

This will:
- Load the data and select 19 diversified training tickers.
- Train for 150,000 episodes (~2,344 PPO updates with batch size 64).
- Print progress every 500 episodes with rolling reward, average Sharpe, and per-ticker breakdown.
- Save the trained agent to `models/agent_batch_v2_primitives.pt`.
- Save discovered alphas to `alphas/batch_v2_alpha_library.json`.
- Print the top 30 discovered expressions, a per-ticker breakdown of the best alpha, and a market regime analysis.
Edit the bottom of train_PPO_enh_multi_v2.py to change:
agent, history = train_batch_v2(
data=data,
tickers=training_tickers,
n_episodes=150000, # Number of training episodes
max_steps=20, # Max expression depth
hidden_size=256, # Network size
learning_rate=0.0001, # Learning rate
entropy_coef=0.10, # Initial exploration coefficient (decays to 0.02)
batch_episodes=64, # Episodes per PPO update
print_every=500, # Logging frequency
save_path="models/my_agent.pt"
)

After training, examine:
- Console output: Top 30 expressions ranked by Sharpe, per-ticker breakdown of the best alpha, and market regime analysis (trending / volatile / calm).
- `alphas/batch_v2_alpha_library.json`: Machine-readable library of discovered alphas with metadata.
- `models/batch_v2_training_history.pkl`: Full training history for custom analysis.
import pickle
with open("models/batch_v2_training_history.pkl", "rb") as f:
history = pickle.load(f)
# history['rewards']: list of per-episode rewards
# history['expressions']: list of expression strings
# history['losses']: list of PPO loss values
# history['ticker_sharpes']: list of {ticker: sharpe} dicts
# history['episode_dates']: list of sampled dates
# history['market_features']: list of market feature vectors

After training, use the discovered alphas to generate signals on new data:
from alpha_trading import AlphaLibrary
library = AlphaLibrary.load("alphas/batch_v2_alpha_library.json")
# Use generate_signals.py to evaluate alphas on new OHLCV data

python simulations/visualize_inference.py

Generates plots in the `figures/` directory showing the agent's step-by-step expression-building process.
Typical expressions found by the agent (results vary by dataset):
alpha_mean_rev(5) # 5-day mean reversion
sub(alpha_mom(20), alpha_mom(5)) # Momentum cross-over
mul(alpha_rsi(14), neg(alpha_vol_surge(10))) # RSI filtered by unusual volume
ts_zscore(sub(Close, ts_mean(Close, 20)), 60) # Z-score of deviation from 20-day MA
div(delta(Close, 1), ts_std(Close, 10)) # Normalized daily change
Each expression is fully interpretable and can be evaluated on new data with a single call to expr.evaluate(new_data).
Created by Giovanni Zara and Paolo Laurenti
See LICENSE.txt.