# Methodology
**Core Question**: How do we know whether LLMs truly understand market constraints or simply memorize patterns from their training data?

**Solution**: Obfuscation testing. Strip all temporal context and force the model to reason purely from structure.
When testing LLMs on financial markets, we face a fundamental challenge:
**Problem**: LLMs may have seen similar data during training:
- Historical market data is widely available online
- News articles, research papers, trading forums
- Pattern descriptions in public documentation

**Risk**: Detection could come from memorization, not understanding:
- The LLM recognizes "January 2024" → recalls market events
- The LLM sees "SPY" → activates financial domain knowledge
- The LLM pattern-matches keywords rather than understanding constraints
Naive fixes fail:
- ❌ **Test on recent data**: the model may still have seen it (training cutoffs are unclear)
- ❌ **Use different tickers**: doesn't eliminate temporal context
- ❌ **Ask for explanations**: LLMs can generate plausible-sounding reasoning without true understanding
**Our approach**: Strip ALL temporal and contextual information that could enable memorization.
Instead of:

```
Date: January 2, 2024
Ticker: SPY
Net GEX: -$8.95B (negative gamma)
Spot price: $474.60
```

we present:

```
Day T+0 (obfuscated test day)
Asset: INDEX_1
Net GEX: -$8.95B (negative gamma)
Spot price: $474.60
```
**What we remove** (see the sketch after these lists):
- Dates: "2024-01-02" → "Day T+0"
- Tickers: "SPY" → "INDEX_1"
- Weekday/month clues: only relative dates remain ("T+1", "T+7", "T+30")
- Events: no FOMC meetings, earnings, or holidays mentioned
- Context: no news, no market-regime descriptions

**What we keep**:
- Market structure: GEX values, strikes, volumes
- Options mechanics: calls/puts, expirations, IV
- Dealer constraints: regulatory requirements (delta neutrality)
- Physical realities: time decay, gamma explosion
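As a concrete illustration, the mapping could be implemented along these lines. This is a hypothetical `obfuscate_record` helper, not the repo's `DataObfuscator`; calendar-day offsets are used for simplicity where the real pipeline presumably counts trading days:

```python
from datetime import date

def obfuscate_record(record: dict, base_date: date, ticker_map: dict) -> dict:
    """Replace absolute dates and tickers with structure-only labels."""
    # Dates become relative offsets: 2024-01-02 -> "Day T+0", the next day -> "Day T+1"
    offset = (record["date"] - base_date).days
    # Tickers map consistently: the first asset seen is always "INDEX_1"
    ticker_map.setdefault(record["ticker"], f"INDEX_{len(ticker_map) + 1}")
    return {
        "date": f"Day T+{offset}",
        "asset": ticker_map[record["ticker"]],
        # Structural fields pass through untouched
        "net_gex": record["net_gex"],
        "spot": record["spot"],
    }

mapping: dict = {}
row = {"date": date(2024, 1, 2), "ticker": "SPY", "net_gex": -8.95e9, "spot": 474.60}
print(obfuscate_record(row, base_date=date(2024, 1, 2), ticker_map=mapping))
# {'date': 'Day T+0', 'asset': 'INDEX_1', 'net_gex': -8950000000.0, 'spot': 474.6}
```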
Obfuscation alone isn't enough. We require explicit causal identification.
- **WHO**: Identify the market participants (dealers, retail traders, institutional hedgers)
- **WHOM**: Who is being forced or constrained? Not who benefits, but who has no choice
- **WHAT**: What action are they forced to take? A specific, verifiable trading behavior
Example:

**WHO**: Options dealers (market makers)

**WHOM**: Dealers are forced by:
- Regulatory mandate: must maintain delta neutrality (cannot hold directional risk)
- Risk limits: large gamma positions create unacceptable volatility exposure

**WHAT**: Dealers must:
- Continuously rebalance hedges as the spot price moves
- Buy the underlying when price falls (short gamma forces buying into weakness)
- Sell the underlying when price rises (short gamma forces selling into strength)

**Key**: This isn't a choice; it's a constraint. Dealers face regulatory and risk penalties if they don't comply.
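One way to make the WHO/WHOM/WHAT requirement machine-checkable is to force every LLM answer into a fixed schema before scoring. A minimal sketch with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass
class CausalClaim:
    who: str           # WHO: the constrained market participant, e.g. "options dealers"
    forced_by: str     # WHOM/why: the constraint that removes choice (regulation, risk, physics)
    action: str        # WHAT: the specific, verifiable forced behavior
    confidence: float  # the LLM's self-reported confidence, 0-100

claim = CausalClaim(
    who="options dealers (market makers)",
    forced_by="delta-neutrality mandate plus gamma risk limits",
    action="buy the underlying as price falls, sell as it rises",
    confidence=85.0,
)
```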
We classify patterns into two categories:
**Category 1: Mechanical patterns**

Definition: Patterns driven by constraints dealers cannot avoid
Characteristics:
- Regulatory mandate (delta neutrality)
- Physical reality (time decay)
- Risk limits (gamma explosion)
- Contractual obligation (settlement rules)
Examples:
- Gamma positioning (regulatory requirement)
- Stock pinning (time decay + delta hedging)
- 0DTE hedging (concentrated expiration risk)
Expected LLM Behavior: High detection rate even with obfuscation
**Category 2: Narrative patterns**

Definition: Patterns requiring temporal or contextual knowledge
Characteristics:
- Time-dependent (knowing "Friday 3:30 PM")
- Event-driven (FOMC meetings, earnings)
- Statistical anomalies (volume spikes without mechanism)
- Context-dependent (works sometimes, not always)
Examples:
- "Friday 3:30 squeeze" (requires knowing day of week)
- "FOMC drift" (requires knowing FOMC dates)
- "Volume anomaly" (no mechanical constraint)
Expected LLM Behavior: Low detection rate with obfuscation (reveals memorization)
**Criterion 1: Detection Rate**

Metric: Percentage of test days on which the LLM correctly identifies the constraint (computed as in the sketch below)

Threshold: ≥60% detection rate (significantly better than random)

Interpretation:
- 100%: perfect mechanical understanding
- 60-80%: strong structural detection
- <60%: the pattern may be narrative, not mechanical
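Under this definition, the detection rate is simply the share of obfuscated test days on which the scored answer names the true constraint. A minimal sketch:

```python
def detection_rate(detected_flags: list[bool]) -> float:
    """Percent of obfuscated test days on which the constraint was identified."""
    return 100.0 * sum(detected_flags) / len(detected_flags)

# e.g. the constraint was identified on 45 of 53 obfuscated test days
print(f"{detection_rate([True] * 45 + [False] * 8):.1f}%")  # 84.9%
```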
**Criterion 2: Predictive Accuracy**

Metric: Percentage of predictions that materialize

Calculation:

```python
# LLM predicts: "Dealers forced to buy, expect upward pressure"
# Verification: check whether SPY actually moved up on T+1
if prediction.direction == "UP" and forward_return > 0:
    prediction_correct = True
```

Threshold: ≥80% accuracy (predictions must materialize)

Interpretation:
- High accuracy: the LLM understands the causal mechanism
- Low accuracy: the LLM is detecting a pattern that doesn't drive price action
**Criterion 3: Sample Size**

Requirement: Minimum of 30 samples per pattern

Rationale: Statistical significance

Our implementation: 242 trading days × 3 patterns = 726 tests
The obfuscation logic lives in `src/data_sources/data_obfuscator.py`.

Key features:
- **Date obfuscation**: maps real dates → "Day T+X" format
- **Ticker obfuscation**: maps "SPY" → "INDEX_1"
- **Consistency**: the same asset always gets the same obfuscated name within an experiment
- **Reversibility**: maintains the mapping for verification
Example usage:

```python
from datetime import datetime

from src.data_sources.data_obfuscator import DataObfuscator

obfuscator = DataObfuscator()

# Obfuscate data
obfuscated = obfuscator.obfuscate_data(
    gex_data=gex_results,
    test_date=datetime(2024, 1, 2),
    ticker="SPY",
)

# LLM sees:
# Day T+0, INDEX_1, Net GEX: -$8.95B
```

The obfuscated prompt template:

```python
OBFUSCATED_PROMPT = """
You are analyzing options market mechanics on {obfuscated_date}.

**Market Data** (Asset: {obfuscated_ticker}):
- Spot Price: ${spot_price:.2f}
- Net GEX: ${net_gex_billions:.2f}B
- GEX Distribution: {gex_distribution}

**Question**: WHO is forcing WHOM to do WHAT?

**Requirements**:
1. Identify market participants and their constraints
2. Explain the FORCING mechanism (regulation, risk, physics)
3. Predict what actions are FORCED (not chosen)
4. Assign confidence (0-100%)

**No real dates, tickers, or events are provided. Reason from structure alone.**
"""
```

Run the LLM on the obfuscated data → did it detect the constraint?
- **Pass**: the LLM identifies dealers, gamma hedging, and forced buying/selling
- **Fail**: the LLM says "no pattern" or detects the wrong constraint
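How an answer gets scored as pass or fail is not shown above; a crude keyword-based check along these lines would work. This is illustrative only, and the project's actual parser in `validate_pattern_taxonomy.py` may differ:

```python
REQUIRED_CONCEPTS = ("dealer", "gamma", "hedg")  # "hedg" matches hedge/hedging

def detected_constraint(llm_response: str) -> bool:
    """Pass only if the answer names dealers, gamma, and hedging."""
    text = llm_response.lower()
    if "no pattern" in text:
        return False
    return all(concept in text for concept in REQUIRED_CONCEPTS)
```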
Check whether the prediction materialized using forward returns.

Data: `OutcomeCalculator` computes T+1 and T+3 forward returns.

Verification:

```python
# LLM predicted "dealers forced to buy → upward pressure"
if llm_prediction == "UP" and forward_return_t1 > 0:
    accurate = True
```

Compute the detection rate and accuracy across all test days.
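Aggregating the per-day flags into the report could look like the following sketch (field names taken from the example report below; uses PyYAML):

```python
import yaml  # PyYAML

def build_report(pattern: str, detected: list[bool], accurate: list[bool]) -> str:
    """Fold per-day pass/fail flags into the YAML validation report."""
    report = {
        "pattern_name": pattern,
        "detection_rate_pct": round(100.0 * sum(detected) / len(detected), 1),
        # accuracy is scored only over days where a prediction was made
        "predictive_accuracy_pct": round(100.0 * sum(accurate) / max(len(accurate), 1), 1),
        "sample_size": len(detected),
    }
    return yaml.safe_dump(report, sort_keys=False)
```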
Output: YAML validation report

Example:

```yaml
pattern_name: gamma_positioning
detection_rate_pct: 100.0
predictive_accuracy_pct: 96.2
sample_size: 53
```

We validated 3 patterns across 2024 (Q1, Q3, Q4 = 181 days).
Result:
- Detection: stable (84-100%) across all quarters
- Accuracy: high (87-98%) across all quarters
- Profitability: declined from +21-70 bps (Q1) to -1 to +5 bps (Q4)

If the LLM were memorizing profitable patterns, detection would have declined along with profitability. Observed behavior: detection stayed stable while the profits disappeared.

Conclusion: the LLM detects structural constraints, not profit opportunities.

Implication: the methodology successfully prevents temporal-context leakage.
Initial testing used biased prompts that assumed the pattern existed.

**Biased prompt** (Q3+Q4 2024, 128 days):

> "Dealers are short gamma. What are they forced to do?"

Results:
- Detection: 100% (by design: the prompt assumes the pattern)
- Accuracy: 87.5-93.0% (predictions materialized)

**Unbiased prompt** (full 2024, 242 days):

> "Analyze this market data. WHO is forcing WHOM to do WHAT? If no pattern exists, say so."

Results:
- Detection: 67.4-77.7% (realistic detection rate)
- Accuracy: 86.5-98.4% (similar accuracy)

**Key insight**: Biased prompts inflate detection but don't affect accuracy (predictions still materialize).
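A small harness makes this bias comparison repeatable: run the same days through both framings and compare detection rates. A sketch, reusing `detected_constraint` from the earlier sketch; `ask_llm` is a placeholder for whatever LLM client the project uses:

```python
BIASED = "Dealers are short gamma. What are they forced to do?"
UNBIASED = ("Analyze this market data. WHO is forcing WHOM to do WHAT? "
            "If no pattern exists, say so.")

def compare_prompts(days: list[dict], ask_llm) -> dict[str, float]:
    """Detection rate under each prompt framing, scored identically."""
    rates = {}
    for name, prompt in (("biased", BIASED), ("unbiased", UNBIASED)):
        hits = [detected_constraint(ask_llm(prompt, day)) for day in days]
        rates[name] = 100.0 * sum(hits) / len(hits)
    return rates
```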
Known limitations:
- **LLM training on the methodology**: the model may have been trained on this exact obfuscation approach
- **Common pattern names**: "gamma positioning" may trigger memorized associations
- **Indirect temporal clues**: GEX magnitudes might correlate with specific time periods

Mitigations:
- Test multiple pattern framings: same constraint, different narratives
- Vary obfuscation schemes: "T+0" vs "Day 1" vs "Test Day"
- Compare to random data: ensure the LLM doesn't hallucinate patterns
Future work:
- **Comparative LLMs**: test GPT-4, Claude, o3-mini (do all detect the same patterns?)
- **Blind validation**: present random GEX data and check the false-positive rate (see the sketch below)
- **Formal verification**: mathematical proof of constraint existence
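The blind-validation idea can be prototyped directly: feed structurally plausible but random GEX values and measure how often the model still claims a constraint. A sketch, reusing `UNBIASED` and `detected_constraint` from the sketches above:

```python
import random

def random_gex_day(spot: float = 475.0) -> dict:
    """A synthetic day with no real dealer constraint behind it."""
    return {
        "spot": round(spot * random.uniform(0.95, 1.05), 2),
        "net_gex": random.uniform(-10e9, 10e9),  # random sign and magnitude
    }

def false_positive_rate(n: int, ask_llm) -> float:
    """Share of random days on which the LLM still claims a constraint."""
    days = [random_gex_day() for _ in range(n)]
    hits = [detected_constraint(ask_llm(UNBIASED, day)) for day in days]
    return 100.0 * sum(hits) / n  # should be near zero for a sound detector
```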
- Code: all obfuscation logic in `src/data_sources/data_obfuscator.py`
- Validation scripts: `scripts/validation/validate_pattern_taxonomy.py`
- Prompts: documented in `src/agents/market_mechanics_agent.py`
- Results: YAML files in `reports/validation/pattern_taxonomy/`
```bash
# Set up environment
export PYTHONPATH=$(pwd):$PYTHONPATH
export OPENAI_API_KEY="your-key"

# Run obfuscated validation
python scripts/validation/validate_pattern_taxonomy.py \
  --pattern gamma_positioning \
  --symbol SPY \
  --start-date 2024-01-02 \
  --end-date 2024-03-29 \
  --confidence 60.0
```

Output: YAML report with detection rate, accuracy, and sample details.
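Checking a finished run against the success criteria can then be automated. A sketch, assuming the report fields shown earlier and a hypothetical report path:

```python
import yaml

# Hypothetical report path; adjust to the actual file under reports/validation/
with open("reports/validation/pattern_taxonomy/gamma_positioning.yaml") as f:
    report = yaml.safe_load(f)

passed = (
    report["detection_rate_pct"] >= 60.0           # Criterion 1: detection
    and report["predictive_accuracy_pct"] >= 80.0  # Criterion 2: accuracy
    and report["sample_size"] >= 30                # Criterion 3: sample size
)
print("PASS" if passed else "FAIL")
```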
See `docs/guides/` for detailed conceptual explanations:

- `obfuscation-testing-explained.md` - in-depth methodology guide
- `gex-metrics-explained.md` - gamma exposure calculations
- `validation-framework.md` - pattern taxonomy and success criteria
Last Updated: October 25, 2025