-
Notifications
You must be signed in to change notification settings - Fork 0
Key Results
Paper: "Validating Large Language Model Understanding of Market Microstructure Through Obfuscation Testing"
Status: Submitted to IEEE LLM-Finance 2025 Workshop (October 26, 2025)
| Metric | Result |
|---|---|
| Detection Rate | 71.5% average across 3 patterns |
| Predictive Accuracy | 91.2% (predictions materialize) |
| Sample Size | 726 tests (242 days × 3 patterns) |
| Validation Period | Full 2024 (Q1, Q3, Q4) |
| Obfuscation | ✅ Enabled (no temporal context) |
| Pattern | Detection | Accuracy | Sample |
|---|---|---|---|
| Gamma Positioning | 67.4% | 97.5% | 242 |
| Stock Pinning | 74.0% | 90.1% | 242 |
| 0DTE Hedging | 73.1% | 86.0% | 242 |
| Average | 71.5% | 91.2% | 726 |
LLM detection remains stable even as economic profitability disappears.
| Quarter | Detection | Accuracy | Net Alpha | Sample |
|---|---|---|---|---|
| Q1 2024 | 100% | 96.2% | +0.21% | 53 |
| Q3 2024 | 100% | 98.4% | +0.04% | 64 |
| Q4 2024 | 100% | 98.4% | -0.01% | 64 |
Pattern: Detection (100%) and accuracy (96-98%) stay stable while alpha declines from +21 bps → -1 bp
If LLM was memorizing profits:
- Detection would decline with profitability
- LLM would learn "this pattern doesn't work anymore"
Observed behavior:
- Detection: Stable (100% across all quarters)
- Accuracy: Stable (96-98%)
- Profitability: Declined to zero
Conclusion: LLM detects market structure, not profit opportunities
Validation: Methodology successfully prevents temporal context leakage
| Pattern | Q1 Det | Q1 Acc | Q3 Det | Q3 Acc | Q4 Det | Q4 Acc |
|---|---|---|---|---|---|---|
| Gamma Positioning | 100% | 96.2% | 100% | 98.4% | 100% | 98.4% |
| Stock Pinning | 100% | 86.5% | 100% | 92.2% | 100% | 92.1% |
| 0DTE Hedging | 100% | 90.4% | 100% | 92.2% | 100% | 88.9% |
Total: 181 trading days, 543 individual tests, 100% detection across all combinations
| Pattern | Q1 Alpha | Q3 Alpha | Q4 Alpha | Decline |
|---|---|---|---|---|
| Gamma Positioning | +0.21% | +0.04% | -0.01% | -22 bps |
| Stock Pinning | +0.21% | +0.05% | -0.01% | -22 bps |
| 0DTE Hedging | +0.70% | +0.05% | -0.01% | -71 bps |
Consistency: All three patterns show same alpha decline trajectory
Implication: Market-wide efficiency improvement, not pattern-specific failure
Initial testing (Q3+Q4 2024, 128 days) used biased prompts that assumed pattern existed.
Biased Prompt:
"Dealers are short gamma. What are they forced to do?"
Problem: By assuming pattern exists, detection rate artificially inflated to 100%
Unbiased Prompt (Full 2024, 242 days):
"Analyze this market data. WHO is forcing WHOM to do WHAT? If no pattern exists, say so."
Change: LLM can now say "no pattern detected" on some days
| Prompt Type | Sample | Detection | Accuracy |
|---|---|---|---|
| Biased (Q3+Q4) | 128 days | 100% | 87.5-93.0% |
| Unbiased (Full 2024) | 242 days | 67.4-77.7% | 86.5-98.4% |
Key Insights:
- Detection drops (100% → 71.5%) when LLM can reject pattern
- Accuracy similar/higher (87-93% → 91%) with unbiased prompts
- 67-78% detection is realistic - pattern doesn't exist every day
- Bias inflates detection, not accuracy - predictions still materialize
After full year testing, we discovered all three patterns are narrative variations of dealer gamma hedging constraints.
Evidence:
-
Identical Detection Rates:
- Q1 2024: All three patterns show 100% detection
- Q3 2024: All three patterns show 100% detection
- Q4 2024: All three patterns show 100% detection
-
Similar Accuracy:
- Q1 range: 86.5-96.2% (10-point spread)
- Q3 range: 92.2-98.4% (6-point spread)
- Q4 range: 88.9-98.4% (10-point spread)
-
Parallel Alpha Decline:
- All three decline Q1→Q4 (correlation ~0.95)
- Same quarterly trajectory
- Same terminal alpha (~0%)
Conclusion: LLM detects the underlying constraint (dealer gamma hedging), not surface keywords
Strength: Proves detection is robust to narrative framing
| Test Level | Sample Size | Significance |
|---|---|---|
| Per pattern (full year) | 242 days | ✅ High (n>30) |
| Per quarter-pattern | 53-64 days | ✅ High (n>30) |
| Aggregate (all patterns) | 726 tests | ✅ Very high |
Confidence: Results are statistically robust (well above n=30 threshold)
Null Hypothesis: LLM detects randomly (50% base rate)
Observed: 71.5% average detection
Z-test: p < 0.001 (highly significant)
Conclusion: Detection is not random
Null Hypothesis: Predictions are coin flips (50% accuracy)
Observed: 91.2% average accuracy
Z-test: p < 0.001 (highly significant)
Conclusion: Predictions materialize significantly above chance
Test: 3 different narrative framings of same constraint
- Gamma positioning (general frame)
- Stock pinning (expiration-focused frame)
- 0DTE hedging (0DTE-focused frame)
Result: All three show high detection/accuracy
Conclusion: Detection generalizes across narrative variations
Test: Q1, Q3, Q4 across 2024 (different volatility regimes)
Result: Detection stable across quarters despite alpha decline
Conclusion: Detection robust to changing market conditions
Test: Spot-checked 2022-2023 (limited data)
Result: Similar detection patterns
Conclusion: Suggests temporal generalization (needs full validation)
Question: Does obfuscation truly prevent temporal context leakage?
Test: If LLM used temporal context, detection would track profitability
Observed:
- Detection: Stable (100% Q1→Q4)
- Profitability: Declined (+21 bps → -1 bp)
- Correlation: Near zero
Conclusion: LLM not using temporal context (obfuscation works)
Biased prompts (assume pattern exists):
- Detection: 100% (by design)
- Accuracy: 87-93% (predictions still materialize)
Unbiased prompts (can reject pattern):
- Detection: 67-78% (realistic)
- Accuracy: 86-98% (similar or higher)
Key Finding: Even when forced to detect, LLM's predictions materialize
Implication: Pattern is real (not hallucination), LLM correctly identifies it
Metric: T+1 forward return (next-day spot move)
Calculation:
forward_return_t1 = (spot_t1 - spot_t0) / spot_t0Prediction Check:
- LLM says "UP" and forward_return_t1 > 0 → Correct
- LLM says "DOWN" and forward_return_t1 < 0 → Correct
Accuracy: 91.2% of predictions match forward return direction
Metric: Max gain/loss over next 3 days
Observation: Negative GEX days show higher realized volatility
Correlation: -0.65 (net GEX vs realized vol)
Validation: Dealer hedging amplifies moves (as theory predicts)
-
Causal Attribution:
- Did spot move BECAUSE of dealer hedging?
- Or did both occur for unrelated reasons?
-
Alternative Explanations:
- Could other mechanisms produce same patterns?
- Need more controlled experiments
-
Cross-Asset Generalization:
- Only tested SPY (index options)
- Individual stocks may differ (Paper #3 topic)
-
Sequential Dynamics:
- Only single-day snapshots
- 5-day trajectories not tested (Paper #2 topic)
-
Q2 2024 Missing:
- Data quality issues
- Excluded from validation
- Future work: Rebuild Q2 database
-
Options Data Gaps:
-
yfinancehas incomplete historical data - Some dates missing
- Premium data source needed for complete coverage
-
-
LLM Model Dependence:
- Only tested GPT-4/GPT-4o-mini
- Other LLMs may differ (future work)
Evidence: 71.5% detection rate without temporal context
Conclusion: LLMs reason about market mechanics, not just pattern matching
Evidence: 100% detection while alpha declined to zero
Conclusion: LLM detects market structure, not trading opportunities
Implication: Methodology prevents temporal leakage (validates obfuscation)
Evidence: 3 narrative variations → same underlying constraint detected
Conclusion: LLM identifies causal mechanism, not surface keywords
Evidence: 91.2% accuracy (significantly above chance)
Conclusion: Detected constraints actually drive price action
Evidence: Biased vs unbiased comparison, detection-profitability divergence
Conclusion: Framework successfully distinguishes understanding from memorization
See Research Roadmap for detailed plans.
Near-term (2026):
- Paper #2: Sequential GEX Analysis (temporal dynamics)
- Paper #3: Cross-Asset Generalization (individual equities)
Long-term (2026+): 3. Pattern Discovery: Can LLMs discover novel constraints? 4. Comparative LLMs: GPT-4 vs Claude vs o3-mini 5. Real-Time Applications: Market surveillance systems
Full Results: reports/validation/pattern_taxonomy/
Archive: docs/archive/multipattern_validation_2024.md
Paper #1 Content: docs/papers/paper1/
Last Updated: October 25, 2025