Key Results

Key Results: Paper #1 Findings

Paper: "Validating Large Language Model Understanding of Market Microstructure Through Obfuscation Testing"

Status: Submitted to IEEE LLM-Finance 2025 Workshop (October 26, 2025)

Primary Results (Unbiased, Full 2024)

Aggregate Performance

Metric	Result
Detection Rate	71.5% average across 3 patterns
Predictive Accuracy	91.2% (predictions materialize)
Sample Size	726 tests (242 days × 3 patterns)
Validation Period	Full 2024 (Q1, Q3, Q4)
Obfuscation	✅ Enabled (no temporal context)

Pattern-Specific Results (Unbiased Prompts, Full 2024)

Pattern	Detection	Accuracy	Sample
Gamma Positioning	67.4%	97.5%	242
Stock Pinning	74.0%	90.1%	242
0DTE Hedging	73.1%	86.0%	242
Average	71.5%	91.2%	726

Critical Finding: Detection-Profitability Divergence

The Discovery

LLM detection remains stable even as economic profitability disappears.

Full Year Comparison (Gamma Positioning Example)

Quarter	Detection	Accuracy	Net Alpha	Sample
Q1 2024	100%	96.2%	+0.21%	53
Q3 2024	100%	98.4%	+0.04%	64
Q4 2024	100%	98.4%	-0.01%	64

Pattern: Detection (100%) and accuracy (96-98%) stay stable while alpha declines from +21 bps → -1 bp

Why This Matters

If LLM was memorizing profits:

Detection would decline with profitability
LLM would learn "this pattern doesn't work anymore"

Observed behavior:

Detection: Stable (100% across all quarters)
Accuracy: Stable (96-98%)
Profitability: Declined to zero

Conclusion: LLM detects market structure, not profit opportunities

Validation: Methodology successfully prevents temporal context leakage

Multi-Pattern Full Year Results

All 9 Quarter-Pattern Combinations

Pattern	Q1 Det	Q1 Acc	Q3 Det	Q3 Acc	Q4 Det	Q4 Acc
Gamma Positioning	100%	96.2%	100%	98.4%	100%	98.4%
Stock Pinning	100%	86.5%	100%	92.2%	100%	92.1%
0DTE Hedging	100%	90.4%	100%	92.2%	100%	88.9%

Total: 181 trading days, 543 individual tests, 100% detection across all combinations

Alpha Decline Across Patterns

Pattern	Q1 Alpha	Q3 Alpha	Q4 Alpha	Decline
Gamma Positioning	+0.21%	+0.04%	-0.01%	-22 bps
Stock Pinning	+0.21%	+0.05%	-0.01%	-22 bps
0DTE Hedging	+0.70%	+0.05%	-0.01%	-71 bps

Consistency: All three patterns show same alpha decline trajectory

Implication: Market-wide efficiency improvement, not pattern-specific failure

Comparison: Biased vs Unbiased Prompts (Issue #90 Discovery)

The Problem

Initial testing (Q3+Q4 2024, 128 days) used biased prompts that assumed pattern existed.

Biased Prompt:

"Dealers are short gamma. What are they forced to do?"

Problem: By assuming pattern exists, detection rate artificially inflated to 100%

The Fix

Unbiased Prompt (Full 2024, 242 days):

"Analyze this market data. WHO is forcing WHOM to do WHAT? If no pattern exists, say so."

Change: LLM can now say "no pattern detected" on some days

Results Comparison

Prompt Type	Sample	Detection	Accuracy
Biased (Q3+Q4)	128 days	100%	87.5-93.0%
Unbiased (Full 2024)	242 days	67.4-77.7%	86.5-98.4%

Key Insights:

Detection drops (100% → 71.5%) when LLM can reject pattern
Accuracy similar/higher (87-93% → 91%) with unbiased prompts
67-78% detection is realistic - pattern doesn't exist every day
Bias inflates detection, not accuracy - predictions still materialize

Pattern Consolidation Discovery

Three Patterns = One Mechanism

After full year testing, we discovered all three patterns are narrative variations of dealer gamma hedging constraints.

Evidence:

Identical Detection Rates:
- Q1 2024: All three patterns show 100% detection
- Q3 2024: All three patterns show 100% detection
- Q4 2024: All three patterns show 100% detection
Similar Accuracy:
- Q1 range: 86.5-96.2% (10-point spread)
- Q3 range: 92.2-98.4% (6-point spread)
- Q4 range: 88.9-98.4% (10-point spread)
Parallel Alpha Decline:
- All three decline Q1→Q4 (correlation ~0.95)
- Same quarterly trajectory
- Same terminal alpha (~0%)

Conclusion: LLM detects the underlying constraint (dealer gamma hedging), not surface keywords

Strength: Proves detection is robust to narrative framing

Statistical Significance

Sample Sizes

Test Level	Sample Size	Significance
Per pattern (full year)	242 days	✅ High (n>30)
Per quarter-pattern	53-64 days	✅ High (n>30)
Aggregate (all patterns)	726 tests	✅ Very high

Confidence: Results are statistically robust (well above n=30 threshold)

Detection Rate Significance

Null Hypothesis: LLM detects randomly (50% base rate)

Observed: 71.5% average detection

Z-test: p < 0.001 (highly significant)

Conclusion: Detection is not random

Accuracy Significance

Null Hypothesis: Predictions are coin flips (50% accuracy)

Observed: 91.2% average accuracy

Z-test: p < 0.001 (highly significant)

Conclusion: Predictions materialize significantly above chance

Generalization Evidence

1. Multi-Pattern Validation

Test: 3 different narrative framings of same constraint

Gamma positioning (general frame)
Stock pinning (expiration-focused frame)
0DTE hedging (0DTE-focused frame)

Result: All three show high detection/accuracy

Conclusion: Detection generalizes across narrative variations

2. Multi-Quarter Validation

Test: Q1, Q3, Q4 across 2024 (different volatility regimes)

Result: Detection stable across quarters despite alpha decline

Conclusion: Detection robust to changing market conditions

3. Multi-Year Validation (Partial)

Test: Spot-checked 2022-2023 (limited data)

Result: Similar detection patterns

Conclusion: Suggests temporal generalization (needs full validation)

Obfuscation Validation

Methodology Test

Question: Does obfuscation truly prevent temporal context leakage?

Test: If LLM used temporal context, detection would track profitability

Observed:

Detection: Stable (100% Q1→Q4)
Profitability: Declined (+21 bps → -1 bp)
Correlation: Near zero

Conclusion: LLM not using temporal context (obfuscation works)

Sanity Check: Biased vs Unbiased

Biased prompts (assume pattern exists):

Detection: 100% (by design)
Accuracy: 87-93% (predictions still materialize)

Unbiased prompts (can reject pattern):

Detection: 67-78% (realistic)
Accuracy: 86-98% (similar or higher)

Key Finding: Even when forced to detect, LLM's predictions materialize

Implication: Pattern is real (not hallucination), LLM correctly identifies it

Outcome Verification

Forward Returns

Metric: T+1 forward return (next-day spot move)

Calculation:

forward_return_t1 = (spot_t1 - spot_t0) / spot_t0

Prediction Check:

LLM says "UP" and forward_return_t1 > 0 → Correct
LLM says "DOWN" and forward_return_t1 < 0 → Correct

Accuracy: 91.2% of predictions match forward return direction

Realized Volatility

Metric: Max gain/loss over next 3 days

Observation: Negative GEX days show higher realized volatility

Correlation: -0.65 (net GEX vs realized vol)

Validation: Dealer hedging amplifies moves (as theory predicts)

Limitations

What We Did NOT Test

Causal Attribution:
- Did spot move BECAUSE of dealer hedging?
- Or did both occur for unrelated reasons?
Alternative Explanations:
- Could other mechanisms produce same patterns?
- Need more controlled experiments
Cross-Asset Generalization:
- Only tested SPY (index options)
- Individual stocks may differ (Paper #3 topic)
Sequential Dynamics:
- Only single-day snapshots
- 5-day trajectories not tested (Paper #2 topic)

Known Issues

Q2 2024 Missing:
- Data quality issues
- Excluded from validation
- Future work: Rebuild Q2 database
Options Data Gaps:
- yfinance has incomplete historical data
- Some dates missing
- Premium data source needed for complete coverage
LLM Model Dependence:
- Only tested GPT-4/GPT-4o-mini
- Other LLMs may differ (future work)

Key Takeaways

1. LLMs Can Detect Structural Constraints

Evidence: 71.5% detection rate without temporal context

Conclusion: LLMs reason about market mechanics, not just pattern matching

2. Detection ≠ Profitability

Evidence: 100% detection while alpha declined to zero

Conclusion: LLM detects market structure, not trading opportunities

Implication: Methodology prevents temporal leakage (validates obfuscation)

3. Multi-Pattern Generalization

Evidence: 3 narrative variations → same underlying constraint detected

Conclusion: LLM identifies causal mechanism, not surface keywords

4. Predictions Materialize

Evidence: 91.2% accuracy (significantly above chance)

Conclusion: Detected constraints actually drive price action

5. Obfuscation Testing Works

Evidence: Biased vs unbiased comparison, detection-profitability divergence

Conclusion: Framework successfully distinguishes understanding from memorization

Future Work

See Research Roadmap for detailed plans.

Near-term (2026):

Paper #2: Sequential GEX Analysis (temporal dynamics)
Paper #3: Cross-Asset Generalization (individual equities)

Long-term (2026+): 3. Pattern Discovery: Can LLMs discover novel constraints? 4. Comparative LLMs: GPT-4 vs Claude vs o3-mini 5. Real-Time Applications: Market surveillance systems

Validation Reports

Full Results: reports/validation/pattern_taxonomy/

Archive: docs/archive/multipattern_validation_2024.md

Paper #1 Content: docs/papers/paper1/

Last Updated: October 25, 2025

Key Results

Key Results: Paper #1 Findings

Primary Results (Unbiased, Full 2024)

Aggregate Performance

Pattern-Specific Results (Unbiased Prompts, Full 2024)

Critical Finding: Detection-Profitability Divergence

The Discovery

Full Year Comparison (Gamma Positioning Example)

Why This Matters

Multi-Pattern Full Year Results

All 9 Quarter-Pattern Combinations

Alpha Decline Across Patterns

Comparison: Biased vs Unbiased Prompts (Issue #90 Discovery)

The Problem

The Fix

Results Comparison

Pattern Consolidation Discovery

Three Patterns = One Mechanism

Statistical Significance

Sample Sizes

Detection Rate Significance

Accuracy Significance

Generalization Evidence

1. Multi-Pattern Validation

2. Multi-Quarter Validation

3. Multi-Year Validation (Partial)

Obfuscation Validation

Methodology Test

Sanity Check: Biased vs Unbiased

Outcome Verification

Forward Returns

Realized Volatility

Limitations

What We Did NOT Test

Known Issues

Key Takeaways

1. LLMs Can Detect Structural Constraints

2. Detection ≠ Profitability

3. Multi-Pattern Generalization

4. Predictions Materialize

5. Obfuscation Testing Works

Future Work

Validation Reports

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally