
Methodology

WormsCanned edited this page Oct 26, 2025 · 1 revision

Methodology: Obfuscation Testing Framework

Core Question: How do we know whether LLMs truly understand market constraints or simply memorize patterns from training data?

Solution: Obfuscation Testing - Strip all temporal context and force reasoning purely from structure.


The Problem: Training Data Leakage

Why Standard Testing Fails

When testing LLMs on financial markets, we face a fundamental challenge:

Problem: LLMs may have seen similar data during training

  • Historical market data widely available online
  • News articles, research papers, trading forums
  • Pattern descriptions in public documentation

Risk: Detection could come from memorization, not understanding

  • LLM recognizes "January 2024" → recalls market events
  • LLM sees "SPY" → activates financial domain knowledge
  • LLM pattern-matches keywords rather than understanding constraints

Standard Approaches Are Insufficient

  • ❌ Test on recent data: may still have been seen (training cutoffs are unclear)
  • ❌ Use different tickers: doesn't eliminate temporal context
  • ❌ Ask for explanations: LLMs can generate plausible-sounding reasoning without true understanding


Our Solution: Obfuscation Testing

Core Principle

Strip ALL temporal and contextual information that could enable memorization

Instead of:

Date: January 2, 2024
Ticker: SPY
Net GEX: -$8.95B (negative gamma)
Spot price: $474.60

We present:

Day T+0 (obfuscated test day)
Asset: INDEX_1
Net GEX: -$8.95B (negative gamma)
Spot price: $474.60

What Gets Obfuscated

  1. Dates: "2024-01-02" → "Day T+0"
  2. Tickers: "SPY" → "INDEX_1"
  3. Relative dates: "T+1", "T+7", "T+30" (no weekday/month clues)
  4. Events: No FOMC meetings, earnings, holidays mentioned
  5. Context: No news, no market regime descriptions
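As a sketch, rules 1-3 amount to a small record transform (the field names here are hypothetical; the project's actual logic lives in DataObfuscator):

```python
from datetime import date

def obfuscate_record(record: dict, base_date: date, ticker_map: dict) -> dict:
    """Apply the obfuscation rules: dates become relative offsets,
    tickers become generic labels, contextual fields are dropped."""
    offset = (record["date"] - base_date).days
    return {
        "date": f"Day T+{offset}",              # rules 1 and 3: relative dates only
        "asset": ticker_map[record["ticker"]],  # rule 2: generic asset label
        "net_gex": record["net_gex"],           # market structure is preserved
        "spot_price": record["spot_price"],
        # rules 4 and 5: event and context fields are simply never copied over
    }

record = {"date": date(2024, 1, 2), "ticker": "SPY",
          "net_gex": -8.95e9, "spot_price": 474.60}
obfuscated = obfuscate_record(record, base_date=date(2024, 1, 2),
                              ticker_map={"SPY": "INDEX_1"})
print(obfuscated["date"], obfuscated["asset"])  # → Day T+0 INDEX_1
```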

What Remains

  • Market structure: GEX values, strikes, volumes
  • Options mechanics: Calls/puts, expirations, IV
  • Dealer constraints: Regulatory requirements (delta neutrality)
  • Physical realities: Time decay, gamma explosion

The WHO → WHOM → WHAT Framework

Obfuscation alone isn't enough. We require explicit causal identification.

Three-Part Analysis

  1. WHO: Identify market participants

    • Dealers, retail traders, institutional hedgers
  2. WHOM: Who is being forced/constrained?

    • Not who benefits, but who has no choice
  3. WHAT: What action are they forced to take?

    • Specific, verifiable trading behavior

Example: Gamma Positioning

WHO: Options dealers (market makers)

WHOM: Dealers are forced by:

  • Regulatory mandate: Must maintain delta neutrality (can't hold directional risk)
  • Risk limits: Large gamma positions create unacceptable volatility exposure

WHAT: Dealers must:

  • Continuously rebalance hedges as spot price moves
  • Sell underlying when price falls (short gamma forces selling into weakness)
  • Buy underlying when price rises (short gamma forces buying into strength)

Key: This isn't a choice; it's a constraint. Dealers face regulatory and risk penalties if they don't comply.
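The forcing is visible in a toy delta-hedge calculation (illustrative numbers, not the project's code):

```python
def hedge_trade(position_delta: float, gamma: float, spot_move: float) -> float:
    """Shares the dealer must trade to restore delta neutrality.

    After a spot move, position delta drifts by roughly gamma * move;
    the mandated hedge trade is the negation of the new delta.
    """
    new_delta = position_delta + gamma * spot_move
    return -new_delta  # positive = buy shares, negative = sell shares

# Short-gamma dealer, currently flat (delta 0), gamma = -500 shares per point
print(hedge_trade(0.0, -500.0, +2.0))  # price rises 2 pts → +1000: buy into strength
print(hedge_trade(0.0, -500.0, -2.0))  # price falls 2 pts → -1000: sell into weakness
```

With negative gamma the hedge trade always has the same sign as the move, so hedging amplifies volatility; with positive gamma the signs flip and hedging dampens it.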


Validation: Mechanical vs Narrative Patterns

We classify patterns into two categories:

Mechanical Patterns ✅

Definition: Patterns driven by constraints dealers cannot avoid

Characteristics:

  • Regulatory mandate (delta neutrality)
  • Physical reality (time decay)
  • Risk limits (gamma explosion)
  • Contractual obligation (settlement rules)

Examples:

  • Gamma positioning (regulatory requirement)
  • Stock pinning (time decay + delta hedging)
  • 0DTE hedging (concentrated expiration risk)

Expected LLM Behavior: High detection rate even with obfuscation

Narrative Patterns ❌

Definition: Patterns requiring temporal/contextual knowledge

Characteristics:

  • Time-dependent (knowing "Friday 3:30 PM")
  • Event-driven (FOMC meetings, earnings)
  • Statistical anomalies (volume spikes without mechanism)
  • Context-dependent (works sometimes, not always)

Examples:

  • "Friday 3:30 squeeze" (requires knowing day of week)
  • "FOMC drift" (requires knowing FOMC dates)
  • "Volume anomaly" (no mechanical constraint)

Expected LLM Behavior: Low detection rate with obfuscation (reveals memorization)


Success Criteria

Detection Rate

Metric: Percentage of test days where LLM correctly identifies constraint

Threshold: ≥60% detection rate (significantly better than random)

Interpretation:

  • 100%: Perfect mechanical understanding
  • 60-80%: Strong structural detection
  • <60%: Pattern may be narrative, not mechanical

Predictive Accuracy

Metric: Percentage of predictions that materialize

Calculation:

# LLM predicts: "Dealers forced to buy, expect upward pressure"
# Verification: check whether SPY actually moved up on T+1

prediction_correct = (prediction.direction == "UP" and forward_return > 0)

Threshold: ≥80% accuracy (predictions must materialize)

Interpretation:

  • High accuracy: LLM understands causal mechanism
  • Low accuracy: LLM detecting pattern that doesn't drive price action

Sample Size

Requirement: Minimum 30 samples per pattern

Rationale: Statistical significance

Our Implementation: 242 trading days × 3 patterns = 726 tests


Obfuscation Implementation

DataObfuscator Class

Located: src/data_sources/data_obfuscator.py

Key Features:

  1. Date Obfuscation: Maps real dates → "Day T+X" format
  2. Ticker Obfuscation: Maps "SPY" → "INDEX_1"
  3. Consistency: Same asset always gets same obfuscated name within experiment
  4. Reversibility: Maintains mapping for verification

Example Usage:

from datetime import datetime

from src.data_sources.data_obfuscator import DataObfuscator

obfuscator = DataObfuscator()

# Obfuscate data
obfuscated = obfuscator.obfuscate_data(
    gex_data=gex_results,
    test_date=datetime(2024, 1, 2),
    ticker="SPY"
)

# LLM sees:
# Day T+0, INDEX_1, Net GEX: -$8.95B

Prompt Template (Obfuscated)

OBFUSCATED_PROMPT = """
You are analyzing options market mechanics on {obfuscated_date}.

**Market Data** (Asset: {obfuscated_ticker}):
- Spot Price: ${spot_price:.2f}
- Net GEX: ${net_gex_billions:.2f}B
- GEX Distribution: {gex_distribution}

**Question**: WHO is forcing WHOM to do WHAT?

**Requirements**:
1. Identify market participants and their constraints
2. Explain the FORCING mechanism (regulation, risk, physics)
3. Predict what actions are FORCED (not chosen)
4. Assign confidence (0-100%)

**No real dates, tickers, or events are provided. Reason from structure alone.**
"""

Verification Process

Step 1: Detection

Run LLM on obfuscated data → Did it detect the constraint?

Pass: LLM identifies dealers, gamma hedging, forced buying/selling

Fail: LLM says "no pattern" or detects wrong constraint

Step 2: Accuracy

Check if prediction materialized using forward returns

Data: OutcomeCalculator computes T+1, T+3 forward returns

Verification:

# LLM predicted "dealers forced to buy → upward pressure"
accurate = (llm_prediction == "UP" and forward_return_t1 > 0)

Step 3: Aggregation

Compute detection rate and accuracy across all test days

Output: YAML validation report

Example:

pattern_name: gamma_positioning
detection_rate_pct: 100.0
predictive_accuracy_pct: 96.2
sample_size: 53
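The aggregation behind such a report can be sketched as follows (the per-day fields `detected` and `accurate` are hypothetical names):

```python
def aggregate(pattern_name: str, results: list) -> dict:
    """Compute detection rate (over all test days) and predictive
    accuracy (over detected days only) from per-day outcome dicts."""
    n = len(results)
    detected = [r for r in results if r["detected"]]
    accurate = [r for r in detected if r["accurate"]]
    return {
        "pattern_name": pattern_name,
        "detection_rate_pct": round(100 * len(detected) / n, 1),
        "predictive_accuracy_pct": round(100 * len(accurate) / max(len(detected), 1), 1),
        "sample_size": n,
    }

report = aggregate("gamma_positioning",
                   [{"detected": True, "accurate": True}] * 50
                   + [{"detected": True, "accurate": False}] * 2
                   + [{"detected": False}])
print(report)
```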

Critical Finding: Detection-Profitability Divergence

The Experiment

Validated 3 patterns across full 2024 (Q1, Q3, Q4 = 181 days)

Result:

  • Detection: Stable (84-100%) across all quarters
  • Accuracy: High (87-98%) across all quarters
  • Profitability: Declined from +21-70 bps (Q1) → -1 to +5 bps (Q4)

Why This Matters

If the LLM were memorizing profitable trades: detection would decline along with profitability

Observed behavior: Detection stable while profits disappeared

Conclusion: LLM detects structural constraints, not profit opportunities

Implication: Methodology successfully prevents temporal context leakage


Comparison: Biased vs Unbiased Prompts

Discovery (Issue #90)

Initial testing used biased prompts that assumed pattern existed:

Biased Prompt (Q3+Q4 2024, 128 days):

"Dealers are short gamma. What are they forced to do?"

Results:

  • Detection: 100% (by design—prompt assumes pattern)
  • Accuracy: 87.5-93.0% (predictions materialized)

Fix: Unbiased Prompts

Unbiased Prompt (Full 2024, 242 days):

"Analyze this market data. WHO is forcing WHOM to do WHAT? If no pattern exists, say so."

Results:

  • Detection: 67.4-77.7% (realistic detection rate)
  • Accuracy: 86.5-98.4% (similar accuracy)

Key Insight: Biased prompts inflate detection but don't affect accuracy (predictions still materialize)


Limitations

What Obfuscation Doesn't Solve

  1. LLM training on methodology: if an LLM was trained on this exact obfuscation approach, the scheme itself becomes a recognizable cue
  2. Common pattern names: "Gamma positioning" may trigger memorized associations
  3. Indirect temporal clues: GEX magnitudes might correlate with time periods

Mitigations

  1. Test multiple pattern framings: Same constraint, different narratives
  2. Vary obfuscation schemes: T+0 vs Day 1 vs Test Day
  3. Compare to random data: Ensure LLM doesn't hallucinate patterns
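Mitigation 3 can be sketched as a placebo generator (a hypothetical helper, not part of the repo):

```python
import random

def random_gex_series(n_days: int, scale: float = 5e9, seed: int = 0) -> list:
    """Structureless Net GEX values for a placebo run: if the LLM 'detects'
    constraints in this noise at rates similar to real data, its detections
    are hallucinated rather than structural."""
    rng = random.Random(seed)  # seeded for reproducible placebo runs
    return [rng.uniform(-scale, scale) for _ in range(n_days)]

placebo = random_gex_series(30)
print(len(placebo))  # → 30
```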

Future Work

  • Comparative LLMs: Test GPT-4, Claude, o3-mini (do all detect same patterns?)
  • Blind validation: Present random GEX data, check false positive rate
  • Formal verification: Mathematical proof of constraint existence

Reproducibility

Full Methodology Available

Code: All obfuscation logic in src/data_sources/data_obfuscator.py

Validation Scripts: scripts/validation/validate_pattern_taxonomy.py

Prompts: Documented in src/agents/market_mechanics_agent.py

Results: YAML files in reports/validation/pattern_taxonomy/

Running Your Own Validation

# Set up environment
export PYTHONPATH=$(pwd):$PYTHONPATH
export OPENAI_API_KEY="your-key"

# Run obfuscated validation
python scripts/validation/validate_pattern_taxonomy.py \
  --pattern gamma_positioning \
  --symbol SPY \
  --start-date 2024-01-02 \
  --end-date 2024-03-29 \
  --confidence 60.0

Output: YAML report with detection rate, accuracy, sample details


References

See docs/guides/ for detailed conceptual explanations:

  • obfuscation-testing-explained.md - In-depth methodology guide
  • gex-metrics-explained.md - Gamma exposure calculations
  • validation-framework.md - Pattern taxonomy and success criteria

Last Updated: October 25, 2025
