
Methodology

WormsCanned edited this page Oct 26, 2025 · 1 revision

Methodology: Obfuscation Testing Framework

Core Question: How do we know whether LLMs truly understand market constraints or simply memorize patterns from training data?

Solution: Obfuscation Testing - Strip all temporal context and force reasoning purely from structure.


The Problem: Training Data Leakage

Why Standard Testing Fails

When testing LLMs on financial markets, we face a fundamental challenge:

Problem: LLMs may have seen similar data during training

  • Historical market data widely available online
  • News articles, research papers, trading forums
  • Pattern descriptions in public documentation

Risk: Detection could come from memorization, not understanding

  • LLM recognizes "January 2024" → recalls market events
  • LLM sees "SPY" → activates financial domain knowledge
  • LLM pattern-matches keywords rather than understanding constraints

Standard Approaches Are Insufficient

  • ❌ Test on recent data: may still have been seen (training cutoffs are unclear)
  • ❌ Use different tickers: doesn't eliminate temporal context
  • ❌ Ask for explanations: LLMs can generate plausible-sounding reasoning without true understanding


Our Solution: Obfuscation Testing

Core Principle

Strip ALL temporal and contextual information that could enable memorization

Instead of:

Date: January 2, 2024
Ticker: SPY
Net GEX: -$8.95B (negative gamma)
Spot price: $474.60

We present:

Day T+0 (obfuscated test day)
Asset: INDEX_1
Net GEX: -$8.95B (negative gamma)
Spot price: $474.60

What Gets Obfuscated

  1. Dates: "2024-01-02" → "Day T+0"
  2. Tickers: "SPY" → "INDEX_1"
  3. Relative dates: "T+1", "T+7", "T+30" (no weekday/month clues)
  4. Events: No FOMC meetings, earnings, holidays mentioned
  5. Context: No news, no market regime descriptions
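As a sketch, rules 1-3 amount to a small record transform (the field names here are hypothetical; the project's actual logic lives in DataObfuscator):

```python
from datetime import date

def obfuscate_record(record: dict, base_date: date, ticker_map: dict) -> dict:
    """Apply the obfuscation rules: dates become relative offsets,
    tickers become generic labels, contextual fields are dropped."""
    offset = (record["date"] - base_date).days
    return {
        "date": f"Day T+{offset}",              # rules 1 and 3: relative dates only
        "asset": ticker_map[record["ticker"]],  # rule 2: generic asset label
        "net_gex": record["net_gex"],           # market structure is preserved
        "spot_price": record["spot_price"],
        # rules 4 and 5: event and context fields are simply never copied over
    }

record = {"date": date(2024, 1, 2), "ticker": "SPY",
          "net_gex": -8.95e9, "spot_price": 474.60}
obfuscated = obfuscate_record(record, base_date=date(2024, 1, 2),
                              ticker_map={"SPY": "INDEX_1"})
print(obfuscated["date"], obfuscated["asset"])  # → Day T+0 INDEX_1
```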

What Remains

  • Market structure: GEX values, strikes, volumes
  • Options mechanics: Calls/puts, expirations, IV
  • Dealer constraints: Regulatory requirements (delta neutrality)
  • Physical realities: Time decay, gamma explosion

The WHO → WHOM → WHAT Framework

Obfuscation alone isn't enough. We require explicit causal identification.

Three-Part Analysis

  1. WHO: Identify market participants

    • Dealers, retail traders, institutional hedgers
  2. WHOM: Who is being forced/constrained?

    • Not who benefits, but who has no choice
  3. WHAT: What action are they forced to take?

    • Specific, verifiable trading behavior

Example: Gamma Positioning

WHO: Options dealers (market makers)

WHOM: Dealers are forced by:

  • Regulatory mandate: Must maintain delta neutrality (can't hold directional risk)
  • Risk limits: Large gamma positions create unacceptable volatility exposure

WHAT: Dealers must:

  • Continuously rebalance hedges as spot price moves
  • Sell underlying when price falls (short gamma forces selling into weakness)
  • Buy underlying when price rises (short gamma forces buying into strength)

Key: This isn't a choice; it's a constraint. Dealers face regulatory and risk penalties if they don't comply.
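The forcing is visible in a toy delta-hedge calculation (illustrative numbers, not the project's code):

```python
def hedge_trade(position_delta: float, gamma: float, spot_move: float) -> float:
    """Shares the dealer must trade to restore delta neutrality.

    After a spot move, position delta drifts by roughly gamma * move;
    the mandated hedge trade is the negation of the new delta.
    """
    new_delta = position_delta + gamma * spot_move
    return -new_delta  # positive = buy shares, negative = sell shares

# Short-gamma dealer, currently flat (delta 0), gamma = -500 shares per point
print(hedge_trade(0.0, -500.0, +2.0))  # price rises 2 pts → +1000: buy into strength
print(hedge_trade(0.0, -500.0, -2.0))  # price falls 2 pts → -1000: sell into weakness
```

With negative gamma the hedge trade always has the same sign as the move, so hedging amplifies volatility; with positive gamma the signs flip and hedging dampens it.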


Validation: Mechanical vs Narrative Patterns

We classify patterns into two categories:

Mechanical Patterns ✅

Definition: Patterns driven by constraints dealers cannot avoid

Characteristics:

  • Regulatory mandate (delta neutrality)
  • Physical reality (time decay)
  • Risk limits (gamma explosion)
  • Contractual obligation (settlement rules)

Examples:

  • Gamma positioning (regulatory requirement)
  • Stock pinning (time decay + delta hedging)
  • 0DTE hedging (concentrated expiration risk)

Expected LLM Behavior: High detection rate even with obfuscation

Narrative Patterns ❌

Definition: Patterns requiring temporal/contextual knowledge

Characteristics:

  • Time-dependent (knowing "Friday 3:30 PM")
  • Event-driven (FOMC meetings, earnings)
  • Statistical anomalies (volume spikes without mechanism)
  • Context-dependent (works sometimes, not always)

Examples:

  • "Friday 3:30 squeeze" (requires knowing day of week)
  • "FOMC drift" (requires knowing FOMC dates)
  • "Volume anomaly" (no mechanical constraint)

Expected LLM Behavior: Low detection rate with obfuscation (reveals memorization)


Success Criteria

Detection Rate

Metric: Percentage of test days where LLM correctly identifies constraint

Threshold: ≥60% detection rate (significantly better than random)

Interpretation:

  • 100%: Perfect mechanical understanding
  • 60-80%: Strong structural detection
  • <60%: Pattern may be narrative, not mechanical

Predictive Accuracy

Metric: Percentage of predictions that materialize

Calculation:

# LLM predicts: "Dealers forced to buy, expect upward pressure"
# Verification: check whether SPY actually moved up on T+1

prediction_correct = (prediction.direction == "UP" and forward_return > 0)

Threshold: ≥80% accuracy (predictions must materialize)

Interpretation:

  • High accuracy: LLM understands causal mechanism
  • Low accuracy: LLM detecting pattern that doesn't drive price action

Sample Size

Requirement: Minimum 30 samples per pattern

Rationale: Statistical significance

Our Implementation: 242 trading days × 3 patterns = 726 tests


Obfuscation Implementation

DataObfuscator Class

Located: src/data_sources/data_obfuscator.py

Key Features:

  1. Date Obfuscation: Maps real dates → "Day T+X" format
  2. Ticker Obfuscation: Maps "SPY" → "INDEX_1"
  3. Consistency: Same asset always gets same obfuscated name within experiment
  4. Reversibility: Maintains mapping for verification

Example Usage:

from datetime import datetime

from src.data_sources.data_obfuscator import DataObfuscator

obfuscator = DataObfuscator()

# Obfuscate data
obfuscated = obfuscator.obfuscate_data(
    gex_data=gex_results,
    test_date=datetime(2024, 1, 2),
    ticker="SPY"
)

# LLM sees:
# Day T+0, INDEX_1, Net GEX: -$8.95B

Prompt Template (Obfuscated)

OBFUSCATED_PROMPT = """
You are analyzing options market mechanics on {obfuscated_date}.

**Market Data** (Asset: {obfuscated_ticker}):
- Spot Price: ${spot_price:.2f}
- Net GEX: ${net_gex_billions:.2f}B
- GEX Distribution: {gex_distribution}

**Question**: WHO is forcing WHOM to do WHAT?

**Requirements**:
1. Identify market participants and their constraints
2. Explain the FORCING mechanism (regulation, risk, physics)
3. Predict what actions are FORCED (not chosen)
4. Assign confidence (0-100%)

**No real dates, tickers, or events are provided. Reason from structure alone.**
"""

Verification Process

Step 1: Detection

Run LLM on obfuscated data → Did it detect the constraint?

Pass: LLM identifies dealers, gamma hedging, forced buying/selling

Fail: LLM says "no pattern" or detects wrong constraint

Step 2: Accuracy

Check if prediction materialized using forward returns

Data: OutcomeCalculator computes T+1, T+3 forward returns

Verification:

# LLM predicted "dealers forced to buy → upward pressure"
accurate = (llm_prediction == "UP" and forward_return_t1 > 0)

Step 3: Aggregation

Compute detection rate and accuracy across all test days

Output: YAML validation report

Example:

pattern_name: gamma_positioning
detection_rate_pct: 100.0
predictive_accuracy_pct: 96.2
sample_size: 53
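The aggregation behind such a report can be sketched as follows (the per-day fields `detected` and `accurate` are hypothetical names):

```python
def aggregate(pattern_name: str, results: list) -> dict:
    """Compute detection rate (over all test days) and predictive
    accuracy (over detected days only) from per-day outcome dicts."""
    n = len(results)
    detected = [r for r in results if r["detected"]]
    accurate = [r for r in detected if r["accurate"]]
    return {
        "pattern_name": pattern_name,
        "detection_rate_pct": round(100 * len(detected) / n, 1),
        "predictive_accuracy_pct": round(100 * len(accurate) / max(len(detected), 1), 1),
        "sample_size": n,
    }

report = aggregate("gamma_positioning",
                   [{"detected": True, "accurate": True}] * 50
                   + [{"detected": True, "accurate": False}] * 2
                   + [{"detected": False}])
print(report)
```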

Critical Finding: Detection-Profitability Divergence

The Experiment

Validated 3 patterns across full 2024 (Q1, Q3, Q4 = 181 days)

Result:

  • Detection: Stable (84-100%) across all quarters
  • Accuracy: High (87-98%) across all quarters
  • Profitability: Declined from +21-70 bps (Q1) → -1 to +5 bps (Q4)

Why This Matters

If the LLM were memorizing profitable trades: detection would decline along with profitability

Observed behavior: Detection stable while profits disappeared

Conclusion: LLM detects structural constraints, not profit opportunities

Implication: Methodology successfully prevents temporal context leakage


Comparison: Biased vs Unbiased Prompts

Discovery (Issue #90)

Initial testing used biased prompts that assumed pattern existed:

Biased Prompt (Q3+Q4 2024, 128 days):

"Dealers are short gamma. What are they forced to do?"

Results:

  • Detection: 100% (by design—prompt assumes pattern)
  • Accuracy: 87.5-93.0% (predictions materialized)

Fix: Unbiased Prompts

Unbiased Prompt (Full 2024, 242 days):

"Analyze this market data. WHO is forcing WHOM to do WHAT? If no pattern exists, say so."

Results:

  • Detection: 67.4-77.7% (realistic detection rate)
  • Accuracy: 86.5-98.4% (similar accuracy)

Key Insight: Biased prompts inflate detection but don't affect accuracy (predictions still materialize)


Limitations

What Obfuscation Doesn't Solve

  1. LLM training on methodology: if an LLM was trained on this exact obfuscation approach, the scheme itself becomes a recognizable cue
  2. Common pattern names: "Gamma positioning" may trigger memorized associations
  3. Indirect temporal clues: GEX magnitudes might correlate with time periods

Mitigations

  1. Test multiple pattern framings: Same constraint, different narratives
  2. Vary obfuscation schemes: T+0 vs Day 1 vs Test Day
  3. Compare to random data: Ensure LLM doesn't hallucinate patterns
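Mitigation 3 can be sketched as a placebo generator (a hypothetical helper, not part of the repo):

```python
import random

def random_gex_series(n_days: int, scale: float = 5e9, seed: int = 0) -> list:
    """Structureless Net GEX values for a placebo run: if the LLM 'detects'
    constraints in this noise at rates similar to real data, its detections
    are hallucinated rather than structural."""
    rng = random.Random(seed)  # seeded for reproducible placebo runs
    return [rng.uniform(-scale, scale) for _ in range(n_days)]

placebo = random_gex_series(30)
print(len(placebo))  # → 30
```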

Future Work

  • Comparative LLMs: Test GPT-4, Claude, o3-mini (do all detect same patterns?)
  • Blind validation: Present random GEX data, check false positive rate
  • Formal verification: Mathematical proof of constraint existence

Reproducibility

Full Methodology Available

Code: All obfuscation logic in src/data_sources/data_obfuscator.py

Validation Scripts: scripts/validation/validate_pattern_taxonomy.py

Prompts: Documented in src/agents/market_mechanics_agent.py

Results: YAML files in reports/validation/pattern_taxonomy/

Running Your Own Validation

# Set up environment
export PYTHONPATH=$(pwd):$PYTHONPATH
export OPENAI_API_KEY="your-key"

# Run obfuscated validation
python scripts/validation/validate_pattern_taxonomy.py \
  --pattern gamma_positioning \
  --symbol SPY \
  --start-date 2024-01-02 \
  --end-date 2024-03-29 \
  --confidence 60.0

Output: YAML report with detection rate, accuracy, sample details


References

See docs/guides/ for detailed conceptual explanations:

  • obfuscation-testing-explained.md - In-depth methodology guide
  • gex-metrics-explained.md - Gamma exposure calculations
  • validation-framework.md - Pattern taxonomy and success criteria

Last Updated: October 25, 2025
