RoadMap
Status: Paper #1 submitted (Oct 26, 2025), Papers #2-3 planned
This roadmap outlines the multi-paper research trajectory for validating LLM understanding of market microstructure constraints through obfuscation testing.
Core Methodology: Obfuscation testing framework (strip temporal context, force reasoning from structure)
Test Domain: Options market dealer constraints (gamma exposure hedging)
Key Innovation: Rigorous validation that distinguishes understanding from memorization
| Paper | Status | Timeline | Contribution |
|---|---|---|---|
| Paper #1 | ✅ Submitted | Oct 2025 | Baseline obfuscation methodology (single-day, SPY) |
| Paper #2 | 📋 Planned | Q1 2026 | Temporal dynamics (sequential GEX analysis) |
| Paper #3 | 📋 Planned | Q2 2026 | Cross-asset generalization (individual equities) |
| Paper #4+ | 💭 Future | 2026+ | Pattern discovery, comparative LLMs, hybrid systems |
Paper #1:
Title: "Validating Large Language Model Understanding of Market Microstructure Through Obfuscation Testing"
Target: LLM-Finance 2025 Workshop @ IEEE BigData 2025
Contribution:
- Novel obfuscation testing framework for LLM validation
- Proof that LLMs can detect structural dealer constraints without temporal context
- Multi-pattern validation (3 patterns, 242 days, 726 tests)
Key Results:
- Detection: 71.5% average (unbiased prompts across 3 patterns)
- Accuracy: 91.2% (share of predictions that materialize)
- Validation: Full 2024 (242 trading days per pattern)
Critical Finding: Detection-Profitability Divergence
- LLM detection stayed stable (84-100%) while profitability fell toward 0% from Q1 to Q4 2024
- Shows that detection is based on structure, not profit optimization
- Validates that the methodology prevents temporal context leakage
Documentation: Paper #1 Full Content
GitHub Issues: #88 (Status Tracking)
Paper #2:
Title (Proposed): "Temporal Dynamics of Dealer Constraints: Sequential Gamma Exposure Analysis with LLMs"
Target: Journal submission (6-8 pages)
Advisor feedback: "Currently you are looking on single day gamma exposure, will it be worthy look at most recent 5 days to detect the hidden force? I mean the sequential changes of gamma exposure would bring more info on dealers intention. This could be a next more comprehensive paper even before going to individual stocks"
Research Questions:
- Can LLMs detect constraint trajectories (not just snapshots)?
- Does sequential context improve predictive accuracy over single-day?
- What temporal patterns emerge in dealer hedging behavior?
5-Day Lookback Windows:
- Day T-4 to Day T+0 (sequential context)
- Maintain obfuscation: "Day T-4", "Day T-3", etc. (no real dates)
- New pattern taxonomy: Accumulation, relief, reversal, persistence
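A minimal sketch of how the 5-day lookback could be built while keeping the obfuscation intact. It assumes each day is available as a record holding a date and a net GEX value; the `DailyGex` record, the field names, and the function names are hypothetical and are not taken from the project code. Real dates are used only for ordering and never reach the output.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DailyGex:
    """One trading day's net gamma exposure (hypothetical record shape)."""
    trading_date: date
    net_gex_usd: float


def relative_label(offset: int) -> str:
    """Map an offset such as -4 or 0 to an obfuscated label ('Day T-4', 'Day T+0')."""
    return f"Day T{offset:+d}"


def obfuscated_windows(series, size=5):
    """Yield rolling windows of (label, net_gex_usd) pairs with real dates dropped."""
    if size < 1:
        raise ValueError("size must be at least 1")
    # Dates are used for ordering only; they are not copied into the window.
    ordered = sorted(series, key=lambda row: row.trading_date)
    for end in range(size - 1, len(ordered)):
        window = ordered[end - size + 1 : end + 1]
        yield [
            (relative_label(i - (size - 1)), row.net_gex_usd)
            for i, row in enumerate(window)
        ]
```

A series of n trading days yields n - size + 1 overlapping windows under this scheme. Any further filtering of windows is outside the scope of this sketch.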
Example Sequential Prompt:
Sequential GEX Data (Day T-4 to Day T+0):
Day T-4: Net GEX: -$2.1B (negative gamma)
Day T-3: Net GEX: -$3.2B (negative gamma INCREASING)
Day T-2: Net GEX: -$4.1B (negative gamma INCREASING)
Day T-1: Net GEX: -$4.8B (negative gamma INCREASING)
Day T+0: Net GEX: -$5.2B (negative gamma PEAK)
Trajectory: Escalating short gamma over 5 days (-$2.1B → -$5.2B)
WHO is forcing WHOM to do WHAT?
Consider the TRAJECTORY of constraints, not just current state.
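The prompt above could be rendered from a list of net GEX values along the following lines. This is a sketch under assumptions: the function names are hypothetical, and the annotation rules (INCREASING when the magnitude grows day over day, PEAK when the last day has the largest magnitude) are inferred from the single example shown. The wording for positive gamma and for non-escalating trajectories is a placeholder, since the roadmap gives no example of those cases.

```python
def format_gex(value_usd: float) -> str:
    """Format dollars as signed billions, e.g. -2.1e9 becomes '-$2.1B'."""
    sign = "-" if value_usd < 0 else ""
    return f"{sign}${abs(value_usd) / 1e9:.1f}B"


def annotate(values):
    """Describe each day relative to the previous one (rules inferred from the example)."""
    peak = max(abs(v) for v in values)
    notes = []
    for i, v in enumerate(values):
        base = "negative gamma" if v < 0 else "positive gamma"
        if i == 0:
            notes.append(base)
        elif i == len(values) - 1 and abs(v) == peak:
            notes.append(f"{base} PEAK")
        elif abs(v) > abs(values[i - 1]):
            notes.append(f"{base} INCREASING")
        else:
            notes.append(base)
    return notes


def summarize(values):
    """One-line trajectory summary; only the escalating short-gamma case is documented."""
    start, end = format_gex(values[0]), format_gex(values[-1])
    escalating = all(v < 0 for v in values) and all(
        later < earlier for earlier, later in zip(values, values[1:])
    )
    if escalating:
        return f"Trajectory: Escalating short gamma over {len(values)} days ({start} → {end})"
    # Placeholder wording: the roadmap does not show a non-escalating example.
    return f"Trajectory: Net GEX moved over {len(values)} days ({start} → {end})"


def build_prompt(values_usd):
    """Render the sequential prompt using relative day labels only (no real dates)."""
    if not values_usd:
        raise ValueError("at least one day of data is required")
    n = len(values_usd)
    labels = [f"Day T{i - (n - 1):+d}" for i in range(n)]
    lines = [f"Sequential GEX Data ({labels[0]} to {labels[-1]}):"]
    for label, value, note in zip(labels, values_usd, annotate(values_usd)):
        lines.append(f"{label}: Net GEX: {format_gex(value)} ({note})")
    lines.append(summarize(values_usd))
    lines.append("WHO is forcing WHOM to do WHAT?")
    lines.append("Consider the TRAJECTORY of constraints, not just current state.")
    return "\n".join(lines)
```

Calling `build_prompt([-2.1e9, -3.2e9, -4.1e9, -4.8e9, -5.2e9])` reproduces the example prompt line for line.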
Expected Contributions:
- Temporal extension of the obfuscation testing framework
- Evidence that LLMs reason about constraint trajectories
- Pattern taxonomy expansion (4 new temporal pattern types)
- Predictive accuracy improvement via sequential context
Possible Outcomes:
- Best case: Accuracy improves 91% → 96% (sequential adds value)
- Neutral: Similar accuracy (single-day sufficient)
- Worst case: Accuracy decreases (sequential confuses LLM)
Implementation Plan:
- Phase 1 (Day 1): Database query extension (5-day windows)
- Phase 2 (Day 2): Sequential prompt template
- Phase 3 (Days 3-4): Validation runs (169 5-day windows on SPY 2024)
- Phase 4 (Day 5): Comparative analysis (single vs sequential)
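Phase 4 does not say how the two modes would be compared. One possible sketch, assuming both modes are scored on the same windows with a boolean correct/incorrect outcome, reports paired accuracies together with an exact McNemar test on the windows where the modes disagree. The function name, the input shape, and the choice of McNemar's test are assumptions for illustration, not the project's stated method.

```python
from math import comb


def compare_modes(single_correct, sequential_correct):
    """Compare single-day vs sequential predictions scored on the same windows.

    Both arguments are equal-length sequences of booleans (True = prediction
    materialized). Returns accuracies, discordant counts, and a two-sided
    exact McNemar p-value.
    """
    if len(single_correct) != len(sequential_correct):
        raise ValueError("both modes must be scored on the same windows")
    total = len(single_correct)
    if total == 0:
        raise ValueError("no windows to compare")

    pairs = list(zip(single_correct, sequential_correct))
    only_single = sum(1 for s, q in pairs if s and not q)
    only_sequential = sum(1 for s, q in pairs if q and not s)

    # Exact McNemar: under the null, discordant pairs split 50/50 (binomial).
    discordant = only_single + only_sequential
    if discordant == 0:
        p_value = 1.0
    else:
        k = min(only_single, only_sequential)
        tail = sum(comb(discordant, i) for i in range(k + 1)) / 2 ** discordant
        p_value = min(1.0, 2 * tail)

    return {
        "single_accuracy": sum(1 for s in single_correct if s) / total,
        "sequential_accuracy": sum(1 for q in sequential_correct if q) / total,
        "only_single_correct": only_single,
        "only_sequential_correct": only_sequential,
        "p_value": p_value,
    }
```

A paired test is used here because both modes see the same underlying days, so windows where the two agree carry no information about which mode is better.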
Dataset: SPY 2024 (existing data, no new collection needed)
Estimated Effort: 5 days implementation + 2-3 weeks analysis/writing
GitHub Issue: #89 (Sequential GEX Analysis)
Dependency: Paper #1 acceptance/publication
Paper #3:
Title (Proposed): "Cross-Asset Validation of LLM Market Microstructure Understanding"
Target: Journal submission (8-10 pages)
Research Questions:
- Does obfuscation testing generalize beyond SPY index options?
- Do dealer constraints differ between index and single-name options?
- Can LLMs detect stock-specific vs market-wide patterns?
Test on 10-20 Individual Stocks:
- High liquidity: AAPL, MSFT, NVDA, TSLA, etc.
- Use sequential analysis if Paper #2 validates it
- Compare dealer dynamics: Index (SPY) vs single-name
Key Differences (Index vs Single-Name):
- Index options: Broader dealer base, market-making focus
- Single-name options: Concentrated positions, hedging focus
- Gamma dynamics: SPY has constant 0DTE volume, stocks vary
- Liquidity: SPY ultra-liquid, individual stocks more fragmented
Expected Contributions:
- Full generalization proof (methodology works beyond single asset)
- Cross-asset comparison (index vs single-name dealer dynamics)
- Pattern persistence analysis (universal vs asset-specific constraints)
- Combined temporal + cross-asset validation (if Paper #2 successful)
Data Requirements:
- Individual stock options data (2024)
- ~10-20 stocks × 242 days = ~2,420-4,840 tests
- Higher data collection effort than Paper #2
Estimated Effort:
- 1-2 weeks data collection (individual stocks)
- 1 week validation runs
- 2-3 weeks analysis/writing
GitHub Issue: #6 (Cross-asset validation) - relates to Paper #3
Dependencies:
- Paper #1 acceptance
- Paper #2 submission (determines whether the sequential method is validated)
Research Question: Can LLMs discover novel patterns (not just validate known ones)?
Methodology:
- Unsupervised pattern mining with LLMs
- Move from validation → discovery
- Different evaluation framework (data mining risks)
Challenges:
- Requires different validation methodology
- Higher risk of false positives
- Need expert validation for novel patterns
Status: Deferred to Paper #4+ (fundamentally different problem class)
Research Question: How do different LLM architectures perform on constraint detection?
Methodology:
- Test multiple LLMs: GPT-4, o3-mini, Claude, open-source models
- Reasoning capabilities comparison
- Structured output quality assessment
Key Comparison: Reasoning models (o3-mini) vs standard models (GPT-4)
- Hypothesis: Explicit reasoning improves causal identification
Status: Medium-term (requires o3-mini availability)
Research Question: Are LLM confidence scores well-calibrated to empirical accuracy?
Methodology:
- Compare stated confidence to prediction materialization rates
- Develop post-processing calibration adjustments if needed
- Test across sequential and cross-asset contexts
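The comparison of stated confidence to materialization rates could be sketched as a reliability table plus an expected calibration error (ECE). This assumes confidence is reported on a 0 to 1 scale and that each prediction has a boolean outcome; the function names, the equal-width binning, and the default of five bins are illustrative choices, not part of the roadmap.

```python
def calibration_table(confidences, outcomes, n_bins=5):
    """Group predictions into equal-width confidence bins and compare to hit rates."""
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, outcomes):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, bool(hit)))

    rows = []
    for index, members in enumerate(bins):
        if not members:
            continue  # skip empty bins rather than reporting undefined rates
        rows.append({
            "bin": (index / n_bins, (index + 1) / n_bins),
            "count": len(members),
            "mean_confidence": sum(c for c, _ in members) / len(members),
            "hit_rate": sum(1 for _, h in members if h) / len(members),
        })
    return rows


def expected_calibration_error(rows):
    """Count-weighted mean gap between stated confidence and observed hit rate."""
    total = sum(row["count"] for row in rows)
    if total == 0:
        raise ValueError("no predictions to evaluate")
    return sum(
        row["count"] / total * abs(row["mean_confidence"] - row["hit_rate"])
        for row in rows
    )
```

An ECE near zero would indicate that stated confidence tracks empirical accuracy. A large gap in specific bins would point to where a post-processing adjustment might be needed.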
Status: Analysis component (fold into Paper #2 or #3, not standalone)
Research Question: Can we combine formal verification + LLM reasoning?
Methodology:
- Formal methods: Prove constraint properties mathematically
- LLM reasoning: Assess practical materialization from context
- Complementary strengths → robust validation
Status: Long-term vision (2026+)
Research Question: Can obfuscation-validated LLMs monitor markets in real-time?
Application:
- Automated constraint detection
- Explainable alerts (WHO→WHOM→WHAT)
- Regulatory reporting (market structure surveillance)
Status: Long-term (requires production infrastructure)
Original proposal: Explain why profitability declined Q1→Q4 2024 despite stable detection
Status: SUPERSEDED - Fold into Paper #2 discussion section
Rationale: Interesting, but not a core methodology contribution. Sequential analysis may naturally explain regime changes.
Original proposal: Paper #3 focused on unsupervised pattern mining
Status: DEFERRED to Paper #4+
Rationale: Advisor sequence ("before going to individual stocks") prioritizes cross-asset generalization. Pattern discovery is a fundamentally different problem that requires a different validation framework.
Decision: Proceed with Paper #2 (Sequential GEX) implementation
- Timeline: Start immediately after acceptance notification
- Effort: 5 days implementation + 2-3 weeks writing
- Risk: Low (uses existing data)
Decision 1: Include sequential in Paper #2 or defer?
- If accuracy improves: Paper #2 focuses on sequential methodology
- If neutral/worse: Fold into Paper #1 discussion, proceed to Paper #3 without sequential
Decision 2: Timeline for Paper #3
- If Paper #2 progresses quickly: Start Paper #3 data collection in parallel with Paper #2 writing
- If Paper #2 is delayed: Work sequentially (finish Paper #2, then start Paper #3)
Decision: After Papers #2-3 are complete
Assess which long-term direction has the most impact:
- Pattern discovery (high risk, high reward)
- Comparative LLMs (medium risk, clear contribution)
- Hybrid systems (long-term vision)
- Real-time applications (practical impact)
Paper #1 (Workshop):
- LLM-Finance 2025 Workshop @ IEEE BigData 2025
- Deadline: October 26, 2025 ✅
- Format: 4-6 page workshop paper
Paper #2 (Journal):
- Target: Journal of Financial Markets, Journal of Finance, or similar
- Format: 6-8 page journal article
- Timeline: Q1 2026 submission
Paper #3 (Journal):
- Target: Same tier as Paper #2
- Format: 8-10 pages (larger scope with cross-asset)
- Timeline: Q2 2026 submission
Paper #4+ (Journal/Conference):
- Depends on direction chosen
- Timeline: 2026+
Consider presenting at:
- AFA (American Finance Association)
- WFA (Western Finance Association)
- MFA (Midwest Finance Association)
- NeurIPS (ML track)
- ICML (Finance + ML)
Throughout all papers, maintain:
- Obfuscation rigor: Always strip temporal context
- WHO→WHOM→WHAT: Explicit causal identification
- Academic honesty: Report failures and limitations
- Reproducibility: All code/data documented
- Generalization: Prove methodology scales beyond cherry-picked examples
| Date | Milestone |
|---|---|
| ✅ Oct 26, 2025 | Paper #1 submitted |
| Nov-Dec 2025 | Paper #1 review period |
| Jan 2026 | Start Paper #2 (sequential GEX) |
| Q1 2026 | Paper #2 submission |
| Q2 2026 | Paper #3 submission (cross-asset) |
| 2026+ | Paper #4+ (discovery/comparative/hybrid) |
Key Dependency: Paper #1 acceptance gates the Paper #2 timeline. If acceptance is delayed, adjust subsequent timelines accordingly.
Full Details: See docs/papers/research_roadmap.md in repository
Last Updated: October 25, 2025