
🟡 Tokenization System for LLM Input #5

@iAmGiG

Description


Overview

Design and implement a dynamic tokenization system in src/tokenization/ that converts continuous market metrics into discrete token sequences optimized for LLM pattern analysis.

Tasks

  • Create tokenization modules in src/tokenization/
  • Implement adaptive binning based on rolling percentiles
  • Handle multiple data types (price, GEX, volume)
  • Generate consistent token vocabulary with clear definitions
  • Create multi-timeframe sequence generator
  • Build temporal sequences of varying lengths (5, 10, 20 days)
  • Include context tokens (days_to_opex, days_since_fomc)
  • Handle missing/sparse data gracefully
  • Optimize for GPT-4o-mini/GPT-4o context window limits
  • Integrate with existing data obfuscation tools (src/validation/)
  • Implement sequence validation framework

Token Vocabulary Design

# GEX States (percentile-based)
GEX_TOKENS = [
    'GEX_EXTREME_NEG',  # < 10th percentile
    'GEX_MOD_NEG',      # 10-40th percentile  
    'GEX_NEUTRAL',      # 40-60th percentile
    'GEX_MOD_POS',      # 60-90th percentile
    'GEX_EXTREME_POS'   # > 90th percentile
]

# Price Movement States
PRICE_TOKENS = [
    'CRASH',      # < -3%
    'BIG_DOWN',   # -3% to -1%
    'SMALL_DOWN', # -1% to -0.25%
    'FLAT',       # -0.25% to 0.25%
    'SMALL_UP',   # 0.25% to 1%
    'BIG_UP',     # 1% to 3%
    'MOON'        # > 3%
]

# Market Events
EVENT_TOKENS = [
    'CROSS_FLIP',        # GEX crosses zero
    'BREAK_CALL_WALL',   # Price breaks above call wall
    'BREAK_PUT_SUPPORT', # Price breaks below put support
    'VOL_SPIKE',         # VIX > 20% daily move
    'OPEX_WEEK',         # Options expiration week
    'FOMC_WEEK'          # Federal Reserve meeting week
]
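A minimal sketch of how raw readings could be mapped onto this vocabulary. The function names are illustrative (not part of the existing codebase); the thresholds follow the comments above.

```python
# Sketch only: map a GEX percentile rank and a daily % return onto the
# token vocabulary defined above. Helper names are hypothetical.

def gex_token(percentile: float) -> str:
    """Map a GEX percentile rank (0-100) to a GEX state token."""
    if percentile < 10:
        return 'GEX_EXTREME_NEG'
    if percentile < 40:
        return 'GEX_MOD_NEG'
    if percentile < 60:
        return 'GEX_NEUTRAL'
    if percentile < 90:
        return 'GEX_MOD_POS'
    return 'GEX_EXTREME_POS'

def price_token(daily_return_pct: float) -> str:
    """Map a daily percentage return to a price movement token."""
    if daily_return_pct < -3:
        return 'CRASH'
    if daily_return_pct < -1:
        return 'BIG_DOWN'
    if daily_return_pct < -0.25:
        return 'SMALL_DOWN'
    if daily_return_pct <= 0.25:
        return 'FLAT'
    if daily_return_pct <= 1:
        return 'SMALL_UP'
    if daily_return_pct <= 3:
        return 'BIG_UP'
    return 'MOON'
```

Keeping the mapping in pure functions like these makes the binning thresholds easy to swap out when the adaptive percentile windows (below) update.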

Sequence Builder Format

# Example sequences for pattern mining
sequences = [
    ['GEX_MOD_NEG', 'CROSS_FLIP', 'BIG_DOWN', 'VOL_SPIKE', '->', 'CRASH'],
    ['GEX_EXTREME_POS', 'BREAK_CALL_WALL', 'OPEX_WEEK', '->', 'BIG_DOWN'],
    ['GEX_NEUTRAL', 'FOMC_WEEK', 'FLAT', 'FLAT', '->', 'VOL_SPIKE']
]
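The format above can be produced by a small builder that joins a lookback window of daily tokens, any event tokens, the '->' separator, and the next-day outcome as the target. This is a sketch with hypothetical function names, not the actual sequence generator.

```python
# Sketch of a sequence builder. Each training sequence is:
#   [lookback-window tokens] + [event tokens] + ['->', target]
SEP = '->'

def build_sequence(daily_tokens, events, target):
    """Join context tokens, event tokens, the separator, and the target."""
    return list(daily_tokens) + list(events) + [SEP, target]

def sliding_sequences(daily_tokens, lookback, targets):
    """Yield (lookback-day window -> next-day target) sequences."""
    for i in range(lookback, len(daily_tokens)):
        yield build_sequence(daily_tokens[i - lookback:i], [], targets[i])
```

Running the same builder with lookback = 5, 10, and 20 covers the multi-timeframe requirement with a single code path.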

Adaptive Binning Implementation

  • Use rolling 252-day windows for percentile calculation
  • Update thresholds monthly to adapt to changing market regimes
  • Handle regime changes (2020 COVID, 2022 rate hikes)
  • Validate token stability over time
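The rolling-window scheme above might look like the following sketch: thresholds are recomputed from the trailing 252 trading days and refreshed roughly monthly (~21 trading days). Function and parameter names are illustrative.

```python
# Sketch of adaptive binning: recompute GEX percentile thresholds from a
# rolling 252-day window, refreshed every ~21 trading days (monthly).
import numpy as np

def rolling_thresholds(values, window=252, refresh=21):
    """Return (start_index, [p10, p40, p60, p90]) pairs, one per refresh."""
    out = []
    for start in range(window, len(values), refresh):
        lookback = values[start - window:start]
        thresholds = np.percentile(lookback, [10, 40, 60, 90])
        out.append((start, thresholds))
    return out
```

Persisting the (start_index, thresholds) pairs also gives a direct way to audit token stability across regime changes such as 2020 and 2022.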

Acceptance Criteria

  • Clean modular structure in src/tokenization/ directory
  • Well-defined token vocabulary with statistical backing
  • Multi-timeframe sequence generation capability (5, 10, 20 day lookbacks)
  • Robust handling of data gaps and missing values
  • Optimized token efficiency for LLM context windows
  • Integration with obfuscation tools for research integrity
  • Context-aware sequences including market events
  • Validation of sequence integrity and meaning
  • Comprehensive documentation of tokenization schema
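For the sequence-validation criterion, one minimal check is vocabulary membership plus structural shape: every token must come from the defined vocabulary, and each sequence must contain exactly one '->' separator immediately before a single outcome token. A sketch, with the vocabulary mirrored from the definitions above:

```python
# Sketch of a sequence validator. VOCAB mirrors the GEX, price, and event
# token lists defined earlier in this issue, plus the '->' separator.
VOCAB = {
    'GEX_EXTREME_NEG', 'GEX_MOD_NEG', 'GEX_NEUTRAL', 'GEX_MOD_POS',
    'GEX_EXTREME_POS',
    'CRASH', 'BIG_DOWN', 'SMALL_DOWN', 'FLAT', 'SMALL_UP', 'BIG_UP', 'MOON',
    'CROSS_FLIP', 'BREAK_CALL_WALL', 'BREAK_PUT_SUPPORT', 'VOL_SPIKE',
    'OPEX_WEEK', 'FOMC_WEEK', '->',
}

def validate_sequence(seq):
    """True if all tokens are known and exactly one '->' precedes the target."""
    if any(tok not in VOCAB for tok in seq):
        return False
    return seq.count('->') == 1 and seq[-2] == '->'
```

A check like this would have flagged vocabulary drift (e.g. an undefined 'GEX_NEG' token) before sequences reach the LLM.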

Implementation Notes

  • Directory: src/tokenization/ (created during reorganization)
  • Integrate with src/validation/data_obfuscation.py for LLM testing
  • Target models: GPT-4o-mini (primary), GPT-4o (fallback)
  • Focus on dealer hedging patterns in tokenized sequences
  • Enable multi-timeframe pattern discovery beyond single indicators
  • Use existing cache system for performance optimization

Metadata

Assignees

Labels

data-pipeline: Data collection and processing tasks
llm-training: LLM pattern detection work
