Status: Closed
Labels: data-pipeline (Data collection and processing tasks), llm-training (LLM pattern detection work)
Description
Overview
Design and implement a dynamic tokenization system in src/tokenization/ to convert continuous market metrics into discrete token sequences optimized for LLM pattern analysis.
Tasks
- Create tokenization modules in src/tokenization/
- Implement adaptive binning based on rolling percentiles
- Handle multiple data types (price, GEX, volume)
- Generate consistent token vocabulary with clear definitions
- Create multi-timeframe sequence generator
- Build temporal sequences of varying lengths (5, 10, 20 days)
- Include context tokens (days_to_opex, days_since_fomc)
- Handle missing/sparse data gracefully
- Optimize for GPT-4o-mini/GPT-4o context window limits
- Integrate with existing data obfuscation tools (src/validation/)
- Implement sequence validation framework
Token Vocabulary Design
```python
# GEX States (percentile-based)
GEX_TOKENS = [
    'GEX_EXTREME_NEG',  # < 10th percentile
    'GEX_MOD_NEG',      # 10th-40th percentile
    'GEX_NEUTRAL',      # 40th-60th percentile
    'GEX_MOD_POS',      # 60th-90th percentile
    'GEX_EXTREME_POS',  # > 90th percentile
]
```
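As a minimal sketch, the percentile-based binning behind these states could be implemented with `numpy.percentile` plus `searchsorted`; `gex_to_token` and its `history` argument are illustrative names, not existing project APIs.

```python
import numpy as np

# Five-state vocabulary from the design above
GEX_TOKENS = ['GEX_EXTREME_NEG', 'GEX_MOD_NEG', 'GEX_NEUTRAL',
              'GEX_MOD_POS', 'GEX_EXTREME_POS']

def gex_to_token(value, history):
    """Bucket a raw GEX value by its percentile rank in `history` (hypothetical helper)."""
    cuts = np.percentile(history, [10, 40, 60, 90])
    # searchsorted yields an index 0..4, which maps directly into GEX_TOKENS
    return GEX_TOKENS[int(np.searchsorted(cuts, value, side='right'))]
```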
```python
# Price Movement States
PRICE_TOKENS = [
    'CRASH',       # < -3%
    'BIG_DOWN',    # -3% to -1%
    'SMALL_DOWN',  # -1% to -0.25%
    'FLAT',        # -0.25% to 0.25%
    'SMALL_UP',    # 0.25% to 1%
    'BIG_UP',      # 1% to 3%
    'MOON',        # > 3%
]
```
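Unlike the GEX states, these bins use fixed return thresholds, so a sketch can simply bisect a static cut list; `price_to_token` is an illustrative name, not an existing module function.

```python
from bisect import bisect_right

PRICE_TOKENS = ['CRASH', 'BIG_DOWN', 'SMALL_DOWN', 'FLAT',
                'SMALL_UP', 'BIG_UP', 'MOON']
# Cut points in % separating the seven states above (upper bound exclusive)
PRICE_CUTS = [-3.0, -1.0, -0.25, 0.25, 1.0, 3.0]

def price_to_token(pct_return):
    """Map a daily % return onto a discrete price-movement token (hypothetical helper)."""
    return PRICE_TOKENS[bisect_right(PRICE_CUTS, pct_return)]
```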
```python
# Market Events
EVENT_TOKENS = [
    'CROSS_FLIP',         # GEX crosses zero
    'BREAK_CALL_WALL',    # Price breaks above call wall
    'BREAK_PUT_SUPPORT',  # Price breaks below put support
    'VOL_SPIKE',          # VIX > 20% daily move
    'OPEX_WEEK',          # Options expiration week
    'FOMC_WEEK',          # Federal Reserve meeting week
]
```
Sequence Builder Format
```python
# Example sequences for pattern mining
sequences = [
    ['GEX_MOD_NEG', 'CROSS_FLIP', 'BIG_DOWN', 'VOL_SPIKE', '->', 'CRASH'],
    ['GEX_EXTREME_POS', 'BREAK_CALL_WALL', 'OPEX_WEEK', '->', 'BIG_DOWN'],
    ['GEX_NEUTRAL', 'FOMC_WEEK', 'FLAT', 'FLAT', '->', 'VOL_SPIKE'],
]
```
Adaptive Binning Implementation
- Use rolling 252-day windows for percentile calculation
- Update thresholds monthly to adapt to changing market regimes
- Handle regime changes (2020 COVID, 2022 rate hikes)
- Validate token stability over time
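The rolling-window rules above might be sketched with pandas rolling quantiles, freezing thresholds at each month-end so bins update monthly; `rolling_gex_thresholds` is a hypothetical helper, not an existing module.

```python
import pandas as pd

def rolling_gex_thresholds(gex, window=252):
    """Rolling percentile cut points for GEX binning, frozen at each
    month-end so thresholds update monthly and adapt to regime shifts
    (illustrative sketch; `gex` is a daily Series with a DatetimeIndex)."""
    cuts = pd.DataFrame({
        f'p{p}': gex.rolling(window).quantile(p / 100)
        for p in (10, 40, 60, 90)
    })
    # Keep only month-end rows, then forward-fill so every trading day
    # uses the thresholds computed at the previous monthly update.
    monthly = cuts[cuts.index.is_month_end]
    return monthly.reindex(gex.index).ffill()
```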
Acceptance Criteria
- Clean modular structure in src/tokenization/ directory
- Well-defined token vocabulary with statistical backing
- Multi-timeframe sequence generation capability (5, 10, 20 day lookbacks)
- Robust handling of data gaps and missing values
- Optimized token efficiency for LLM context windows
- Integration with obfuscation tools for research integrity
- Context-aware sequences including market events
- Validation of sequence integrity and meaning
- Comprehensive documentation of tokenization schema
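One way to satisfy the gap-handling criterion is to emit an explicit token for missing days instead of silently dropping them, so sequence lengths stay aligned across features; `tokenize_with_gaps` and the `MISSING` token are assumptions for illustration.

```python
def tokenize_with_gaps(values, tokenizer, max_gap=3):
    """Tokenize a daily series, marking gaps explicitly (hypothetical helper).

    Emits 'MISSING' for absent observations; returns None when a sequence
    has too many gaps to be meaningful as training data.
    """
    tokens = ['MISSING' if v is None else tokenizer(v) for v in values]
    if tokens.count('MISSING') > max_gap:
        return None
    return tokens
```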
Implementation Notes
- Directory: src/tokenization/ (created during reorganization)
- Integrate with src/validation/data_obfuscation.py for LLM testing
- Target models: GPT-4o-mini (primary), GPT-4o (fallback)
- Focus on dealer hedging patterns in tokenized sequences
- Enable multi-timeframe pattern discovery beyond single indicators
- Use existing cache system for performance optimization
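The multi-timeframe generation described in these notes could be sketched as below: sliding lookback windows of 5, 10, and 20 daily tokens, an optional per-day context token (e.g. a days-to-opex marker), and the `'->' outcome` format used in the sequence examples. `build_sequences` and its arguments are illustrative names, not existing project APIs.

```python
def build_sequences(daily_tokens, outcomes, lookbacks=(5, 10, 20), context=None):
    """Assemble lookback sequences for pattern mining (hypothetical helper).

    daily_tokens: one token per day (e.g. GEX or price states)
    outcomes:     next-step outcome token aligned to each day
    context:      optional per-day context token, e.g. 'DTO_3' or 'OPEX_WEEK'
    """
    sequences = []
    for n in lookbacks:
        for i in range(n, len(daily_tokens)):
            seq = list(daily_tokens[i - n:i])
            if context is not None:
                seq.append(context[i])  # context token for the prediction day
            seq += ['->', outcomes[i]]
            sequences.append(seq)
    return sequences
```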