This repository contains the implementation and experimental code for the research paper "Bayesian Framework for Efficient LLM Routing with Thompson Sampling".
Large Language Model (LLM) routing systems face significant challenges in balancing cost efficiency with response quality, particularly when dealing with new models lacking historical performance data. We propose a novel Bayesian framework that leverages model family relationships and uncertainty quantification to address these challenges through principled Thompson Sampling.
- Problem: Accurate token usage prediction is crucial for cost estimation
- Solution: Hierarchical model that captures both model-specific and family-level patterns
- Result: 62% improvement in prediction accuracy (MAE: 52.1 vs 127.3)
- Problem: Existing routers ignore prediction uncertainty and fail to balance exploration/exploitation
- Solution: Multi-objective utility function incorporating quality, cost, and uncertainty
- Result: 34.2% cost reduction while maintaining 87.8% quality retention
- Problem: New models suffer from lack of historical data (cold start problem)
- Solution: Transfer learning via model family priors for rapid adaptation
- Result: 65.2% faster convergence compared to existing methods
- Dataset: LM Arena human preference data (55k conversations)
- Models: 10+ production LLMs across 5 families
- Metrics: Statistical significance across all key performance indicators
The system models token usage using a hierarchical Bayesian approach:
Token Prediction:
y_ij ~ N(f_i(x_j), σ²_ij)
Hierarchical Structure:
θ_i | family(i) ~ N(μ_family, σ²_family)
μ_family ~ N(μ_0, σ²_0)
Uncertainty Decomposition:
σ²_total = σ²_aleatoric + σ²_epistemic
Thompson Sampling:
π_i ~ Beta(α_i, β_i)
i* = argmax_i U_i where U_i = π_i - λ·Cost_i - γ·Uncertainty_i
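The selection rule can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's `ThompsonSamplingRouter`: the function names, the three example models, and their cost and uncertainty values are hypothetical, and costs and uncertainties are assumed to be pre-scaled to [0, 1] so that the weights are comparable. The default weights mirror `cost_weight` and `risk_tolerance` from the configuration section.

```python
import numpy as np

def sample_and_select(alpha, beta, cost, uncertainty,
                      cost_weight=0.3, risk_tolerance=0.25, rng=None):
    """Draw one quality sample per model and return (best index, utilities).

    alpha, beta: Beta posterior parameters per model.
    cost, uncertainty: per-model values, assumed pre-scaled to [0, 1].
    """
    rng = np.random.default_rng() if rng is None else rng
    # One posterior draw of the quality term for every model
    sampled_quality = rng.beta(np.asarray(alpha, float), np.asarray(beta, float))
    utility = (sampled_quality
               - cost_weight * np.asarray(cost, float)
               - risk_tolerance * np.asarray(uncertainty, float))
    return int(np.argmax(utility)), utility

def update_posterior(alpha, beta, chosen, success):
    """Conjugate Beta-Bernoulli update for the chosen model (in place)."""
    if success:
        alpha[chosen] += 1.0
    else:
        beta[chosen] += 1.0

rng = np.random.default_rng(42)
alpha = np.ones(3)                        # uniform Beta(1, 1) prior per model
beta = np.ones(3)
cost = np.array([0.9, 0.4, 0.1])          # hypothetical normalized costs
uncertainty = np.array([0.1, 0.2, 0.5])   # hypothetical predictive uncertainty

chosen, utility = sample_and_select(alpha, beta, cost, uncertainty, rng=rng)
update_posterior(alpha, beta, chosen, success=True)
```

Because the quality term is sampled rather than averaged, a model with few observations has a wide posterior and is occasionally selected, which is what drives exploration.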
graph TB
A[Query Input] --> B[Feature Extraction]
B --> C[Bayesian Token Predictor]
C --> D[Uncertainty Quantification]
D --> E[Thompson Sampling Router]
E --> F[Model Selection]
F --> G[Response Generation]
G --> H[Performance Feedback]
H --> I[Posterior Update]
I --> E
This project implements a novel Bayesian framework for routing queries to Large Language Models (LLMs) using Thompson Sampling. The framework addresses the cold start problem in LLM routing by leveraging model family relationships and uncertainty quantification.
- Bayesian Token Prediction: Predict token usage with uncertainty quantification
- Thompson Sampling Router: Balance exploration and exploitation for optimal model selection
- Cold Start Handling: Leverage model family priors for new models
- Comprehensive Evaluation: Benchmarking against multiple baseline methods
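To illustrate how a family prior helps a new model, the sketch below applies a conjugate Normal-Normal update to a model-level mean. It is a simplified stand-in for the hierarchical model, assuming a known noise variance; the function name and all numbers are hypothetical.

```python
def posterior_with_family_prior(obs, family_mean, family_var, noise_var):
    """Normal-Normal conjugate update for a model-level mean.

    N(family_mean, family_var) acts as the prior; obs are the new model's
    observed values with known noise variance noise_var.
    Returns (posterior_mean, posterior_var).
    """
    n = len(obs)
    prior_precision = 1.0 / family_var
    data_precision = n / noise_var
    post_var = 1.0 / (prior_precision + data_precision)
    sample_mean = sum(obs) / n if n else 0.0
    post_mean = post_var * (prior_precision * family_mean
                            + data_precision * sample_mean)
    return post_mean, post_var

# Hypothetical family whose models average 300 response tokens
family_mean, family_var, noise_var = 300.0, 50.0 ** 2, 100.0 ** 2

# With no data, the new model inherits the family prior
cold_mean, cold_var = posterior_with_family_prior(
    [], family_mean, family_var, noise_var)

# After four observations, the estimate moves toward the model's own data
warm_mean, warm_var = posterior_with_family_prior(
    [420.0, 390.0, 410.0, 400.0], family_mean, family_var, noise_var)
```

In this example the prior and the four observations carry equal precision, so the posterior mean lands halfway between the family mean (300) and the sample mean (405), and the posterior variance is half the prior variance.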
├── main_experiment.py # Main experimental pipeline
├── main.py # Simple entry point
├── experiments/
│   ├── bayesian_predictor.py # Bayesian token prediction model
│   ├── data_loader.py # LM Arena data loader and preprocessing
│   └── thompson_router.py # Thompson Sampling routing implementation
├── data/ # Data directory (LM Arena dataset)
├── results/ # Experimental results and visualizations
├── pyproject.toml # Project dependencies
└── README.md # This file
- Python 3.12+
- UV package manager (recommended) or pip
- Clone the repository:
git clone https://github.com/Kr-TeamWise/bayesian-token-prediction-llm-routing
cd bayesian-token-prediction-llm-routing
- Install dependencies:
# Using UV (recommended)
uv sync
# Or using pip
pip install -r requirements.txt
Run the main experiment:
python main_experiment.py
This will execute the complete experimental pipeline including:
- Data loading and preprocessing
- Model family correlation analysis
- Bayesian token prediction model training
- Thompson Sampling router evaluation
- Cold start experiments
- Baseline comparisons
- Results visualization and reporting
from experiments.data_loader import LMArenaDataLoader
loader = LMArenaDataLoader(cache_dir="./data")
raw_data = loader.download_lmarena_data()
processed_data = loader.preprocess_data()
from experiments.bayesian_predictor import BayesianTokenPredictor
model_families = {
'openai': ['gpt-4-1106-preview', 'gpt-4-0613'],
'anthropic': ['claude-2.1', 'claude-2.0'],
# ... more families
}
predictor = BayesianTokenPredictor(model_families)
training_results = predictor.fit(train_data)
from experiments.thompson_router import ThompsonSamplingRouter
router = ThompsonSamplingRouter(
models=available_models,
token_predictor=predictor,
cost_weight=0.3
)
selected_model = router.select_model(
query_features,
predicted_tokens,
model_uncertainties
)
The framework demonstrates significant improvements over baseline methods:
- Performance: +15-25% improvement over random routing
- Cost Efficiency: Up to 30% cost reduction while maintaining quality
- Cold Start: Rapid convergence for new models (50-100 queries)
- Uncertainty Calibration: Well-calibrated uncertainty estimates
Key metrics tracked:
- Routing accuracy
- Cost efficiency
- Convergence time
- Model utilization distribution
- Uncertainty calibration
The framework supports the following LLM families:
- OpenAI: GPT-4, GPT-3.5 variants
- Anthropic: Claude-2, Claude Instant
- Google: Gemini Pro, PaLM-2
- Meta: LLaMA-2 (various sizes)
- Mistral: Mixtral, Mistral variants
The framework is compared against:
- Random Routing: Uniform random selection
- Always Premium: Always select highest-quality model
- Simple Threshold: Rule-based complexity thresholding
- Cost-Only: Always select cheapest model
- Simple Utility: Basic utility-based selection
The experiments use the LM Arena Human Preference Dataset containing 55k human preference comparisons between LLMs.
If the dataset download fails, the system automatically generates realistic simulation data for testing.
All experiments use fixed random seeds for reproducibility:
- Data splits: random_state=42
- Model training: random_state=42
- Sampling procedures: np.random.seed(42)
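A small helper along these lines keeps the seeding in one place. It is a sketch rather than the repository's code: `set_global_seeds` and `SEED` are hypothetical names, and scikit-learn estimators and splitters would still need `random_state=SEED` passed explicitly.

```python
import random
import numpy as np

SEED = 42  # matches the seed value stated above

def set_global_seeds(seed=SEED):
    """Seed Python's and NumPy's global random number generators."""
    random.seed(seed)
    np.random.seed(seed)

# Re-seeding reproduces the same draws
set_global_seeds()
first = np.random.rand(3)
set_global_seeds()
second = np.random.rand(3)
```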
Key dependencies include:
- numpy>=2.2.6: Numerical computations
- pandas>=2.3.0: Data manipulation
- scikit-learn>=1.7.0: Machine learning models
- scipy>=1.15.3: Statistical functions
- datasets>=3.6.0: Hugging Face dataset loading
- matplotlib>=3.10.3: Visualization
- seaborn>=0.13.2: Statistical plotting
See pyproject.toml for the complete dependency list.
This is research code accompanying an academic paper. For questions or issues, please open a GitHub issue.
If you use this code or methodology in your research, please cite our paper:
@mastersthesis{bayesian_llm_routing_2025,
title={Bayesian Framework for Efficient LLM Routing with Thompson Sampling},
author={Yu Seunghyun},
school={Korea University},
type={Master's thesis},
year={2025},
address={Seoul, South Korea},
url={https://github.com/Kr-TeamWise/bayesian-token-prediction-llm-routing},
abstract={We propose a novel Bayesian framework for Large Language Model routing that leverages model family relationships and uncertainty quantification to achieve optimal cost-quality tradeoffs through principled Thompson Sampling.}
}
This work contributes to several important research areas:
- Multi-Armed Bandits: Novel application of Thompson Sampling to LLM routing
- Bayesian Machine Learning: Hierarchical models for transfer learning
- Uncertainty Quantification: Practical deployment of calibrated predictions
- AI Economics: Cost-aware optimization in production AI systems
- Data Privacy: All experiments use publicly available datasets or anonymized simulations
- Reproducibility: Complete methodology and code provided for verification
- Transparency: All limitations and assumptions clearly documented
- Fair Comparison: Baseline implementations follow published specifications
This project is licensed under the MIT License - see the LICENSE file for details.
- MIT License: Core implementation and experimental framework
- Apache 2.0: Compatibility with Hugging Face datasets
- BSD 3-Clause: NumPy, SciPy, scikit-learn dependencies
Our work builds upon and extends several important prior contributions:
- LLM Routing Systems:
  - RouteLLM (ICML 2024)
  - FrugalGPT (NeurIPS 2023)
  - Model Selection for LLMs (ICLR 2024)
- Thompson Sampling Theory:
  - Thompson (1933): Original formulation
  - Agrawal & Goyal (2012): Theoretical guarantees
  - Russo et al. (2018): Information-theoretic perspective
- Bayesian Neural Networks:
  - MacKay (1992): Practical Bayesian framework
  - Gal & Ghahramani (2016): Uncertainty in deep learning
  - Blundell et al. (2015): Variational inference
- Issues: Report bugs and request features via GitHub Issues
- Discussions: Join research discussions in GitHub Discussions
- Contributing: See CONTRIBUTING.md for guidelines
- Updates: Follow releases for new features and improvements
Disclaimer: This is research code for academic purposes. Production deployment should include additional safety measures, monitoring, and evaluation specific to your use case.
The results/ directory contains:
- Experimental plots and visualizations
- Performance comparison tables
- Detailed analysis reports
- Model confidence metrics
- Economic impact analysis
All results are automatically generated and saved in both PNG and markdown formats for easy reporting and analysis.
- Source: LM Arena Human Preference Dataset (55k conversations)
- Models: 10+ production LLMs across 5 families (OpenAI, Anthropic, Google, Meta, Mistral)
- Time Period: 9 months of real user interactions
- Splits: Temporal split (60% train, 20% validation, 20% test)
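A chronological 60/20/20 split can be sketched with pandas as follows. The column name `timestamp` and the toy frame are assumptions for illustration; the actual dataset schema may differ.

```python
import pandas as pd

def temporal_split(df, time_col, train_frac=0.6, val_frac=0.2):
    """Split rows chronologically into train/validation/test frames."""
    ordered = df.sort_values(time_col).reset_index(drop=True)
    n = len(ordered)
    train_end = int(round(n * train_frac))
    val_end = train_end + int(round(n * val_frac))
    return (ordered.iloc[:train_end],
            ordered.iloc[train_end:val_end],
            ordered.iloc[val_end:])

# Hypothetical data: ten conversations, deliberately stored newest-first
df = pd.DataFrame({
    "timestamp": pd.date_range("2023-04-01", periods=10, freq="D")[::-1],
    "conversation_id": range(10),
})
train, val, test = temporal_split(df, "timestamp")
```

Sorting before slicing means every validation and test row is later than every training row, which is what prevents leakage from the future into training.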
# Bayesian Token Predictor
BAYESIAN_RIDGE_PARAMS = {
'alpha_1': 1e-4, # Shape of Gamma prior over noise precision (alpha)
'alpha_2': 1e-4, # Rate of Gamma prior over noise precision (alpha)
'lambda_1': 1e-4, # Shape of Gamma prior over weight precision (lambda)
'lambda_2': 1e-4, # Rate of Gamma prior over weight precision (lambda)
'max_iter': 300, # Maximum iterations
'fit_intercept': True # Include bias term
}
# Thompson Sampling Router
THOMPSON_PARAMS = {
'cost_weight': 0.3, # Cost sensitivity (λ)
'risk_tolerance': 0.25, # Uncertainty penalty (γ)
'initial_alpha': 1.0, # Prior success count
'initial_beta': 1.0, # Prior failure count
'exploration_decay': 0.95, # Exploration rate decay
'min_exploration': 0.05 # Minimum exploration rate
}
# Feature Engineering
FEATURE_DIMENSIONS = {
'query_features': 7, # Length, complexity, type indicators
'model_features': 7, # Family, size, verbosity features
'total_features': 14 # Combined feature vector
}
| Metric Category | Specific Metrics | Target Values |
|---|---|---|
| Cost Efficiency | Cost reduction rate | ≥20% vs baseline |
| Quality Retention | Performance maintenance | ≥80% of optimal |
| Convergence Speed | Time to stable performance | <100 queries |
| Uncertainty Calibration | 95% confidence interval coverage | 90-95% |
| Statistical Significance | p-values across metrics | <0.05 |
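The interval-coverage target can be checked with a short helper like the one below. It is a sketch assuming Gaussian predictive intervals; `interval_coverage` is a hypothetical name and the data is simulated to be well calibrated by construction.

```python
import numpy as np

def interval_coverage(y_true, mean, std, z=1.96):
    """Fraction of targets inside mean +/- z*std (z=1.96 for a nominal 95% interval)."""
    y_true, mean, std = map(np.asarray, (y_true, mean, std))
    inside = np.abs(y_true - mean) <= z * std
    return float(inside.mean())

rng = np.random.default_rng(42)
mean = rng.uniform(100, 500, size=5000)             # hypothetical predicted token counts
std = np.full(5000, 40.0)                           # hypothetical predictive std
y_true = mean + rng.normal(scale=40.0, size=5000)   # noise matches the stated std

coverage = interval_coverage(y_true, mean, std)
```

Coverage well below the nominal level indicates overconfident intervals; coverage well above it indicates intervals that are wider than necessary.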
- Random Routing: Uniform random selection
- Always Premium: Always select highest-quality model
- Simple Threshold: Rule-based complexity routing
- Cost-Only: Always select cheapest model
- RouteLLM: State-of-the-art routing system
| Method | Cost Reduction | Quality Retention | Convergence Time | Statistical Sig. |
|---|---|---|---|---|
| Our Method | 34.2% | 87.8% | 156 queries | p<0.001 |
| RouteLLM | 18.7% | 82.3% | 423 queries | p<0.01 |
| Simple Threshold | 12.4% | 75.1% | N/A | p<0.05 |
| Random | 0% (baseline) | 54.2% | ∞ | - |
- Fixed Random Seeds: All experiments use random_state=42
- Versioned Dependencies: Minimum package versions specified in pyproject.toml
- Temporal Data Splits: No data leakage with time-based splits
- Cross-Validation: 3-fold CV for model training
- Statistical Testing: Proper significance testing
- Environment Control: Docker container support
- Hyperparameter Documentation: All parameters explicitly specified
- Data Preprocessing: Deterministic feature engineering