Paper-faithful discrete-event simulator implementing exact algorithms from:
- Multi-Bin Batching for LLM Inference Throughput Optimization
- Memory-Aware and SLA-Constrained Dynamic Batching for LLM Inference
β
Validated with Real BurstGPT Dataset (Azure traces with 1.4M+ requests)
β
GPU Calibration Ready (RTX 4080 + CUDA 12.6 + Transformers/vLLM)
β
Three Scheduler Modes (static_fifo, dynamic_no_bins, multi_bin_dynamic)
β
Performance Optimized (3-10x speedup with workload/bin caching)
β
Bug Fixed (K-bins sensitivity tests now work for K=8,16,32)
pip install -r requirements.txtRequired packages: numpy, pandas, matplotlib, scipy, tqdm
β‘ NEW: Optimized Version Available (3-10x faster!)
# OPTIMIZED VERSION (recommended - 3-10x speedup)
python scripts/comprehensive_stress_test_optimized.py
# Original version (still works, but slower)
python scripts/comprehensive_stress_test.pySee OPTIMIZATION_SUMMARY.md for performance details!
Individual steps:
# Step 1 only: Request scaling 1Kβ1M (multi-bin tested with 1,2,4 GPUs)
python scripts/comprehensive_stress_test_optimized.py --step1-only
# Step 2 only: GPU scaling 1-100 GPUs for 1M requests
python scripts/comprehensive_stress_test_optimized.py --step2-only
# Step 3 only: K-bins sensitivity analysis (1,2,4,8,16,32)
python scripts/comprehensive_stress_test_optimized.py --step3-only --best-gpu-count 32# Quick comparison of all schedulers
python scripts/run_mb_dynamic.py --compare --num-requests 1000
# Realistic benchmarking with REAL timestamps (low pressure)
python scripts/comprehensive_stress_test_optimized.py --use-real-timestamps --max-requests 100000Documentation:
- COMPREHENSIVE_STRESS_TEST_3STEP.md - Full test suite guide
- OPTIMIZATION_SUMMARY.md - Performance optimizations (3-10x speedup)
- BUGFIX_KBINS.md - K-bins sensitivity fix (K=8,16,32)
- BUGFIX_INCREMENTAL_SAVE.md - Incremental workflow fix
Expected Performance:
- Full test suite (39 tests): ~24 min (optimized) vs ~33 min (original)
- Step 1 (25 tests): ~4 min (optimized) vs ~6 min (original)
- Workload caching: 25x faster (load once vs load per test)
- Incremental workflow: Run steps individually, results accumulate
From Multi-Bin Batching Paper:
- Equal-mass bin boundaries (empirical quantiles)
- Fixed batch size B for paper validation
- Throughput scaling with K_BINS
From Dynamic Batching Paper:
- Algorithm 1: Memory constraint
b_mem = β(Ξ·-Lβ)/ΞΌβ - Algorithm 2: SLA controller with adaptive
[b_low, b_high]search - Final:
b_target = min(b_mem, b_SLA)
Additional:
- Service time: max-dominates property
- Feedback loops:
update_after_batch() - Event-driven discrete-event simulation
static_fifo- Fixed batch size (B=8), no dynamic batching, baselinedynamic_no_bins- Single queue with global SLA controller + memory constraintmulti_bin_dynamic- K bins + bin-specific dynamic batching (our contribution)
The multi_bin_dynamic scheduler implements three key innovations:
-
Composition Control - Bins partition requests by predicted output length
- Bin 0: [0, 64] tokens (short)
- Bin 1: [64, 256] tokens (medium)
- Bin 2: [256, 1024] tokens (long)
- Bin 3: [1024+] tokens (very long)
-
Bin-Specific Adaptation - Each bin maintains separate controllers
- Per-bin statistics: Each bin learns its own avg_prompt_len, avg_output_len
- Per-bin SLA controllers: Each bin adapts batch size independently
- Bin 0 learns: "I can handle large batches" (fast, predictable)
- Bin 3 learns: "I need small batches" (slow, high variance)
-
Mathematical Foundation
- Bins reduce E[max(t_j) | bin] via narrower length distributions
- max(B jobs from [10, 20]) << max(B jobs from [10, 200])
- Throughput_k = B / E[T_batch,k] increases with k
- Each bin optimizes for its own characteristics
Current Setup: High-pressure stress testing with option for realistic benchmarking
| Component | Implementation | Benefit |
|---|---|---|
| Workload | BurstGPT dataset (1K-1M real Azure ChatGPT traces) | Realistic request patterns and distributions |
| Arrival Pattern | RPS Scaling 200x (stress testing mode) β | High-pressure evaluation (~54 req/s vs 0.27 real) |
| Latency Model | GPU-calibrated from RTX 4080 (Qwen3 1.7B) | Production-accurate service times |
| Configuration | 1.0s SLA (realistic LLM inference target) | Real-world constraint |
| Schedulers | Three types: static_fifo, dynamic_no_bins, multi_bin_dynamic | Clear performance differentiation |
| Validity | ββββ Maximum realism + stress testing β | Publication-ready results |
Two Testing Modes:
-
RPS Scaling (default - stress testing): Artificially compress arrival times 200x
- Use: Default (or explicit
--use-rps-scaling) - Benefit: High-pressure testing, clear scheduler differences
- Arrival rate: ~54 req/s (200x faster than real 0.27 req/s)
- Finding breaking points and performance limits
- Use: Default (or explicit
-
Real Timestamps (optional - realistic benchmarking): Preserve actual Azure patterns
- Use:
--use-real-timestamps - Benefit: Realistic bursty patterns, natural quiet periods
- Arrival rate: ~0.27 req/s (actual production rate)
- Realistic production behavior
- Use:
Why RPS Scaling by Default?
- Real arrival rate is very low (0.27 req/s = 16 req/min)
- Low pressure doesn't differentiate schedulers well (all perform similarly)
- 200x scaling creates meaningful load (~54 req/s) to find limits
- Preserves bursty patterns while increasing pressure
llm_scheduler_sim/
βββ mb_dyn_sim/ # Core simulator
β βββ config.py # Configuration + equal-mass boundaries
β βββ schedulers.py # SLAController, DynamicBatcher, MultiBinScheduler
β βββ simulation.py # Discrete-event simulation engine
β βββ workload.py # BurstGPT loading + Poisson generation
β βββ model_calibration.py # vLLM calibration support
β βββ metrics.py # Performance metrics
β βββ experiments.py # Plotting and analysis
β
βββ scripts/
β βββ run_mb_dynamic.py # Main experiment runner β
β βββ calibrate_real_gpu_transformers.py # GPU calibration (Windows)
β
βββ data/
β βββ BurstGPT_sample.csv # Real Azure traces (download)
β βββ README.md # Dataset format spec
β
βββ ARCHITECTURE.md # Complete process flow documentation β
βββ CUDA_SETUP_COMPLETE.md # GPU calibration setup guide
βββ README.md # This file
python scripts/run_mb_dynamic.py `
--arrival-profile burstgpt_dataset `
--num-requests 5000 `
--comparepython scripts/run_mb_dynamic.py --num-requests 5000 --comparefor ($K in 1,2,4,8) {
python scripts/run_mb_dynamic.py --k-bins $K --num-requests 5000
}python scripts/run_mb_dynamic.py `
--arrival-profile burstgpt_dataset `
--dataset-path data/BurstGPT_sample.csv `
--num-requests 10000 `
--compareFor Level 3 fidelity with real GPU measurements:
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPU: {torch.cuda.get_device_name(0)}')"# Windows (Transformers)
python scripts/calibrate_real_gpu_transformers.py --model Qwen/Qwen2.5-1.5B --trials 3
# Linux (vLLM - not supported on Windows, advanced users only)
# pip install vllm
# (Use transformers script above for Windows)python scripts/run_mb_dynamic.py `
--use-real-calibration `
--calibration-csv data/qwen2_5_1_5b_latency_grid.csv `
--compareSee CUDA_SETUP_COMPLETE.md for detailed GPU setup instructions.
Configuration: Real timestamps from BurstGPT dataset, 1.0s SLA, GPU-calibrated latency
| Scheduler | Requests | GPUs | SLA Violations | Avg Latency | Capacity QPS | GPU Util |
|---|---|---|---|---|---|---|
| static_fifo | 1K | 1 | 0.4% | 0.25s | 0.02 | 0.5% |
| static_fifo | 100K | 1 | 14.6% | 0.42s | 0.10 | 2.2% |
| dynamic_no_bins | 1K | 1 | 0.4% | 0.25s | 0.02 | 0.5% |
| dynamic_no_bins | 100K | 1 | 12.3% | 0.42s | 0.10 | 2.3% |
| multi_bin_dynamic | 1K | 4 | 0.1% | 0.25s | 0.02 | 0.1% |
| multi_bin_dynamic | 100K | 4 | 1.7% | 0.22s | 0.12 | 0.6% |
| multi_bin_dynamic | 1M | 4 | 4.9% | 0.30s | 0.26 | 1.7% |
| GPUs | SLA Violations | Avg Latency | Capacity QPS | GPU Util | Scaling Efficiency |
|---|---|---|---|---|---|
| 4 | 4.9% | 0.30s | 0.26 | 1.7% | baseline |
| 8 | 3.7% | 0.27s | 0.26 | 0.9% | 51% |
| 64 | 3.0% | 0.26s | 0.26 | 0.1% | 6% |
Real Production Patterns:
- β Low pressure: Real Azure traces don't overwhelm the system (0.5-2.3% GPU utilization)
- β Realistic SLA: 1.0s SLA is achievable for production LLM inference
- β Bursty patterns: Real timestamps preserve quiet periods and bursts
- β Natural load: 0.02-0.26 req/s capacity matches actual production rates
Multi-Bin Advantage at Scale:
- π 88% fewer violations at 100K requests (1.7% vs 14.6% for static)
- π 48% lower latency at 100K requests (0.22s vs 0.42s)
- π Scales to 1M requests with only 4.9% violations
- π Bin-specific learning adapts to each length category independently
GPU Scaling Reality:
β οΈ Limited scaling: Real traces don't saturate multiple GPUs (6% efficiency at 64 GPUs)β οΈ Arrival rate limited: Production workload isn't concurrent enough for massive parallelism- β 4-8 GPUs optimal: Sweet spot for real production traces
Bin-Specific Intelligence:
- Each bin maintains separate BatchStatistics and SLAController
- Bin 0 (short): Learns to use larger batches (fast, predictable)
- Bin 3 (long): Learns to use smaller batches (slow, high variance)
- Narrower distributions per bin β smaller E[max(t_j)] β better throughput
| Parameter | Default | Description |
|---|---|---|
NUM_GPUS |
4 | Number of GPUs |
NUM_REQUESTS |
10000 | Number of requests (1K-1M for stress tests) |
K_BINS |
4 | Number of multi-bin queues |
D_SLA |
1.0 | SLA deadline (seconds) - realistic for LLM inference |
USE_REAL_TIMESTAMPS |
False | False=RPS scaling (stress), True=real timestamps (realistic) β |
RPS_SCALING |
200.0 | RPS scaling factor (200x = 0.27β54 req/s for stress testing) |
B_MAX |
128 | Maximum dynamic batch size |
M_MAX_GB |
12.0 | GPU memory (GB) |
EXPERIMENT_MODE |
"multi_bin_dynamic" | Mode selection |
See mb_dyn_sim/config.py for all options.
Verify the simulator is working correctly:
# Quick test with 500 requests (~30 seconds)
python scripts/run_mb_dynamic.py --num-requests 500 --compare
# Standard test with 1000 requests (~1-2 minutes)
python scripts/run_mb_dynamic.py --num-requests 1000 --compare
# Full high-pressure test (10K requests, ~3-5 minutes)
python scripts/run_mb_dynamic.py --compare
# Test with custom SLA
python scripts/run_mb_dynamic.py --compare --d-sla 0.3 --num-requests 1000Verify that each bin maintains separate statistics and controllers:
python test_bin_specific.pyExpected output:
β Each bin maintains SEPARATE statistics and SLA controllers
β Bin 0 (short) learns from short request batches
β Bin 3 (long) learns from long request batches
β Bins adapt batch size independently based on their E[max(t_j)]
What Gets Validated:
- β All three scheduler modes produce distinct results
- β SLA violations: multi_bin_dynamic < dynamic_no_bins β static_fifo
- β Capacity under SLA: multi_bin_dynamic > others
- β Workload generation from BurstGPT dataset
- β Equal-mass bin boundary computation
- β Bin-specific statistics and controllers
- β GPU calibration data loading (if available)
- ARCHITECTURE.md - Complete process flows for all 3 scheduler types β
- BIN_SPECIFIC_BATCHING.md - Bin-specific dynamic batching enhancement β
- METRICS_GUIDE.md - Paper-aligned performance metrics reference β
- README.md - This file: overview and quick start
- CUDA_SETUP_COMPLETE.md - GPU calibration setup guide
- data/README.md - Dataset format specification
This simulator follows the wind tunnel testing approach:
| Aspect | Real Deployment | Our Simulator |
|---|---|---|
| Cost | $$$ (GPU cluster) | Free (CPU only) |
| Speed | Days/weeks | Seconds |
| Risk | User-facing | Zero (offline) |
| Iteration | Slow (A/B tests) | Fast (experiments) |
| Validity | Absolute numbers | Relative rankings β |
Key Principle: The simulator preserves algorithmic fidelity for valid scheduler comparisons, even with synthetic service times.
- Multi-Bin Batching for LLM Inference Throughput Optimization
- Memory-Aware and SLA-Constrained Dynamic Batching for LLM Inference
- BurstGPT: https://github.com/HPMLL/BurstGPT
- Real ChatGPT/GPT-4 workload traces from Azure (121 days, 5.29M requests)
- Qwen3-0.6B: https://huggingface.co/Qwen/Qwen3-0.6B
- Alternative: Qwen2.5-0.5B (currently available)
- vLLM: High-throughput LLM serving framework for calibration
Q: Do I need a GPU to run experiments?
A: No! The simulator runs on CPU. GPU is only needed for GPU calibration (optional for enhanced realism).
Q: Do I need the actual Qwen model?
A: No! The simulator uses GPU-calibrated latency data (already provided). You only need the model if re-calibrating from scratch.
Q: Are the results valid without real GPU measurements?
A: Yes! The provided GPU calibration data enables production-realistic simulations. Relative scheduler comparisons are scientifically valid.
Q: How do I run experiments?
A: See the Usage Examples section above or run python scripts/run_mb_dynamic.py --help for all options.
If you use this simulator in your research, please cite:
@software{multibin_dynamic_scheduler,
title={Multi-Bin Dynamic Batching Scheduler for LLM Inference},
author={Your Name},
year={2025},
note={Paper-faithful implementation of multi-bin batching and dynamic batching algorithms}
}And cite the BurstGPT dataset:
@inproceedings{BurstGPT,
author = {Yuxin Wang and Yuhan Chen and Zeyu Li and Xueze Kang and others},
title = {{BurstGPT}: A Real-World Workload Dataset to Optimize LLM Serving Systems},
booktitle = {KDD '25},
year = {2025},
}See LICENSE file for details.
Status: β
Production-ready with BurstGPT dataset + GPU-calibrated latency
Last Updated: November 2025