Marathon Mode: Run overnight evaluation of ALL your local models with ALL 16 benchmarks
Supporting Ollama, LM Studio, and Cloud APIs with comprehensive AI studio-level evaluation!
smaLLMs is a comprehensive local and cloud LLM evaluation platform, supporting 16 benchmarks of the kind used by OpenAI, Anthropic, Google DeepMind, and xAI. Evaluate your Ollama and LM Studio models with the same rigor as the top AI labs.
- Marathon Mode: Overnight evaluation of ALL local models with ALL benchmarks
- 16 AI Studio Benchmarks: Complete suite including AIME, GPQA, Codeforces, HealthBench
- Local + Cloud: Seamlessly works with Ollama (15 models), LM Studio (8 models), and cloud APIs
- One Command: `python smaLLMs.py` - everything integrated into a single file
- Cost-Optimized: Smart sampling and rate limiting for efficient evaluation
- Beautiful Interface: Real-time progress with color-coded results
- Production-Ready: Battle-tested evaluation methodology
- Organized Results: Date-based structure with clean exports
- Windows Compatible: Full Unicode support and robust error handling
Run overnight comprehensive evaluation of ALL your models:
- Auto-Discovery: Finds all 23+ local models (Ollama + LM Studio)
- Smart Selection: Choose specific models or run ALL discovered models
- Benchmark Suites: 18 different benchmark combinations to choose from
- Progress Tracking: Real-time updates and resume capability
- Organized Results: Clean date/time-based result organization
- Windows Compatible: Full Unicode support and robust timeout handling
Complete benchmark suite matching major AI companies:
- AIME 2024/2025: American Invitational Mathematics Examination (o3/o4 level)
- GPQA Diamond: PhD-level science questions (Google-Proof Q&A)
- Codeforces: Competitive programming with Elo ratings
- HLE: Humanity's Last Exam - Expert cross-domain evaluation
- HealthBench: Medical conversation safety (includes Hard variant)
- TauBench: Function calling and tool use evaluation
- GSM8K: Grade school mathematics reasoning
- MMLU: Massive multitask language understanding
- MATH: Mathematical reasoning and competition problems
- HumanEval: Code generation and programming capabilities
- ARC: AI2 Reasoning Challenge - grade-school science reasoning
- HellaSwag: Commonsense reasoning
- WinoGrande: Winograd schema challenge
- BoolQ: Boolean question answering
- OpenBookQA: Multi-step reasoning with facts
- PIQA: Physical interaction question answering
- Individual Benchmarks: Any single benchmark (16 options)
- OpenAI Suite: Complete o3/o4 benchmark set
- Competition Suite: AIME + Codeforces + MATH
- Expert Suite: GPQA + HLE + HealthBench
- Academic Suite: MMLU + GSM8K + HumanEval
- Reasoning Suite: ARC + HellaSwag + WinoGrande
- Comprehensive Suite: Best 8-benchmark coverage
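For illustration, each suite above can be thought of as a named list of benchmark identifiers. The sketch below is a minimal example of that mapping; the suite keys and benchmark names are assumptions for illustration and may not match smaLLMs' internal identifiers:

```python
# Hypothetical suite definitions mirroring the options above.
# The actual keys used inside smaLLMs may differ.
BENCHMARK_SUITES = {
    "competition": ["aime_2024", "aime_2025", "codeforces", "math"],
    "expert": ["gpqa_diamond", "hle", "healthbench"],
    "academic": ["mmlu", "gsm8k", "humaneval"],
    "reasoning": ["arc", "hellaswag", "winogrande"],
}

def resolve_suite(name):
    """Return the benchmarks for a suite, or treat the name as a single benchmark."""
    return BENCHMARK_SUITES.get(name, [name])

# resolve_suite("expert")  -> ["gpqa_diamond", "hle", "healthbench"]
# resolve_suite("boolq")   -> ["boolq"]
```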
- llama3.2 - Meta's latest compact models
- qwen2.5 - Alibaba's optimized instruction models
- qwen2.5-coder - Specialized coding variants
- granite3.2 - IBM's enterprise-ready models
- deepseek-r1 - Reasoning-focused models
- gemma-3 - Google's efficient instruction models
- liquid - High-performance compact models
- And 8+ more automatically discovered
- Meta Llama variants - Multiple sizes and versions
- Qwen2.5 series - Instruction and coder variants
- Google Gemma models - Various parameter sizes
- Granite models - IBM's latest offerings
- DeepSeek variants - Reasoning and general models
All models from the original cloud configuration still supported for comparison.
- Python 3.8+: Modern async/await patterns for concurrent evaluation
- HuggingFace Hub: Direct API integration for model inference
- Datasets Library: Standardized benchmark data loading
- AsyncIO: Non-blocking concurrent model evaluation
- YAML: Human-readable configuration management
- Pandas: Data manipulation and analysis
- NumPy: Numerical computing for metrics calculation
- SciPy: Statistical analysis and significance testing
- Matplotlib/Seaborn: Data visualization for reports
- Gradio: Optional web interface for interactive evaluation
- FastAPI: REST API for programmatic access
- Beautiful Terminal: Custom ANSI-colored terminal interface
- HTML Export: Static website generation from results
- Custom Benchmarks: Modular benchmark system
- Async Model Manager: Efficient model loading and inference
- Result Aggregation: Statistical analysis and ranking
- Cost Estimation: Real-time API cost tracking
```bash
git clone https://github.com/mmdmcy/smaLLMs.git
cd smaLLMs
pip install -r requirements.txt
```

```bash
# Install Ollama (if you want local models)
# Windows: Download from https://ollama.ai
# Then pull some models:
ollama pull llama3.2
ollama pull qwen2.5:0.5b
ollama pull granite3.2:2b

# Or use LM Studio: Download from https://lmstudio.ai
```

```bash
# Only needed if using cloud models
cp config/config.example.yaml config/config.yaml
# Add your HuggingFace token to config/config.yaml
```

```bash
python smaLLMs.py
```

Marathon Mode Options:
- Local: Auto-discover and evaluate all Ollama + LM Studio models
- Cloud: Evaluate HuggingFace models (requires config)
- Choose Models: Select specific models from 23+ discovered
- Choose Benchmarks: Pick from 18 benchmark suite options
- Run ALL: Overnight evaluation of everything!
```bash
python simple_exporter.py
```

Generate beautiful websites, leaderboards, and analysis reports from your Marathon Mode results.
| Setup | Models | Benchmarks | Samples | Duration | Use Case |
|---|---|---|---|---|---|
| Quick Test | 3 local | 2 core | 25 | ~15 min | Testing setup |
| Standard | 8 local | 4 suites | 50 | ~2 hours | Daily evaluation |
| Comprehensive | 15 local | 8 benchmarks | 100 | ~6 hours | Weekly analysis |
| Marathon ALL | 23 models | 16 benchmarks | 200 | ~12 hours | Complete evaluation |
Local model evaluation is FREE - no API costs!
smaLLMs focuses on reliability with automatic model discovery:
Auto-discovered from Ollama & LM Studio:
- 15 Ollama models - Automatically detected and configured
- 8 LM Studio models - Seamlessly integrated
- Progressive timeouts - Handles slower local inference
- Efficient caching - Faster repeat evaluations
Battle-tested HuggingFace models:
- `google/gemma-2-2b-it` - Google's efficient instruction model
- `Qwen/Qwen2.5-1.5B-Instruct` - Alibaba's optimized model
- `meta-llama/Llama-3.2-1B-Instruct` - Meta's compact model
- `HuggingFaceTB/SmolLM2-1.7B-Instruct` - HF's optimized model
- Plus 6 more proven models
Marathon Mode automatically discovers your available models - no manual configuration needed!
```
smaLLMs/
├── smaLLMs.py                 # Main Marathon Mode launcher (ALL-IN-ONE)
├── intelligent_evaluator.py   # Smart evaluation engine
├── simple_exporter.py         # Results export & website generation
├── beautiful_terminal.py      # Color terminal interface
├── test_everything.py         # Comprehensive test suite (15 tests)
├── check_local_services.py    # Local model discovery utility
├── config/
│   ├── config.yaml            # Your configuration (cloud only)
│   ├── config.example.yaml    # Example configuration
│   └── models.yaml            # Model definitions
├── src/                       # Core evaluation modules
│   ├── models/                # Model management & discovery
│   ├── benchmarks/            # 16 benchmark implementations
│   ├── evaluator.py           # Evaluation orchestration
│   ├── metrics/               # Result analysis & aggregation
│   ├── utils/                 # Storage and utilities
│   └── web/                   # Optional web interface
└── smaLLMs_results/           # Marathon Mode results
    └── 2025-MM-DD/            # Date-based organization
        └── run_HHMMSS/        # Time-stamped runs
            ├── individual_results/   # Raw benchmark data
            ├── reports/              # Human-readable summaries
            └── exports/              # Website/analysis exports
```
Everything you need in 17 essential files - no bloat!
```python
# Automatic model discovery across platforms
models = discover_local_models()      # Finds Ollama + LM Studio
benchmarks = load_benchmark_suite()   # All 16 AI studio benchmarks
```

- Progressive Timeouts: Adapts to local model inference speed
- Smart Sampling: Optimizes evaluation depth based on model performance
- Error Recovery: Robust handling of model failures and timeouts
- Progress Tracking: Real-time updates with beautiful terminal interface
- Ollama API: Direct integration with local Ollama models
- LM Studio API: Seamless connection to LM Studio inference server
- Unified Interface: Same benchmarks work across all model types
- No API Costs: Free evaluation of local models
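As a rough illustration of the Ollama and LM Studio integration above, the sketch below queries each server's local HTTP API for its model list. It assumes the default local ports (Ollama on 11434, LM Studio on 1234) and is a simplified stand-in for smaLLMs' own discovery logic, not the actual implementation:

```python
import requests

def discover_ollama_models(base_url="http://localhost:11434"):
    """List models served by a local Ollama instance via its /api/tags endpoint."""
    try:
        resp = requests.get(f"{base_url}/api/tags", timeout=5)
        resp.raise_for_status()
        return [m["name"] for m in resp.json().get("models", [])]
    except requests.RequestException:
        return []  # Ollama not running -- skip quietly

def discover_lmstudio_models(base_url="http://localhost:1234"):
    """List models exposed by LM Studio's OpenAI-compatible /v1/models endpoint."""
    try:
        resp = requests.get(f"{base_url}/v1/models", timeout=5)
        resp.raise_for_status()
        return [m["id"] for m in resp.json().get("data", [])]
    except requests.RequestException:
        return []  # LM Studio server not running

if __name__ == "__main__":
    print("Ollama models:   ", discover_ollama_models())
    print("LM Studio models:", discover_lmstudio_models())
```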
- Auto Directory Creation: Date/timestamp-based result organization
- Multiple Formats: JSON for machines, human-readable summaries
- Export Ready: One-click website and analysis generation
- Resume Capability: Continue interrupted Marathon Mode runs
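To make the date/time-based organization and resume idea above concrete, here is a minimal sketch: results are written as one JSON file per (model, benchmark) pair under a timestamped run directory, so a resumed run can skip pairs that already have a file. The directory layout and file naming are illustrative assumptions, not smaLLMs' exact scheme:

```python
import json
from datetime import datetime
from pathlib import Path

def new_run_dir(root="smaLLMs_results"):
    """Create a timestamped run directory, e.g. smaLLMs_results/2025-07-01/run_093015/."""
    now = datetime.now()
    path = Path(root) / now.strftime("%Y-%m-%d") / f"run_{now.strftime('%H%M%S')}"
    (path / "individual_results").mkdir(parents=True, exist_ok=True)
    return path

def save_result(run_dir, model, benchmark, result):
    """Write one JSON file per (model, benchmark) pair."""
    safe_model = model.replace("/", "_").replace(":", "_")
    out = run_dir / "individual_results" / f"{safe_model}__{benchmark}.json"
    out.write_text(json.dumps(result, indent=2))

def already_done(run_dir):
    """Pairs finished in a previous attempt of this run -- lets a resumed run skip them."""
    done = set()
    for f in (run_dir / "individual_results").glob("*__*.json"):
        model, benchmark = f.stem.split("__", 1)
        done.add((model, benchmark))
    return done
```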
Marathon Mode with local models is completely FREE:
- No API costs for Ollama and LM Studio models
- Unlimited evaluations - run as many benchmarks as you want
- 23+ models available - comprehensive local model comparison
- Perfect for research and experimentation
When using cloud APIs, smaLLMs is optimized for efficiency:
- Smart Sampling: Don't waste tokens on failing models
- Progressive Evaluation: Start small, scale for promising models
- Rate Limiting: Respect free tier limits
- Early Stopping: Skip models that consistently fail
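The smart-sampling and early-stopping ideas above can be sketched as a two-stage loop: grade a small pilot batch, abandon models that clearly fail, and spend the full sample budget only on promising ones. The thresholds, object interfaces (`benchmark.samples`, `benchmark.grade`, `model.generate`), and function name below are illustrative assumptions, not smaLLMs' actual code:

```python
import random

def adaptive_evaluate(model, benchmark, pilot_n=10, full_n=100, min_pilot_accuracy=0.2):
    """Two-stage evaluation: cheap pilot first, full budget only if the pilot looks promising."""
    def score(samples):
        graded = [benchmark.grade(s, model.generate(s.prompt)) for s in samples]
        return sum(graded) / len(graded)

    pilot = random.sample(benchmark.samples, k=min(pilot_n, len(benchmark.samples)))
    pilot_score = score(pilot)

    if pilot_score < min_pilot_accuracy:
        # Early stopping: don't burn tokens on a model that is clearly failing this benchmark.
        return {"accuracy": pilot_score, "samples": len(pilot), "early_stopped": True}

    # Progressive evaluation: scale up for models that passed the pilot.
    full = random.sample(benchmark.samples, k=min(full_n, len(benchmark.samples)))
    return {"accuracy": score(full), "samples": len(full), "early_stopped": False}
```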
Typical cloud costs:
- Quick test (3 models, 2 benchmarks): ~$0.05
- Standard evaluation (8 models, 4 benchmarks): ~$0.30
- Comprehensive (15 models, 8 benchmarks): ~$1.20
```bash
python simple_exporter.py
```

Generates:
- Beautiful Websites: Interactive leaderboards and analysis
- Comparison Charts: Visual model performance comparisons
- CSV/JSON Data: Excel and analysis-ready formats
- Markdown Reports: AI assistant and documentation ready
- Leaderboards: Rank your local models against benchmarks
- REST API: Optional web interface (via FastAPI)
- JSON Data: Machine-readable results for custom analysis
- Modular Architecture: Easy to extend with custom benchmarks
- Plugin System: Add new model providers and benchmarks
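As an illustration of the modular/plugin idea above, a custom benchmark can be a small class that loads samples and grades responses. The class shape, method names, and exact-match grading below are assumptions for illustration; the real interface lives in `src/benchmarks/`:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str
    answer: str

class MyDomainBenchmark:
    """Minimal custom benchmark: provide samples and a grading rule."""
    name = "my_domain_qa"

    def load_samples(self, limit=50):
        # Replace with real data loading (e.g. the `datasets` library already in the stack).
        data = [
            Sample("What is 2 + 2?", "4"),
            Sample("Name the chemical symbol for gold.", "Au"),
        ]
        return data[:limit]

    def grade(self, sample, response):
        # Exact-match grading; swap in regex or model-based grading as needed.
        return 1.0 if sample.answer.strip().lower() in response.strip().lower() else 0.0
```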
Marathon Mode automatically discovers your models:
```bash
# Just run it - no configuration required!
python smaLLMs.py
```

```yaml
# config/config.yaml (only if using cloud models)
huggingface:
  token: "your_hf_token_here"

evaluation:
  default_samples: 100
  max_concurrent_requests: 3

marathon_mode:
  local_models_enabled: true
  auto_discovery: true
  result_organization: "date_time"
```

- Custom Benchmark Configurations: Modify sample sizes and parameters
- Model-Specific Settings: Timeout and generation configs per model
- Result Organization: Customize directory structure and naming
- Export Formats: Configure output formats and destinations
```bash
git clone https://github.com/mmdmcy/smaLLMs.git
cd smaLLMs
pip install -r requirements.txt
python smaLLMs.py
```

Choose your adventure:
- Local Models: Auto-discover and evaluate all Ollama + LM Studio models (FREE!)
- Cloud Models: Add HuggingFace token and evaluate cloud models
- Marathon Mode: Run ALL models with ALL 16 benchmarks overnight
- Custom: Pick specific models and benchmark suites
Join the local model revolution!
Engineered for performance, reliability, and extensibility.
- Adaptive Sampling Algorithm: Implements a dynamic confidence-based sampling strategy. Instead of fixed sample sizes, the system analyzes real-time model reliability and accuracy to automatically adjust evaluation depth—increasing samples for promising models while "fail-fast" logic minimizes resource waste on poor performers.
- AsyncIO Concurrency: Built on Python's `asyncio` event loop for non-blocking I/O. Uses `aiohttp` for high-concurrency API requests, enabling parallel evaluation of multiple models while maintaining responsive UI updates.
- Resource-Aware Scheduling: Features an intelligent scheduler that manages system resources, enforcing rate limits (token bucket strategy) and implementing progressive timeouts that adapt to local inference latencies vs. cloud API speeds.
- Unified Provider Abstraction: Implements the Strategy-Adapter Pattern to normalize interfaces across disparate backends. Whether communicating with a local Ollama instance via REST, an LM Studio server, or the HuggingFace Cloud API, the core `ModelManager` treats them uniformly (see the sketch after this list).
- Resilient Error Handling: "Marathon Mode" is built with fault tolerance at its core. It includes automatic session recovery, checkpointing (serialization of intermediate states), and comprehensive exception handling to ensure overnight runs complete successfully even if individual models crash.
- Modular Benchmark System: Uses a Factory Pattern for dynamic benchmark loading, allowing new test suites to be plugged in without modifying core orchestration logic.
- Real-time Analytics: Computes streaming metrics including "Reliability Score" (0-1 confidence metric) and "Value Score" (Accuracy per Dollar) during execution.
- Structured Serialization: Results are serialized into a hierarchical JSON structure with full metadata preservation, enabling historical trend analysis and long-term regression testing.
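The sketch below illustrates the unified provider abstraction referenced above: a tiny adapter interface that the orchestrator can call without knowing whether the backend is Ollama, LM Studio, or a cloud API. Class names and payload handling are illustrative assumptions, not the actual `ModelManager` internals; the Ollama `/api/generate` and OpenAI-compatible `/v1/chat/completions` endpoints are the standard public ones:

```python
import asyncio
from abc import ABC, abstractmethod

import aiohttp

class ModelProvider(ABC):
    """Common interface the orchestrator uses for every backend."""

    @abstractmethod
    async def generate(self, session: aiohttp.ClientSession, model: str, prompt: str) -> str: ...

class OllamaProvider(ModelProvider):
    def __init__(self, base_url="http://localhost:11434"):
        self.base_url = base_url

    async def generate(self, session, model, prompt):
        payload = {"model": model, "prompt": prompt, "stream": False}
        async with session.post(f"{self.base_url}/api/generate", json=payload) as resp:
            data = await resp.json()
            return data.get("response", "")

class OpenAICompatibleProvider(ModelProvider):
    """Works for LM Studio's local server (and similar OpenAI-compatible APIs)."""

    def __init__(self, base_url="http://localhost:1234/v1"):
        self.base_url = base_url

    async def generate(self, session, model, prompt):
        payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
        async with session.post(f"{self.base_url}/chat/completions", json=payload) as resp:
            data = await resp.json()
            return data["choices"][0]["message"]["content"]

async def evaluate_prompt(provider, model, prompt):
    """The caller never needs to know which backend it is talking to."""
    async with aiohttp.ClientSession() as session:
        return await provider.generate(session, model, prompt)

if __name__ == "__main__":
    print(asyncio.run(evaluate_prompt(OllamaProvider(), "llama3.2", "What is 7 * 6?")))
```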
Help make smaLLMs even better:
- New Benchmarks: Add domain-specific evaluation tasks
- Model Providers: Integrate new local and cloud model platforms
- Visualization: Enhance Marathon Mode result analysis
- Performance: Optimize local model inference and evaluation speed
MIT License - see LICENSE for details.
mmdmcy - GitHub
Building comprehensive local model evaluation with Marathon Mode - because your local models deserve AI studio-level benchmarking.
smaLLMs Marathon Mode - Run overnight evaluation of ALL your models with ALL benchmarks. Local is the new cloud.