smaLLMs - AI Studio-Level Benchmarking Platform with Marathon Mode

MIT License · Python 3.8+ · Local & Cloud · Marathon Mode · OpenAI-Level

A production-ready, cost-optimized benchmarking platform designed specifically for evaluating small language models.

Marathon Mode: Run overnight evaluation of ALL your local models with ALL 16 benchmarks
Supports Ollama, LM Studio, and cloud APIs with comprehensive, AI studio-level evaluation.

[Screenshots: smaLLMs demo, benchmarks, and supported models]

What is smaLLMs?

smaLLMs is the most comprehensive local and cloud LLM evaluation platform, supporting ALL 16 benchmarks used by OpenAI, Anthropic, Google DeepMind, and xAI. Evaluate your Ollama and LM Studio models with the same rigor as top AI labs.

Key Features

  • Marathon Mode: Overnight evaluation of ALL local models with ALL benchmarks
  • 16 AI Studio Benchmarks: Complete suite including AIME, GPQA, Codeforces, HealthBench
  • Local + Cloud: Seamlessly works with Ollama (15 models), LM Studio (8 models), and cloud APIs
  • One Command: python smaLLMs.py - everything integrated into a single file
  • Cost-Optimized: Smart sampling and rate limiting for efficient evaluation
  • Beautiful Interface: Real-time progress with color-coded results
  • Production-Ready: Battle-tested evaluation methodology
  • Organized Results: Date-based structure with clean exports
  • Windows Compatible: Full Unicode support and robust error handling

NEW: Marathon Mode + 16 AI Studio Benchmarks

Marathon Mode

Run overnight comprehensive evaluation of ALL your models:

  • Auto-Discovery: Finds all 23+ local models (Ollama + LM Studio)
  • Smart Selection: Choose specific models or run ALL discovered models
  • Benchmark Suites: 18 different benchmark combinations to choose from
  • Progress Tracking: Real-time updates and resume capability
  • Organized Results: Clean date/time-based result organization
  • Windows Compatible: Full Unicode support and robust timeout handling

16 AI Studio Benchmarks

Complete benchmark suite matching major AI companies:

Competition & Expert Level

  • AIME 2024/2025: American Invitational Mathematics Examination (o3/o4 level)
  • GPQA Diamond: PhD-level science questions (Google-Proof Q&A)
  • Codeforces: Competitive programming with Elo ratings
  • HLE: Humanity's Last Exam - Expert cross-domain evaluation
  • HealthBench: Medical conversation safety (includes Hard variant)
  • TauBench: Function calling and tool use evaluation

Core Academic Standards

  • GSM8K: Grade school mathematics reasoning
  • MMLU: Massive multitask language understanding
  • MATH: Mathematical reasoning and competition problems
  • HumanEval: Code generation and programming capabilities
  • ARC: AI2 Reasoning Challenge (grade-school science questions)
  • HellaSwag: Commonsense reasoning

Advanced Reasoning

  • WinoGrande: Large-scale Winograd schema-style commonsense reasoning
  • BoolQ: Boolean question answering
  • OpenBookQA: Multi-step reasoning with facts
  • PIQA: Physical interaction question answering

18 Benchmark Suites Available

  • Individual Benchmarks: Any single benchmark (16 options)
  • OpenAI Suite: Complete o3/o4 benchmark set
  • Competition Suite: AIME + Codeforces + MATH
  • Expert Suite: GPQA + HLE + HealthBench
  • Academic Suite: MMLU + GSM8K + HumanEval
  • Reasoning Suite: ARC + HellaSwag + WinoGrande
  • Comprehensive Suite: Best 8-benchmark coverage
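
Conceptually, a suite is just a named group of benchmarks. The mapping below is a hypothetical illustration; the identifier names and groupings are assumptions, not smaLLMs' exact configuration:

# Hypothetical suite-to-benchmark mapping; identifiers are illustrative only.
BENCHMARK_SUITES = {
    "competition": ["aime_2024", "aime_2025", "codeforces", "math"],
    "expert": ["gpqa_diamond", "hle", "healthbench"],
    "academic": ["mmlu", "gsm8k", "humaneval"],
    "reasoning": ["arc", "hellaswag", "winogrande"],
}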

Confirmed Working Local Models (23+)

Ollama Models (15 discovered)

  • llama3.2 - Meta's latest compact models
  • qwen2.5 - Alibaba's optimized instruction models
  • qwen2.5-coder - Specialized coding variants
  • granite3.2 - IBM's enterprise-ready models
  • deepseek-r1 - Reasoning-focused models
  • gemma-3 - Google's efficient instruction models
  • liquid - High-performance compact models
  • And 8+ more automatically discovered

LM Studio Models (8 discovered)

  • Meta Llama variants - Multiple sizes and versions
  • Qwen2.5 series - Instruction and coder variants
  • Google Gemma models - Various parameter sizes
  • Granite models - IBM's latest offerings
  • DeepSeek variants - Reasoning and general models

Cloud Models (HuggingFace)

All models from the original cloud configuration are still supported for comparison.

Technology Stack

Core Technologies

  • Python 3.8+: Modern async/await patterns for concurrent evaluation
  • HuggingFace Hub: Direct API integration for model inference
  • Datasets Library: Standardized benchmark data loading
  • AsyncIO: Non-blocking concurrent model evaluation
  • YAML: Human-readable configuration management

Data & Analytics

  • Pandas: Data manipulation and analysis
  • NumPy: Numerical computing for metrics calculation
  • SciPy: Statistical analysis and significance testing
  • Matplotlib/Seaborn: Data visualization for reports

Web & Interface

  • Gradio: Optional web interface for interactive evaluation
  • FastAPI: REST API for programmatic access
  • Beautiful Terminal: Custom ANSI-colored terminal interface
  • HTML Export: Static website generation from results

Evaluation Framework

  • Custom Benchmarks: Modular benchmark system
  • Async Model Manager: Efficient model loading and inference
  • Result Aggregation: Statistical analysis and ranking
  • Cost Estimation: Real-time API cost tracking

Quick Start

1. Installation

git clone https://github.com/mmdmcy/smaLLMs.git
cd smaLLMs
pip install -r requirements.txt

2. Setup Local Models (Optional)

# Install Ollama (if you want local models)
# Windows: Download from https://ollama.ai
# Then pull some models:
ollama pull llama3.2
ollama pull qwen2.5:0.5b
ollama pull granite3.2:2b

# Or use LM Studio: Download from https://lmstudio.ai

3. Configuration (Cloud models only)

# Only needed if using cloud models
cp config/config.example.yaml config/config.yaml
# Add your HuggingFace token to config/config.yaml

4. Run Marathon Mode

python smaLLMs.py

Marathon Mode Options:

  • Local: Auto-discover and evaluate all Ollama + LM Studio models
  • Cloud: Evaluate HuggingFace models (requires config)
  • Choose Models: Select specific models from 23+ discovered
  • Choose Benchmarks: Pick from 18 benchmark suite options
  • Run ALL: Overnight evaluation of everything!

5. Export & Analysis

python simple_exporter.py

Generate beautiful websites, leaderboards, and analysis reports from your Marathon Mode results.
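
For custom analysis, the per-run JSON files can also be loaded directly. The snippet below is a minimal sketch using pandas; the column names (model, benchmark, accuracy) are assumptions about the result schema, and simple_exporter.py does considerably more:

import json
from pathlib import Path

import pandas as pd

# Collect every individual result JSON from all Marathon Mode runs.
rows = [json.loads(p.read_text())
        for p in Path("smaLLMs_results").rglob("individual_results/*.json")]
df = pd.DataFrame(rows)  # assumed columns: model, benchmark, accuracy, ...

# Average accuracy per model across benchmarks, highest first.
leaderboard = df.groupby("model")["accuracy"].mean().sort_values(ascending=False)
leaderboard.to_csv("leaderboard.csv")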

Marathon Mode Performance

| Setup         | Models        | Benchmarks    | Samples | Duration  | Use Case            |
|---------------|---------------|---------------|---------|-----------|---------------------|
| Quick Test    | 3 local       | 2 core        | 25      | ~15 min   | Testing setup       |
| Standard      | 8 local       | 4 suites      | 50      | ~2 hours  | Daily evaluation    |
| Comprehensive | 15 local      | 8 benchmarks  | 100     | ~6 hours  | Weekly analysis     |
| Marathon      | ALL 23 models | 16 benchmarks | 200     | ~12 hours | Complete evaluation |

Local model evaluation is FREE - no API costs!

Confirmed Working Models

smaLLMs focuses on reliability with automatic model discovery:

Local Models (FREE)

Auto-discovered from Ollama & LM Studio:

  • 15 Ollama models - Automatically detected and configured
  • 8 LM Studio models - Seamlessly integrated
  • Progressive timeouts - Handles slower local inference
  • Efficient caching - Faster repeat evaluations

Cloud Models (API Required)

Battle-tested HuggingFace models:

  • google/gemma-2-2b-it - Google's efficient instruction model
  • Qwen/Qwen2.5-1.5B-Instruct - Alibaba's optimized model
  • meta-llama/Llama-3.2-1B-Instruct - Meta's compact model
  • HuggingFaceTB/SmolLM2-1.7B-Instruct - HF's optimized model
  • Plus 6 more proven models

Marathon Mode automatically discovers your available models - no manual configuration needed!

File Structure (Streamlined)

smaLLMs/
├── smaLLMs.py              # Main Marathon Mode launcher (ALL-IN-ONE)
├── intelligent_evaluator.py # Smart evaluation engine
├── simple_exporter.py      # Results export & website generation
├── beautiful_terminal.py   # Color terminal interface
├── test_everything.py      # Comprehensive test suite (15 tests)
├── check_local_services.py # Local model discovery utility
├── config/
│   ├── config.yaml            # Your configuration (cloud only)
│   ├── config.example.yaml    # Example configuration
│   └── models.yaml            # Model definitions
├── src/                    # Core evaluation modules
│   ├── models/                # Model management & discovery
│   ├── benchmarks/            # 16 benchmark implementations
│   ├── evaluator.py           # Evaluation orchestration
│   ├── metrics/               # Result analysis & aggregation
│   ├── utils/                 # Storage and utilities
│   └── web/                   # Optional web interface
└── smaLLMs_results/        # Marathon Mode results
    └── 2025-MM-DD/            # Date-based organization
        └── run_HHMMSS/        # Time-stamped runs
            ├── individual_results/ # Raw benchmark data
            ├── reports/           # Human-readable summaries
            └── exports/           # Website/analysis exports

Everything you need in 17 essential files - no bloat!

How It Works

1. Marathon Mode Discovery

# Automatic model discovery across platforms
models = discover_local_models()  # Finds Ollama + LM Studio
benchmarks = load_benchmark_suite()  # All 16 AI studio benchmarks
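
Under the hood, discovery can be as simple as querying the default local endpoints: Ollama lists installed models at /api/tags (port 11434) and LM Studio serves an OpenAI-compatible /v1/models endpoint (port 1234). The helper names below are illustrative, not smaLLMs' internal API:

import requests
from typing import List

def discover_ollama(base_url: str = "http://localhost:11434") -> List[str]:
    """List installed Ollama models via the /api/tags endpoint."""
    try:
        resp = requests.get(f"{base_url}/api/tags", timeout=5)
        resp.raise_for_status()
        return [m["name"] for m in resp.json().get("models", [])]
    except requests.RequestException:
        return []  # Ollama is not running

def discover_lm_studio(base_url: str = "http://localhost:1234/v1") -> List[str]:
    """List loaded LM Studio models via the OpenAI-compatible /models endpoint."""
    try:
        resp = requests.get(f"{base_url}/models", timeout=5)
        resp.raise_for_status()
        return [m["id"] for m in resp.json().get("data", [])]
    except requests.RequestException:
        return []  # LM Studio server is not running

local_models = discover_ollama() + discover_lm_studio()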

2. Intelligent Orchestration

  • Progressive Timeouts: Adapts to local model inference speed (see the sketch after this list)
  • Smart Sampling: Optimizes evaluation depth based on model performance
  • Error Recovery: Robust handling of model failures and timeouts
  • Progress Tracking: Real-time updates with beautiful terminal interface
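
A minimal sketch of the progressive-timeout idea, assuming a hypothetical async generate callable: each retry gets a longer deadline, so slow local models get more time while a stuck one cannot stall the whole run.

import asyncio

async def query_with_progressive_timeout(generate, prompt, timeouts=(30, 60, 120)):
    """Retry generate(prompt) with progressively longer per-attempt deadlines."""
    for attempt, limit in enumerate(timeouts, start=1):
        try:
            return await asyncio.wait_for(generate(prompt), timeout=limit)
        except asyncio.TimeoutError:
            if attempt == len(timeouts):
                raise  # give up after the most generous deadline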

3. Local Model Integration

  • Ollama API: Direct integration with local Ollama models
  • LM Studio API: Seamless connection to LM Studio inference server
  • Unified Interface: Same benchmarks work across all model types
  • No API Costs: Free evaluation of local models
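
A simplified sketch of what a unified generate() can look like across the two local backends, using their default endpoints (Ollama's /api/generate and LM Studio's OpenAI-compatible /v1/chat/completions). The function signature here is an assumption, not smaLLMs' actual interface:

import requests

def generate(provider: str, model: str, prompt: str, timeout: int = 120) -> str:
    """Send one prompt to a local model and return its text completion."""
    if provider == "ollama":
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=timeout,
        )
        return resp.json()["response"]
    if provider == "lm_studio":  # OpenAI-compatible server
        resp = requests.post(
            "http://localhost:1234/v1/chat/completions",
            json={"model": model, "messages": [{"role": "user", "content": prompt}]},
            timeout=timeout,
        )
        return resp.json()["choices"][0]["message"]["content"]
    raise ValueError(f"Unknown provider: {provider}")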

4. Organized Data Management

  • Auto Directory Creation: Date/timestamp-based result organization (see the sketch after this list)
  • Multiple Formats: JSON for machines, human-readable summaries
  • Export Ready: One-click website and analysis generation
  • Resume Capability: Continue interrupted Marathon Mode runs
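
The date/timestamp layout shown in the File Structure section above can be produced with a few lines of standard-library code; this sketch mirrors that layout, though the helper names are illustrative:

import json
from datetime import datetime
from pathlib import Path

def make_run_dir(root: str = "smaLLMs_results") -> Path:
    """Create smaLLMs_results/<date>/run_<time>/ with its standard subfolders."""
    now = datetime.now()
    run_dir = Path(root) / now.strftime("%Y-%m-%d") / f"run_{now.strftime('%H%M%S')}"
    for sub in ("individual_results", "reports", "exports"):
        (run_dir / sub).mkdir(parents=True, exist_ok=True)
    return run_dir

def save_result(run_dir: Path, model: str, benchmark: str, result: dict) -> None:
    """Write one benchmark result as JSON under individual_results/."""
    safe_model = model.replace("/", "_").replace(":", "_")
    out = run_dir / "individual_results" / f"{safe_model}_{benchmark}.json"
    out.write_text(json.dumps(result, indent=2))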

Cost Optimization & FREE Local Evaluation

FREE Local Models

Marathon Mode with local models is completely FREE:

  • No API costs for Ollama and LM Studio models
  • Unlimited evaluations - run as many benchmarks as you want
  • 23+ models available - comprehensive local model comparison
  • Perfect for research and experimentation

Cost-Efficient Cloud Models

When using cloud APIs, smaLLMs is optimized for efficiency:

  • Smart Sampling: Don't waste tokens on failing models
  • Progressive Evaluation: Start small, scale up for promising models (sketched after this list)
  • Rate Limiting: Respect free tier limits
  • Early Stopping: Skip models that consistently fail
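
The progressive-evaluation strategy above boils down to "spend a little first, and only spend more if the model earns it". A rough sketch, where run_benchmark is a hypothetical callable that evaluates n samples and returns accuracy:

def progressive_evaluate(run_benchmark, stages=(10, 50, 200), min_accuracy=0.2):
    """Evaluate in growing stages, stopping early on consistently poor results."""
    accuracy = 0.0
    for n_samples in stages:
        accuracy = run_benchmark(n_samples)
        if accuracy < min_accuracy:
            break  # early stopping: don't spend more tokens on a failing model
    return accuracy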

Typical cloud costs:

  • Quick test (3 models, 2 benchmarks): ~$0.05
  • Standard evaluation (8 models, 4 benchmarks): ~$0.30
  • Comprehensive (15 models, 8 benchmarks): ~$1.20

Export & Integration

Marathon Mode Results Export

python simple_exporter.py

Generates:

  • Beautiful Websites: Interactive leaderboards and analysis
  • Comparison Charts: Visual model performance comparisons
  • CSV/JSON Data: Excel and analysis-ready formats
  • Markdown Reports: AI assistant and documentation ready
  • Leaderboards: Rank your local models against benchmarks

Integration Ready

  • REST API: Optional web interface (via FastAPI)
  • JSON Data: Machine-readable results for custom analysis
  • Modular Architecture: Easy to extend with custom benchmarks
  • Plugin System: Add new model providers and benchmarks
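
As a sketch of the plugin idea, a registry plus a decorator is enough to add a new benchmark without touching the orchestration code; the class and method names below are hypothetical, not the exact interface in src/benchmarks/:

BENCHMARKS = {}

def register_benchmark(name):
    """Decorator that registers a benchmark class under a lookup name."""
    def decorator(cls):
        BENCHMARKS[name] = cls
        return cls
    return decorator

@register_benchmark("my_domain_qa")
class MyDomainQA:
    def load_samples(self, n):
        """Return n (question, reference_answer) pairs for evaluation."""
        raise NotImplementedError

    def score(self, prediction, reference):
        """Exact-match scoring; real benchmarks use task-specific metrics."""
        return float(prediction.strip() == reference.strip())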

Configuration

Local Models (No config needed!)

Marathon Mode automatically discovers your models:

# Just run it - no configuration required!
python smaLLMs.py

Cloud Models (Optional)

# config/config.yaml (only if using cloud models)
huggingface:
  token: "your_hf_token_here"

evaluation:
  default_samples: 100
  max_concurrent_requests: 3
  
marathon_mode:
  local_models_enabled: true
  auto_discovery: true
  result_organization: "date_time"

Advanced Customization

  • Custom Benchmark Configurations: Modify sample sizes and parameters
  • Model-Specific Settings: Timeout and generation configs per model
  • Result Organization: Customize directory structure and naming
  • Export Formats: Configure output formats and destinations

Get Started with Marathon Mode

git clone https://github.com/mmdmcy/smaLLMs.git
cd smaLLMs
pip install -r requirements.txt
python smaLLMs.py

Choose your adventure:

  1. Local Models: Auto-discover and evaluate all Ollama + LM Studio models (FREE!)
  2. Cloud Models: Add HuggingFace token and evaluate cloud models
  3. Marathon Mode: Run ALL models with ALL 16 benchmarks overnight
  4. Custom: Pick specific models and benchmark suites

Join the local model revolution!


🛠️ Technical Architecture & Engineering Highlights

Engineered for performance, reliability, and extensibility.

🧠 Intelligent Orchestration Engine

  • Adaptive Sampling Algorithm: Implements a dynamic confidence-based sampling strategy. Instead of fixed sample sizes, the system analyzes real-time model reliability and accuracy to automatically adjust evaluation depth—increasing samples for promising models while "fail-fast" logic minimizes resource waste on poor performers.
  • AsyncIO Concurrency: Built on Python's asyncio event loop for non-blocking I/O. Uses aiohttp for high-concurrency API requests, enabling parallel evaluation of multiple models while maintaining responsive UI updates.
  • Resource-Aware Scheduling: Features an intelligent scheduler that manages system resources, enforcing rate limits (token bucket strategy) and implementing progressive timeouts that adapt to local inference latencies vs. cloud API speeds.
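
A minimal asyncio token-bucket sketch of the rate-limiting strategy described above (illustrative, not smaLLMs' internal scheduler):

import asyncio
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    async def acquire(self) -> None:
        """Wait until one token is available, then consume it."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            await asyncio.sleep((1 - self.tokens) / self.rate)  # time until the next token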

🏗️ Robust System Design

  • Unified Provider Abstraction: Implements the Strategy-Adapter Pattern to normalize interfaces across disparate backends. Whether communicating with a local Ollama instance via REST, an LM Studio server, or the HuggingFace Cloud API, the core ModelManager treats them uniformly.
  • Resilient Error Handling: "Marathon Mode" is built with fault tolerance at its core. It includes automatic session recovery, checkpointing (serialization of intermediate states), and comprehensive exception handling to ensure overnight runs complete successfully even if individual models crash.
  • Modular Benchmark System: Uses a Factory Pattern for dynamic benchmark loading, allowing new test suites to be plugged in without modifying core orchestration logic.

📊 Data & Analytics Pipeline

  • Real-time Analytics: Computes streaming metrics including "Reliability Score" (0-1 confidence metric) and "Value Score" (Accuracy per Dollar) during execution.
  • Structured Serialization: Results are serialized into a hierarchical JSON structure with full metadata preservation, enabling historical trend analysis and long-term regression testing.
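
For reference, the two streaming metrics can be as simple as the ratios below; the exact formulas inside smaLLMs may differ, so treat this as an assumption:

def reliability_score(successful_requests: int, total_requests: int) -> float:
    """0-1 confidence metric: fraction of requests that completed successfully."""
    return successful_requests / total_requests if total_requests else 0.0

def value_score(accuracy: float, cost_usd: float) -> float:
    """Accuracy per dollar; local models (zero cost) get infinite value."""
    return accuracy / cost_usd if cost_usd > 0 else float("inf")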

Contributing

Help make smaLLMs even better:

  • New Benchmarks: Add domain-specific evaluation tasks
  • Model Providers: Integrate new local and cloud model platforms
  • Visualization: Enhance Marathon Mode result analysis
  • Performance: Optimize local model inference and evaluation speed

License

MIT License - see LICENSE for details.

Created By

mmdmcy - GitHub

Building comprehensive local model evaluation with Marathon Mode - because your local models deserve AI studio-level benchmarking.


smaLLMs Marathon Mode - Run overnight evaluation of ALL your models with ALL benchmarks. Local is the new cloud.
