BeyondBench (ctrl-gaurav/BeyondBench) [ICLR 2026 Accepted paper]

Contamination-Resistant Evaluation of Reasoning in Language Models

🏆 101+ Models Evaluated • 🧠 44 Reasoning Tasks • 🎯 117 Variations • 📊 >10¹⁵ Unique Instances

🌟 Explore Leaderboard | 📖 Read Paper | 📦 PyPI | 📚 Documentation

📢 Latest News

Date	Update
Mar 6, 2026	v0.1.0 released — FastAPI serve, CLI improvements, CI/CD, comprehensive tests. See Changelog
Feb 25, 2026	v0.0.2 released — critical bug fixes, much more stable! See Changelog
Feb 25, 2026	v0.0.1 released — 44 tasks, 117 variations, 101+ models
Jan 2026	Paper accepted at ICLR 2026
Jan 2026	Interactive leaderboard website launched
Sep 2025	Paper submitted: arXiv:2509.24210

💡 What is BeyondBench?

BeyondBench evaluates reasoning in language models without relying on static benchmarks. It generates novel problems on the fly across 44 reasoning tasks with 117 variations, so a model cannot score well by memorizing published solutions and has to solve each instance it is given.
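To make the idea of dynamic generation concrete, here is a minimal sketch of a generator that draws a fresh instance and computes its ground truth at the same time. It is an illustration only, not BeyondBench's actual generator: `Problem`, `generate_sum_problem`, `is_correct`, the prompt wording, and the value ranges are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    prompt: str  # text shown to the model
    answer: int  # ground truth computed by the generator


def generate_sum_problem(rng: random.Random, length: int = 8,
                         low: int = -100, high: int = 100) -> Problem:
    """Draw a fresh list and compute its sum as the ground truth (hypothetical task)."""
    numbers = [rng.randint(low, high) for _ in range(length)]
    prompt = f"Compute the sum of the following list: {numbers}"
    return Problem(prompt=prompt, answer=sum(numbers))


def is_correct(problem: Problem, model_output: str) -> bool:
    """Compare the last integer found in the reply against the ground truth."""
    for token in reversed(model_output.split()):
        try:
            return int(token.strip(".")) == problem.answer
        except ValueError:
            continue  # not an integer, keep scanning backwards
    return False


rng = random.Random(0)  # a fixed seed makes the run reproducible
problem = generate_sum_problem(rng)
print(problem.prompt)
print(is_correct(problem, f"The sum is {problem.answer}."))  # True
```

Because the answer is derived from the generated instance rather than looked up, no answer key exists in advance for a model to have seen during training.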

🌟 Key Highlights

🔄 Dynamic Problem Generation
Problem space >10^15 unique instances
Zero risk of data contamination
Fresh problems on every evaluation

🎯 Three Difficulty Levels
Easy: 29 fundamental reasoning tasks
Medium: 5 tasks with 49 variations
Hard: 10 tasks with 68 variations

🤖 Multi-Backend Support
OpenAI, Gemini, Anthropic APIs
vLLM for high-throughput local inference
HuggingFace Transformers

📊 Comprehensive Metrics
Accuracy across difficulty levels
Instruction-following compliance
Token efficiency analysis

🛡️ Contamination-Resistant
No static benchmark memorization
Novel problem generation
Fair model comparison

⚡ Extensive Coverage
101+ models evaluated
Open-source and proprietary
Regular updates with new models
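As a rough feel for how a problem space can exceed 10^15 instances, the sketch below counts the distinct inputs of a single list-based task. The parameters (8 integers drawn from -100 to 100) are hypothetical and are not BeyondBench's actual settings; the point is only that the count grows exponentially with list length.

```python
def instance_count(length: int, low: int, high: int) -> int:
    """Number of distinct integer lists of a fixed length with entries in [low, high]."""
    values_per_slot = high - low + 1
    return values_per_slot ** length


# Hypothetical setting: 8 integers, each between -100 and 100 inclusive.
count = instance_count(length=8, low=-100, high=100)
print(count > 10**15)  # True: 201**8 is roughly 2.66e18
```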

🚀 Installation

From PyPI

pip install beyondbench

From Source

git clone https://github.com/ctrl-gaurav/BeyondBench.git
cd BeyondBench
pip install -e .

With Optional Dependencies

# All API clients (OpenAI, Gemini, Anthropic)
# Quote the extras so shells such as zsh do not expand the brackets
pip install "beyondbench[all-apis]"

# vLLM support (requires CUDA)
pip install "beyondbench[vllm]"

# Everything
pip install "beyondbench[full]"

⚡ Quick Start

Interactive Wizard

beyondbench

Command Line

# Evaluate GPT-4o on the easy suite
beyondbench evaluate --model-id gpt-4o --api-provider openai --suite easy

# Evaluate a local model with vLLM
beyondbench evaluate --model-id meta-llama/Llama-3.2-3B-Instruct --backend vllm --suite all

# Evaluate Claude on hard tasks
beyondbench evaluate --model-id claude-sonnet-4-20250514 --api-provider anthropic --suite hard

# List available tasks
beyondbench list-tasks

Python API

from beyondbench import EvaluationEngine, ModelHandler, TaskRegistry

# Initialize model handler
model = ModelHandler(
    model_id="gpt-4o",
    api_provider="openai",
    api_key="your-api-key"
)

# Run evaluation
engine = EvaluationEngine(model_handler=model, output_dir="./results")
results = engine.run_evaluation(suite="easy", datapoints=100)

# Print results
print(f"Average Accuracy: {results['summary']['avg_accuracy']:.2%}")

API Server

# Start the BeyondBench API server
# (--host 0.0.0.0 listens on all network interfaces; use 127.0.0.1 for local-only access)
beyondbench serve --host 0.0.0.0 --port 8000

# API docs at http://localhost:8000/docs

Configuration Files

# Create a config interactively
beyondbench init

# Run from config file
beyondbench run-config beyondbench/configs/default.yaml

Results Viewer

# List past results
beyondbench results list

# Show detailed results
beyondbench results show ./beyondbench_results/final_results.json

# Compare two evaluations
beyondbench results compare result_a.json result_b.json

# Get task info
beyondbench info sorting
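If you prefer to compare two runs from Python rather than the CLI, a sketch along these lines may help. It assumes the saved JSON mirrors the `results['summary']['avg_accuracy']` structure used in the Python API example and stores accuracy as a fraction; the on-disk layout is an assumption, and `load_avg_accuracy` and `accuracy_delta` are hypothetical helpers, not part of the package.

```python
import json
from pathlib import Path


def load_avg_accuracy(path: str) -> float:
    """Read summary.avg_accuracy from a results JSON file (file layout assumed)."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return float(data["summary"]["avg_accuracy"])


def accuracy_delta(path_a: str, path_b: str) -> float:
    """Return accuracy(b) minus accuracy(a), expressed in percentage points."""
    return 100.0 * (load_avg_accuracy(path_b) - load_avg_accuracy(path_a))
```

For example, `accuracy_delta("result_a.json", "result_b.json")` would return `25.0` if the first run averaged 0.50 and the second 0.75.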

🔌 Supported Backends

Backend	Models	Features
OpenAI	GPT-4o, GPT-4o-mini, GPT-5, GPT-5-mini	Reasoning effort control
Gemini	Gemini 2.5 Pro, Gemini 2.5 Flash	Thinking budget configuration
Anthropic	Claude Sonnet 4, Claude Opus 4	Latest Claude models
vLLM	Any HuggingFace model	Batch processing, tensor parallelism
Transformers	Any HuggingFace model	CPU/GPU inference

📊 Results

🏆 Leaderboard (Top Models)

🏅 Rank	🤖 Model	📊 Overall	🎯 Instruction Following
🥇	GPT-5*	83.56%	96.15%
🥈	GPT-5-Nano*	82.04%	93.58%
🥉	GPT-5-Mini*	81.67%	94.23%
4	o3*	80.36%	94.96%
5	o4-Mini*	79.04%	95.30%

* Models marked with * use reasoning/thinking tokens. Full results for 101+ models are available in the paper and on the leaderboard.

🔍 Key Findings

Reasoning Gap: Even top models show 20-30% performance drops on hard reasoning tasks
Scaling Effects: Larger models generally perform better, but the relationship is not always linear
Instruction vs. Accuracy: High accuracy does not guarantee perfect instruction-following

🧩 Task Suites

Easy Suite (29 Tasks)

Category	Tasks
Arithmetic	sum, multiplication, subtraction, division, absolute_difference
Statistics	mean, median, mode
Counting	odd_count, even_count, count_negative, count_unique, count_greater_than_previous, count_palindromic, count_perfect_squares, count_multiples, local_maxima_count
Extrema	find_maximum, find_minimum, second_maximum, range, index_of_maximum, max_adjacent_difference, sum_of_max_indices
Sequences	sorting, longest_increasing_subsequence, alternating_sum, sum_of_digits
Comparison	comparison
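The task names above are mostly self-describing. As a sketch, here are plausible reference implementations for three of them. The exact conventions (sign order in `alternating_sum`, strict inequalities, whether endpoints count as local maxima) are assumptions and may differ from the definitions BeyondBench uses.

```python
from typing import List


def alternating_sum(xs: List[int]) -> int:
    """x0 - x1 + x2 - ... (sign convention assumed)."""
    return sum(x if i % 2 == 0 else -x for i, x in enumerate(xs))


def count_greater_than_previous(xs: List[int]) -> int:
    """Count elements strictly greater than the element just before them."""
    return sum(1 for prev, cur in zip(xs, xs[1:]) if cur > prev)


def local_maxima_count(xs: List[int]) -> int:
    """Count interior elements strictly greater than both neighbours (assumed definition)."""
    return sum(1 for a, b, c in zip(xs, xs[1:], xs[2:]) if b > a and b > c)


sample = [3, 8, 2, 9, 9, 4]
print(alternating_sum(sample))              # -7
print(count_greater_than_previous(sample))  # 2
print(local_maxima_count(sample))           # 1
```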

Medium Suite (5 Tasks, 49 Variations)

Task	Variations
Fibonacci Sequence	6 (Tribonacci, Lucas numbers, modified recursive)
Algebraic Sequence	10 (Polynomial, arithmetic, quadratic)
Geometric Sequence	10 (Exponential, compound growth, factorial)
Prime Sequence	11 (Prime gaps, twin primes, Sophie Germain)
Complex Pattern	12 (Interleaved, conditional, multi-rule)
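Several of the Fibonacci-style variations listed above share one recurrence and differ only in their starting values or window size. The sketch below shows that relationship with a single hypothetical helper, `linear_recurrence`; it illustrates the mathematics and is not how BeyondBench generates its sequences.

```python
from itertools import islice
from typing import Iterator, List, Sequence


def linear_recurrence(seed: Sequence[int]) -> Iterator[int]:
    """Yield a sequence in which each new term is the sum of the previous len(seed) terms."""
    window: List[int] = list(seed)
    while True:
        yield window[0]
        window = window[1:] + [sum(window)]


fibonacci = list(islice(linear_recurrence([0, 1]), 8))
lucas = list(islice(linear_recurrence([2, 1]), 8))
tribonacci = list(islice(linear_recurrence([0, 0, 1]), 8))

print(fibonacci)   # [0, 1, 1, 2, 3, 5, 8, 13]
print(lucas)       # [2, 1, 3, 4, 7, 11, 18, 29]
print(tribonacci)  # [0, 0, 1, 1, 2, 4, 7, 13]
```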

Hard Suite (10 Tasks, 68 Variations)

Task	Variations	Complexity
Tower of Hanoi	6	O(2^n) moves
N-Queens	4	Constraint satisfaction (the completion variant is NP-complete)
Graph Coloring	10	NP-complete
Boolean SAT	5	NP-complete
Sudoku	8	Constraint satisfaction
Cryptarithmetic	12	Constraint satisfaction
Matrix Chain	5	Dynamic programming
Modular Systems	5	Number theory
Constraint Optimization	5	Operations research
Logic Grid Puzzles	8	Deductive reasoning
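Hard-suite problems such as Tower of Hanoi can have their answers checked by simulation rather than by string matching against a reference. The sketch below pairs the standard recursive solver with a move-by-move validator. The `(from_peg, to_peg)` move encoding and both function names are hypothetical and do not reflect BeyondBench's answer format.

```python
from typing import List, Tuple

Move = Tuple[int, int]  # (from_peg, to_peg), pegs numbered 0..2


def solve_hanoi(n: int, src: int = 0, dst: int = 2, aux: int = 1) -> List[Move]:
    """Standard recursive solution; produces 2**n - 1 moves."""
    if n == 0:
        return []
    return (solve_hanoi(n - 1, src, aux, dst)
            + [(src, dst)]
            + solve_hanoi(n - 1, aux, dst, src))


def is_valid_solution(n: int, moves: List[Move], src: int = 0, dst: int = 2) -> bool:
    """Simulate the moves and check every rule plus the final state."""
    pegs: List[List[int]] = [[], [], []]
    pegs[src] = list(range(n, 0, -1))  # largest disk at the bottom
    for a, b in moves:
        if a not in (0, 1, 2) or b not in (0, 1, 2) or not pegs[a]:
            return False  # unknown peg or nothing to move
        if pegs[b] and pegs[b][-1] < pegs[a][-1]:
            return False  # larger disk placed on a smaller one
        pegs[b].append(pegs[a].pop())
    return pegs[dst] == list(range(n, 0, -1))


moves = solve_hanoi(4)
print(len(moves))                   # 15
print(is_valid_solution(4, moves))  # True
```

Validating by simulation accepts any legal solution, not only the shortest one, which matters when a model finds a correct but non-canonical answer.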

📚 Documentation

Full Documentation — Complete API reference and configuration guide
Usage Guide — Detailed usage examples for all backends

Environment Variables

export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..."
export ANTHROPIC_API_KEY="sk-ant-..."
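When scripting evaluations, it can help to fail early if the key for the chosen provider is missing. Below is a minimal sketch using only the standard library. The variable names come from the exports above; the provider string "gemini" and the helper `resolve_api_key` are assumptions for illustration.

```python
import os
from typing import Mapping, Optional

# Provider name -> environment variable (variable names taken from this README).
API_KEY_VARS = {
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",  # provider string assumed
    "anthropic": "ANTHROPIC_API_KEY",
}


def resolve_api_key(provider: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key for a provider, or raise a clear error if it is unset."""
    env = os.environ if env is None else env
    if provider not in API_KEY_VARS:
        raise ValueError(f"unknown provider: {provider!r}")
    var = API_KEY_VARS[provider]
    key = env.get(var, "").strip()
    if not key:
        raise RuntimeError(f"{var} is not set")
    return key
```

The returned value could then be passed as the `api_key` argument shown in the Python API example instead of hard-coding the key in source.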

🤝 Contributing

We welcome contributions! See the Contributing Guide for details.

git clone https://github.com/ctrl-gaurav/BeyondBench.git
cd BeyondBench
pip install -e ".[dev]"
pre-commit install
pytest tests/ -v

🛠️ Ways to Contribute

🐛 Bug Reports: Found an issue? Report it here
✨ Feature Requests: Have ideas? Share them here
🔧 Code Contributions: Submit PRs for improvements
📚 Documentation: Help improve our docs
🤖 Model Submissions: Suggest models for evaluation

📝 Citation

If you use BeyondBench in your research, please cite our paper (accepted at ICLR 2026):

@misc{srivastava2025beyondbenchbenchmarkfreeevaluationreasoning,
      title={BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models},
      author={Gaurav Srivastava and Aafiya Hussain and Zhenyu Bi and Swastik Roy and Priya Pitre and Meng Lu and Morteza Ziyadi and Xuan Wang},
      year={2025},
      eprint={2509.24210},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.24210},
}

📞 Contact & Support

📧 Email: gks@vt.edu, xuanw@vt.edu
🐛 Issues: GitHub Issues
💬 Discussions: GitHub Discussions

📜 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

🚀 Ready to Explore the Future of AI Evaluation?

Made with ❤️ by the BeyondBench Team

Advancing the frontier of AI reasoning evaluation, one benchmark at a time 🌟

🏠 Home	📊 Leaderboard	📖 Paper	💻 Code
Main website	Interactive rankings	Research paper	Source code

🎯 Transform your understanding of AI capabilities. BeyondBench reveals what language models can truly reason about, beyond memorization. Start exploring now →
