
ctrl-gaurav/BeyondBench


Contamination-Resistant Evaluation of Reasoning in Language Models

πŸ† 101+ Models Evaluated β€’ 🧠 44 Reasoning Tasks β€’ 🎯 117 Variations β€’ πŸ“Š >1015 Unique Instances

🌟 Explore Leaderboard | πŸ“– Read Paper | πŸ“¦ PyPI | πŸ“š Documentation


πŸ“’ Latest News

| Date | Update |
|---|---|
| Mar 6, 2026 | v0.1.0 released β€” FastAPI serve, CLI improvements, CI/CD, comprehensive tests. See Changelog |
| Feb 25, 2026 | v0.0.2 released β€” critical bug fixes, much more stable! See Changelog |
| Feb 25, 2026 | v0.0.1 released β€” 44 tasks, 117 variations, 101+ models |
| Jan 2026 | Paper accepted at ICLR 2026 |
| Jan 2026 | Interactive leaderboard website launched |
| Sep 2025 | Paper submitted: arXiv:2509.24210 |

πŸ’‘ What is BeyondBench?

BeyondBench evaluates reasoning in language models without relying on static benchmarks. Instead of fixed test sets, it dynamically generates novel problems across 44 distinct reasoning tasks with 117 variations, so models cannot memorize solutions and must demonstrate genuine reasoning ability.

🌟 Key Highlights

πŸ”„ Dynamic Problem Generation

  • Problem space >10^15 unique instances
  • Zero risk of data contamination
  • Fresh problems on every evaluation

🎯 Three Difficulty Levels

  • Easy: 29 fundamental reasoning tasks
  • Medium: 5 tasks with 49 variations
  • Hard: 10 tasks with 68 variations

πŸ€– Multi-Backend Support

  • OpenAI, Gemini, Anthropic APIs
  • vLLM for high-throughput local inference
  • HuggingFace Transformers

πŸ“Š Comprehensive Metrics

  • Accuracy across difficulty levels
  • Instruction-following compliance
  • Token efficiency analysis

πŸ›‘οΈ Contamination-Resistant

  • No static benchmark memorization
  • Novel problem generation
  • Fair model comparison

⚑ Extensive Coverage

  • 101+ models evaluated
  • Open-source and proprietary
  • Regular updates with new models

πŸš€ Installation

From PyPI

```shell
pip install beyondbench
```

From Source

```shell
git clone https://github.com/ctrl-gaurav/BeyondBench.git
cd BeyondBench
pip install -e .
```

With Optional Dependencies

```shell
# All API clients (OpenAI, Gemini, Anthropic)
pip install beyondbench[all-apis]

# vLLM support (requires CUDA)
pip install beyondbench[vllm]

# Everything
pip install beyondbench[full]
```

⚑ Quick Start

Interactive Wizard

```shell
beyondbench
```

Command Line

```shell
# Evaluate GPT-4o on the easy suite
beyondbench evaluate --model-id gpt-4o --api-provider openai --suite easy

# Evaluate a local model with vLLM
beyondbench evaluate --model-id meta-llama/Llama-3.2-3B-Instruct --backend vllm --suite all

# Evaluate Claude on hard tasks
beyondbench evaluate --model-id claude-sonnet-4-20250514 --api-provider anthropic --suite hard

# List available tasks
beyondbench list-tasks
```

Python API

```python
from beyondbench import EvaluationEngine, ModelHandler, TaskRegistry

# Initialize model handler
model = ModelHandler(
    model_id="gpt-4o",
    api_provider="openai",
    api_key="your-api-key"
)

# Run evaluation
engine = EvaluationEngine(model_handler=model, output_dir="./results")
results = engine.run_evaluation(suite="easy", datapoints=100)

# Print results
print(f"Average Accuracy: {results['summary']['avg_accuracy']:.2%}")
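
Saved runs can be inspected the same way. The snippet below assumes only the `summary.avg_accuracy` structure used in the print statement above; the rest of the file layout shown here is illustrative:

```python
import json

# Hypothetical excerpt of a saved final_results.json, matching the keys above.
raw = '{"summary": {"avg_accuracy": 0.8356, "suite": "easy", "datapoints": 100}}'
results = json.loads(raw)
line = f"Average Accuracy: {results['summary']['avg_accuracy']:.2%}"
print(line)  # Average Accuracy: 83.56%
```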

API Server

```shell
# Start the BeyondBench API server
beyondbench serve --host 0.0.0.0 --port 8000

# API docs at http://localhost:8000/docs
```

Configuration Files

```shell
# Create a config interactively
beyondbench init

# Run from config file
beyondbench run-config beyondbench/configs/default.yaml
```

Results Viewer

```shell
# List past results
beyondbench results list

# Show detailed results
beyondbench results show ./beyondbench_results/final_results.json

# Compare two evaluations
beyondbench results compare result_a.json result_b.json

# Get task info
beyondbench info sorting
```

πŸ”Œ Supported Backends

| Backend | Models | Features |
|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, GPT-5, GPT-5-mini | Reasoning effort control |
| Gemini | Gemini 2.5 Pro, Gemini 2.5 Flash | Thinking budget configuration |
| Anthropic | Claude Sonnet 4, Claude Opus 4 | Latest Claude models |
| vLLM | Any HuggingFace model | Batch processing, tensor parallelism |
| Transformers | Any HuggingFace model | CPU/GPU inference |

πŸ“Š Results

πŸ† Leaderboard (Top Models)

πŸ… Rank πŸ€– Model πŸ“Š Overall 🎯 Instruction Following
πŸ₯‡GPT-5*83.56%96.15%
πŸ₯ˆGPT-5-Nano*82.04%93.58%
πŸ₯‰GPT-5-Mini*81.67%94.23%
4o3*80.36%94.96%
5o4-Mini*79.04%95.30%

*Models marked with * use reasoning/thinking tokens. Full results for 101+ models available in the paper and on the leaderboard.

πŸ” Key Findings

  • Reasoning Gap: Even top models show 20-30% performance drops on hard reasoning tasks
  • Scaling Effects: Larger models generally perform better, but the relationship is not always linear
  • Instruction vs. Accuracy: High accuracy does not guarantee perfect instruction-following

🧩 Task Suites

Easy Suite (29 Tasks)

| Category | Tasks |
|---|---|
| Arithmetic | sum, multiplication, subtraction, division, absolute_difference |
| Statistics | mean, median, mode |
| Counting | odd_count, even_count, count_negative, count_unique, count_greater_than_previous, count_palindromic, count_perfect_squares, count_multiples, local_maxima_count |
| Extrema | find_maximum, find_minimum, second_maximum, range, index_of_maximum, max_adjacent_difference, sum_of_max_indices |
| Sequences | sorting, longest_increasing_subsequence, alternating_sum, sum_of_digits |
| Comparison | comparison |
Medium Suite (5 Tasks, 49 Variations)

| Task | Variations |
|---|---|
| Fibonacci Sequence | 6 (Tribonacci, Lucas numbers, modified recursive) |
| Algebraic Sequence | 10 (Polynomial, arithmetic, quadratic) |
| Geometric Sequence | 10 (Exponential, compound growth, factorial) |
| Prime Sequence | 11 (Prime gaps, twin primes, Sophie Germain) |
| Complex Pattern | 12 (Interleaved, conditional, multi-rule) |
Hard Suite (10 Tasks, 68 Variations)

| Task | Variations | Complexity |
|---|---|---|
| Tower of Hanoi | 6 | O(2^n) moves |
| N-Queens | 4 | NP-complete |
| Graph Coloring | 10 | NP-complete |
| Boolean SAT | 5 | NP-complete |
| Sudoku | 8 | Constraint satisfaction |
| Cryptarithmetic | 12 | Constraint satisfaction |
| Matrix Chain | 5 | Dynamic programming |
| Modular Systems | 5 | Number theory |
| Constraint Optimization | 5 | Operations research |
| Logic Grid Puzzles | 8 | Deductive reasoning |
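
To give a sense of how the hard suite scales, the classic Tower of Hanoi requires 2^n βˆ’ 1 moves for n disks, which the recurrence M(n) = 2Β·M(nβˆ’1) + 1 below makes explicit (an illustration of the task's complexity, not BeyondBench's generator):

```python
def hanoi_moves(n: int) -> int:
    """Minimum moves for Tower of Hanoi with n disks: M(n) = 2*M(n-1) + 1 = 2**n - 1."""
    if n == 0:
        return 0
    return 2 * hanoi_moves(n - 1) + 1

# Every extra disk doubles the minimum solution length, so even a modest
# instance forces a long, fully determined chain of reasoning steps.
print(hanoi_moves(10))  # 1023
```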

πŸ“š Documentation

Environment Variables

```shell
export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..."
export ANTHROPIC_API_KEY="sk-ant-..."
```
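
From Python, the same variables can be read with the standard `os` module. The helper below is a generic sketch for failing fast on a missing key; it is not part of the beyondbench API:

```python
import os

def require_api_key(var_name: str) -> str:
    """Read an API key from the environment, raising a clear error if it is unset."""
    value = os.environ.get(var_name)
    if not value:
        raise RuntimeError(f"Set {var_name} before using the corresponding backend.")
    return value

# Example: key = require_api_key("OPENAI_API_KEY")
```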

🀝 Contributing

We welcome contributions! See the Contributing Guide for details.

```shell
git clone https://github.com/ctrl-gaurav/BeyondBench.git
cd BeyondBench
pip install -e ".[dev]"
pre-commit install
pytest tests/ -v
```

πŸ› οΈ Ways to Contribute

  • πŸ› Bug Reports: Found an issue? Report it here
  • ✨ Feature Requests: Have ideas? Share them here
  • πŸ”§ Code Contributions: Submit PRs for improvements
  • πŸ“š Documentation: Help improve our docs
  • πŸ€– Model Submissions: Suggest models for evaluation

πŸ“ Citation

If you use BeyondBench in your research, please cite our paper (accepted at ICLR 2026):

```bibtex
@misc{srivastava2025beyondbenchbenchmarkfreeevaluationreasoning,
      title={BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models},
      author={Gaurav Srivastava and Aafiya Hussain and Zhenyu Bi and Swastik Roy and Priya Pitre and Meng Lu and Morteza Ziyadi and Xuan Wang},
      year={2025},
      eprint={2509.24210},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.24210},
}
```

πŸ“ž Contact & Support


πŸ“œ License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.


πŸš€ Ready to Explore the Future of AI Evaluation?

Explore Leaderboard

Made with ❀️ by the BeyondBench Team

Virginia Tech Β· Amazon AGI

Advancing the frontier of AI reasoning evaluation, one benchmark at a time 🌟


| 🏠 Home | πŸ“Š Leaderboard | πŸ“– Paper | πŸ’» Code |
|---|---|---|---|
| Main website | Interactive rankings | Research paper | Source code |

🎯 Transform your understanding of AI capabilities. BeyondBench reveals what language models can truly reason about, beyond memorization. Start exploring now β†’

