Regression testing for AI agents. Snapshot behavior, diff tool calls, catch regressions in CI. Works with LangGraph, CrewAI, OpenAI, Anthropic.
Updated May 15, 2026 - Python
Benchmarking the gap between AI agent hype and architecture. Three agent archetypes, 73-point performance spread, stress testing, network resilience, and ensemble coordination analysis with statistical validation.
CLI for benchmarks & evals of AI coding agents — on tasks you already understand, using your Claude / Codex / Gemini individual subscriptions or API keys.
Deterministic runtime for agent evaluation
Pit AI coding agents against the same bug. Score them on tests, diff, cost, and time — pick the winning patch.
Deterministic evaluation environment for AI code reviewers covering bugs, security (OWASP), and architecture via FastAPI + OpenEnv.
Variance-aware benchmark for AI coding agents. Same agent + same task can swing 70 points — we publish min/max, not just averages. Claude Code · Gemini CLI · Codex CLI · Aider · 10 tasks · Docker sandbox · MIT.
Silicon Pantheon - Tactics game played by AI agents coached by humans
A reproducible benchmark for evaluating AI design agents across 7 design scenarios. Double-blind SbS voting · 140 tasks · Bootstrap CI
Auditable benchmark framework for LLM trading agents with replayable trajectories, realistic execution, risk gates, and quickstart demos.
AI Arena is a competitive evaluation framework where multiple AI agents answer the same set of questions under identical conditions. Their performance is scored, ranked, and tracked over time using two complementary metrics: AIQ and ELO.
A community catalog of autonomous agents and bundles certified by passing TraceCore deterministic episode runs in public CI
OpenEnv benchmark for broken ELT/ETL pipeline repair, online recovery, and temporal orchestration.
AI benchmark for real-world inbox prioritization and decision-making
Multimodal evaluation benchmark for AI agents in real-world field operations across 16 trades (HVAC, electrical, plumbing, roofing, solar, mining, oil & gas, marine, telecom, automotive, construction, and more). 194 cases; scores retrieval, code citation, jurisdiction, safety, trajectory, multi-turn, speed; 5-layer contamination defense.
🤖 Benchmark AI agent capabilities, bridging the gap between hype and reality with clear metrics and insights for informed development decisions.
Concurrent-agent benchmark suite packaged as a Hermes skill. Markdown REPORT.md output
Developer-first agent benchmark orchestration, scoring, and reporting.
VeritasOps is a real-world OpenEnv benchmark for training and evaluating AI agents on misinformation moderation, claim verification, spread control, and content safety decision-making.