hidai25 / eval-view

Regression testing for AI agents. Snapshot behavior, diff tool calls, catch regressions in CI. Works with LangGraph, CrewAI, OpenAI, Anthropic.

python testing cli mcp evaluation pytest regression-testing ai-agents autogen llm anthropic langchain-agent openai-assistants crewai langgraph agentic-ai agent-evaluation agent-benchmark

Updated May 15, 2026
Python

Cre4T3Tiv3 / ai-agents-reality-check

Benchmarking the gap between AI agent hype and architecture. Three agent archetypes, 73-point performance spread, stress testing, network resilience, and ensemble coordination analysis with statistical validation.

python open-source benchmarking reproducible-research statistical-analysis performance-testing network-resilience llm-agent llm-tools agent-architecture agentic-workflow agentic-ai agent-performance agent-evaluation ai-benchmarking agent-benchmark reality-check-ai-agent architectural-evaluation ensemble-coordination

Updated Apr 2, 2026
Python

collinear-ai / tau-trait

TraitBasis applied to TauBench

rl-envs rl-training agent-benchmark

Updated Nov 11, 2025
Python

NoesisVision / nasde-toolkit

CLI for benchmarks & evals of AI coding agents — on tasks you already understand, using your Claude / Codex / Gemini individual subscriptions or API keys.

Updated May 9, 2026
Python

justindobbs / Tracecore

Deterministic runtime for agent evaluation

reliability-engineering specification ai-agents benchmarking-framework autogen fastapi langchain observability-platform ai-evaluation-framework agent-testing agent-benchmark deterministic-testing autoresearch

Updated Mar 25, 2026
Python

he-yufeng / CodeJoust

Pit AI coding agents against the same bug. Score them on tests, diff, cost, and time — pick the winning patch.

python gemini codex cli-tool git-worktree llm aider claude-code coding-agent parallel-agents agent-benchmark ai-arena

Updated May 12, 2026
Python

ArshVermaGit / open-ev-code-handler

Deterministic evaluation environment for AI code reviewers covering bugs, security (OWASP), and architecture via FastAPI + OpenEnv.

security-audit ai static-analysis owasp code-review software-architecture evaluation-framework ai-agents fastapi llm llm-evaluation agent-benchmark openenv

Updated Apr 8, 2026
Python

jackjin1997 / AgentBench-Live

Variance-aware benchmark for AI coding agents. Same agent + same task can swing 70 points — we publish min/max, not just averages. Claude Code · Gemini CLI · Codex CLI · Aider · 10 tasks · Docker sandbox · MIT.

benchmark leaderboard evaluation variance reproducibility ai-agents aider llm-evaluation gemini-cli claude-code codex-cli agent-benchmark cli-agents

Updated May 14, 2026
Python

haoyifan / Silicon-Pantheon

Silicon Pantheon: a tactics game played by human-coached AI agents

mcp turn-based gpt strategy-game ai-agents llm claude-code agent-benchmark competitive-ai

Updated May 4, 2026
Python

SanJueLogic / MeiGen-DesignAgentBench

A reproducible benchmark for evaluating AI design agents across 7 design scenarios. Double-blind SbS voting · 140 tasks · Bootstrap CI

benchmark reproducible-research leaderboard evaluation side-by-side image-generation text-to-image creative-ai multimodal human-evaluation ai-evaluation generative-ai design-agent agent-benchmark

Updated Apr 24, 2026
Python

weich97 / TradeArena

Auditable benchmark framework for LLM trading agents with replayable trajectories, realistic execution, risk gates, and quickstart demos.

python benchmark reproducible-research portfolio-optimization quantitative-finance risk-management backtesting ai-agents financial-ai llm-agents auditability agent-evaluation trading-agents agent-benchmark execution-simulation

Updated May 17, 2026
Python

someonehereexists / AI-Arena---Benchmarking-Platform-for-Autonomous-AI-Agents

AI Arena is a competitive evaluation framework where multiple AI agents answer the same set of questions under identical conditions. Their performance is scored, ranked, and tracked over time using two complementary metrics: AIQ and Elo.

python open-source benchmarking machine-learning ai leaderboard model-evaluation evaluation-framework ai-agents fastapi ai-platform llm llm-benchmarking agent-evaluation agent-benchmark

Updated Apr 5, 2026
Python

justindobbs / awesome-certified-agents

A community catalog of autonomous agents and bundles certified by passing TraceCore deterministic episode runs in public CI

open-source benchmarking evaluation multi-agent deterministic ai-agents developer-tools-test agent-benchmark tracecore

Updated Mar 7, 2026
Python

SahilKumar75 / mario-the-plumber

OpenEnv benchmark for broken ELT/ETL pipeline repair, online recovery, and temporal orchestration.

reinforcement-learning etl data-engineering fastapi huggingface-space agent-benchmark openenv

Updated Apr 12, 2026
Python

yessasvini23 / Inbox_Ops

AI benchmark for real-world inbox prioritization and decision-making

reinforcement-learning decision-making task-planning ai-agents llm-agents agent-benchmark openenv

Updated Apr 12, 2026
Python

camerasearch / fieldopsbench

Multimodal evaluation benchmark for AI agents in real-world field operations across 16 trades (HVAC, electrical, plumbing, roofing, solar, mining, oil & gas, marine, telecom, automotive, construction, and more). 194 cases; scores retrieval, code citation, jurisdiction, safety, trajectory, multi-turn, speed; 5-layer contamination defense.

benchmark evaluation electrical hvac trades ai-safety plumbing contamination-detection multimodal code-compliance huggingface-datasets vision-language-model llm-evaluation field-operations agent-benchmark

Updated Apr 19, 2026
Python

MohamedEmad219 / ai-agents-reality-check

🤖 Benchmark AI agent capabilities, bridging the gap between hype and reality with clear metrics and insights for informed development decisions.

python open-source benchmarking reproducible-research statistical-analysis performance-testing network-resilience llm-agent llm-tools agent-architecture agentic-ai agent-performance agent-evaluation ai-benchmarking agent-benchmark reality-check-ai-agent architectural-evaluation ensemble-coordination

Updated May 18, 2026
Python

hzang12345-ship-it / hermes-swarm-benchmark

Concurrent-agent benchmark suite packaged as a Hermes skill. Outputs a Markdown REPORT.md.

benchmark hermes agentic-ai agent-skills agent-benchmark ai-harness hermes-skill swarm-benchmark concurrent-agents

Updated May 9, 2026
Python

1Utkarsh1 / agentproof

Developer-first agent benchmark orchestration, scoring, and reporting.

benchmark evaluation developer-tools ai-agents llm agent-benchmark

Updated May 13, 2026
Python

Siddharthjagtap346 / veritasops

VeritasOps is a real-world OpenEnv benchmark for training and evaluating AI agents on misinformation moderation, claim verification, spread control, and content safety decision-making.

docker simulation openai agents fact-checking ai-safety ai-agents misinformation trust-and-safety content-moderation fastapi huggingface-spaces llm-evaluation agent-benchmark openenv openenv-environment

Updated Apr 6, 2026
Python

agent-benchmark

Here are 21 public repositories matching this topic...
