Evaluate AI agents with Unix-style pipeline commands. Schema-driven adapters for any CLI agent, trajectory capture, pass@k metrics, and multi-run comparison.
Updated Jan 30, 2026 - TypeScript
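The description above mentions pass@k metrics. For reference, here is a minimal TypeScript sketch of the standard unbiased pass@k estimator, 1 - C(n - c, k) / C(n, k), computed in its numerically stable product form; the function name and example numbers are illustrative and not taken from the tool itself.

```typescript
// Unbiased pass@k estimator: given n runs of a task of which c passed,
// estimate the probability that at least one of k randomly chosen runs passes.
// pass@k = 1 - C(n - c, k) / C(n, k), evaluated as a stable product.
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1.0; // fewer failures than k draws: at least one draw must pass
  let failAll = 1.0;
  for (let i = n - c + 1; i <= n; i++) {
    failAll *= 1 - k / i; // product form of C(n - c, k) / C(n, k)
  }
  return 1 - failAll;
}

// Example: 10 runs per task, 3 of which passed
console.log(passAtK(10, 3, 1)); // ≈ 0.3 (equals c / n for k = 1)
console.log(passAtK(10, 3, 5)); // ≈ 0.92
```

Averaging this estimate over all tasks in a benchmark gives the suite-level pass@k score that multi-run comparisons typically report.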
Extensible benchmarking suite for evaluating AI coding agents on web search tasks. Compare native search against MCP servers (You.com, with more planned) across multiple agents (Claude Code, Gemini, Droid, Codex, and more to come), using automated Docker workflows and statistical analysis.
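The description does not say which statistical analysis the suite applies. As one hedged possibility, the sketch below compares pass rates between two conditions (for example, native search vs. an MCP server) with a two-proportion z-test; the function name and sample counts are illustrative only, not the suite's documented method.

```typescript
// Compare success rates of two conditions with a two-proportion z-test.
// Returns the z-score; |z| > 1.96 roughly corresponds to p < 0.05 (two-sided).
function twoProportionZ(
  successA: number, totalA: number,
  successB: number, totalB: number
): number {
  const pA = successA / totalA;
  const pB = successB / totalB;
  const pPool = (successA + successB) / (totalA + totalB); // pooled proportion
  const se = Math.sqrt(pPool * (1 - pPool) * (1 / totalA + 1 / totalB));
  return (pA - pB) / se;
}

// Example: native search passed 42/60 tasks, MCP-backed search passed 51/60
console.log(twoProportionZ(42, 60, 51, 60).toFixed(2));
```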