# evaluation-suite

Here is 1 public repository matching this topic...

Extensible benchmarking suite for evaluating AI coding agents on web search tasks. Compare native search vs MCP servers (You.com, expanding) across multiple agents (Claude Code, Gemini, Droid, Codex, expanding) with automated Docker workflows and statistical analysis.

  • Updated Jan 30, 2026
  • TypeScript
