🔬 Agent Harness Benchmark

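Benchmark the harness, not the model.

Most AI benchmarks fix the harness and plug in different models to compare scores. But an agent's real-world performance = model × harness. If you built your own agent by hand, the question you need to answer is not "how strong is my model?" but:

How much of the model's potential does my harness design actually realize?

This project exists to answer that question: fix the model, compare different harnesses, and let the data speak.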

Key result

On the same base model (Claude Sonnet), different harness designs produced pass rates of 0% → 37% → 80%:

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   Group A (bare model, no tools)  ░░░░░░░░░░░░░░░░░░░░   0% │
│                                                             │
│   Group B (self-built harness)    ███████░░░░░░░░░░░░░  37% │
│                                                             │
│   Group C (official Claude.ai)    ████████████████░░░░  80% │
│                                                             │
└─────────────────────────────────────────────────────────────┘

30 tasks · same model · difference = harness design

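Experiment design

Three groups

Group	Architecture	Tools	Search
A: bare model	Single-turn API call	None	None
B: self-built harness	8-turn agentic loop	web_search, fetch_url, run_code	5-engine fallback
C: Claude.ai	Anthropic's official product	Search, code, file analysis	Anthropic built-in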

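Test set

30 tasks drawn from two public datasets:

GAIA (20 tasks): multi-step reasoning + tool use
FRAMES (10 tasks): multi-step constrained reasoning

Key design choice: Group A is first run on the full candidate pool, and only the tasks Group A gets wrong are kept. This guarantees that every task is discriminative. Group A is a 0-point baseline by construction, so whatever Groups B and C score is the net gain contributed by the harness.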
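The filtering step described above can be sketched as follows. This is a minimal illustration rather than the repository's actual seed_tasks.mjs: the record fields `id` and `passed` and the function name `selectDiscriminativeTasks` are assumptions made for the example.

```javascript
// Hypothetical sketch: keep only the tasks the bare model (Group A) failed.
// The field names `id` and `passed` are assumptions, not the repo's schema.
function selectDiscriminativeTasks(tasks, baselineResults) {
  // Ids of every task Group A attempted, and of those it answered correctly.
  const attempted = new Set(baselineResults.map((r) => r.id));
  const solvedByBaseline = new Set(
    baselineResults.filter((r) => r.passed).map((r) => r.id)
  );
  // A task survives only if the baseline attempted it and got it wrong.
  return tasks.filter(
    (t) => attempted.has(t.id) && !solvedByBaseline.has(t.id)
  );
}

// Illustrative data only.
const candidateTasks = [{ id: "gaia-1" }, { id: "gaia-2" }, { id: "frames-1" }];
const baselineResults = [
  { id: "gaia-1", passed: true },
  { id: "gaia-2", passed: false },
  { id: "frames-1", passed: false },
];
const kept = selectDiscriminativeTasks(candidateTasks, baselineResults);
console.log(kept.map((t) => t.id)); // [ 'gaia-2', 'frames-1' ]
```

Because of this selection, Group A's 0% is a property of the test set rather than a measurement, which is worth keeping in mind when reading the tables.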

Key findings

1. The incremental value of harness design is quantifiable

Dataset	Group A	Group B	Group C
GAIA L1 (15 tasks)	0%	53%	87%
GAIA L2 (5 tasks)	0%	40%	60%
FRAMES (10 tasks)	0%	10%	80%
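The headline 37% and 80% follow from the per-dataset rows once each percentage is converted back to a task count. The sketch below does that arithmetic. The per-dataset pass counts are inferred by rounding the published percentages, which is an assumption, since the table reports percentages rather than counts.

```javascript
// Per-dataset rows from the table above: task count and pass rate (%).
const rows = [
  { dataset: "GAIA L1", tasks: 15, b: 53, c: 87 },
  { dataset: "GAIA L2", tasks: 5, b: 40, c: 60 },
  { dataset: "FRAMES", tasks: 10, b: 10, c: 80 },
];

// Convert a percentage back to a whole number of passed tasks (assumed rounding).
const passedCount = (tasks, pct) => Math.round((tasks * pct) / 100);

function overallPassRate(rows, group) {
  const total = rows.reduce((sum, r) => sum + r.tasks, 0);
  const passed = rows.reduce(
    (sum, r) => sum + passedCount(r.tasks, r[group]),
    0
  );
  return { passed, total, pct: Math.round((100 * passed) / total) };
}

console.log(overallPassRate(rows, "b")); // { passed: 11, total: 30, pct: 37 }
console.log(overallPassRate(rows, "c")); // { passed: 24, total: 30, pct: 80 }
```

Under that rounding, the gap between Groups B and C is 13 of 30 tasks, or about 43 percentage points.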

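2. Group B vs. Group C: where the 43-point gap comes from

Failure cause	Tasks	Improvability
Insufficient search quality	5	High: switch to a better search engine
Broken multi-step reasoning chain	5	Medium: refine the prompt strategy
Inefficient search strategy	3	Medium: add early termination
Search never triggered	2	High: add a mandatory-search instruction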

3. The counterintuitive relationship between search count and pass rate

0-2 tool calls:  ████████████████████  71% passed
3-5 tool calls:  ████████████░░░░░░░░  44% passed
6-9 tool calls:  ████░░░░░░░░░░░░░░░░  14% passed
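A breakdown like the one above can be recomputed from per-task results. The sketch assumes each record carries a boolean `passed` (a hypothetical field) and the `tool_trace` array from the runner output, with one entry per tool call. The sample records are made up for illustration and are not the benchmark's data.

```javascript
// Bucket boundaries taken from the chart above: 0-2, 3-5, 6-9 tool calls.
const BUCKETS = [
  { label: "0-2", min: 0, max: 2 },
  { label: "3-5", min: 3, max: 5 },
  { label: "6-9", min: 6, max: 9 },
];

function passRateByToolCalls(results) {
  return BUCKETS.map(({ label, min, max }) => {
    // Assumption: tool_trace has one entry per tool call.
    const inBucket = results.filter((r) => {
      const calls = (r.tool_trace ?? []).length;
      return calls >= min && calls <= max;
    });
    const passed = inBucket.filter((r) => r.passed).length;
    return {
      label,
      total: inBucket.length,
      passed,
      // null when the bucket is empty, to avoid dividing by zero.
      rate: inBucket.length ? passed / inBucket.length : null,
    };
  });
}

// Illustrative records only, not the benchmark's results.
const sampleResults = [
  { passed: true, tool_trace: [] },
  { passed: true, tool_trace: [{}, {}] },
  { passed: false, tool_trace: [{}, {}, {}, {}] },
  { passed: false, tool_trace: Array(7).fill({}) },
];
console.log(passRateByToolCalls(sampleResults));
```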

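The more the agent searches, the lower its pass rate. This does not mean search is useless. Rather, many rounds of searching without finding the answer indicate that the search strategy has already gone off track. The current harness has no "stop-loss" mechanism to cut a failing search short.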
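One possible shape for such a stop-loss check is sketched below. It is not part of the current harness: the trace entry fields `tool` and `query`, the default thresholds, and the function name `checkStopLoss` are all assumptions chosen for illustration.

```javascript
// Hypothetical stop-loss check, meant to be called before each new tool call
// in an agentic loop. Entry fields `tool` and `query` are assumed.
function checkStopLoss(toolTrace, { maxCalls = 5, maxRepeats = 2 } = {}) {
  // Rule 1: a hard budget on the total number of tool calls.
  if (toolTrace.length >= maxCalls) {
    return { stop: true, reason: "tool_budget_exhausted" };
  }
  // Rule 2: the same search query issued more than maxRepeats times
  // suggests the agent is looping rather than making progress.
  const counts = new Map();
  for (const entry of toolTrace) {
    if (entry.tool !== "web_search") continue;
    const key = String(entry.query ?? "").trim().toLowerCase();
    const n = (counts.get(key) ?? 0) + 1;
    counts.set(key, n);
    if (n > maxRepeats) return { stop: true, reason: "repeated_query" };
  }
  return { stop: false, reason: null };
}

// Example: three near-identical searches trip the repeated-query rule.
const loopingTrace = [
  { tool: "web_search", query: "Rubik's cube notation" },
  { tool: "web_search", query: "rubik's cube notation " },
  { tool: "web_search", query: " Rubik's Cube Notation" },
];
console.log(checkStopLoss(loopingTrace, { maxCalls: 10 }));
```

When the check returns `stop: true`, a harness could either ask the model to answer from what it has already gathered or prompt it to change strategy. Which of the two works better is exactly the kind of question this benchmark is meant to answer.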

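4. The self-built harness sometimes wins

Group B beat Claude.ai on 2 tasks (Rubik's cube reasoning and article statistics). The sample is too small to draw conclusions, but it points to an interesting design trade-off: Claude.ai tends to let the model reason on its own, whereas Group B is more "tools first", and on tasks that need hard verification the tools-first approach comes out ahead. "More like Claude.ai" ≠ "always better".

For the full experimental analysis, see the blog post: Quantifying the Gap Between Me and Claude.ai with 30 Tasks.

Methodology white paper: Agent Harness Engineering White Paper.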

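Quick reproduction

# 1. Configure API keys
cp .env.example .env
# Edit .env and fill in your Anthropic API key and at least one search engine key

# 2. Run Group A (bare model; confirms the baseline is 0)
./scripts/run.sh --group a --all

# 3. Run Group B (your harness)
./scripts/run.sh --group b --all

# 4. Generate the comparison report
./scripts/report.sh results/baseline_xxx results/harness_xxx

# 5. Run Group C (manual testing on Claude.ai)
# Open c_test.html in a browser, then test each task on Claude.ai and record the result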

Project structure

├── tasks_selected/            # The 30 selected test tasks
├── scripts/
│   ├── run.sh                 # Main run script
│   ├── baseline_runner.mjs    # Group A: single-turn API call, no tools
│   ├── harness_runner.mjs     # Group B: agentic loop + tools
│   ├── report.sh              # Generates the comparison report
│   ├── seed_tasks.mjs         # Exports tasks from the datasets
│   └── collect_failure.sh     # Collects failure cases from real-world use
├── results/                   # Experiment result data
│   ├── harness_*/             # Group B run results
│   └── claude_ai_results_template.jsonl
├── c_test.html                # Interactive test page for Group C
├── claude_ai_results.jsonl    # Group C test results
├── comparison_report.md       # A/B/C three-way comparison
├── REPORT.md                  # Detailed evaluation report
├── prompts/                   # System prompts
└── .env.example               # API key template

Plugging in your own harness

The framework is designed so that the harness is swappable:

Edit agenticSystemPrompt and tools in scripts/harness_runner.mjs
Or write an entirely new runner, as long as it emits compatible JSON: { result, input_tokens, output_tokens, duration_seconds, tool_trace }
Run ./scripts/run.sh --group b --all to compare against the baseline
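A replacement runner only has to produce the JSON shape listed above. The sketch below wraps an arbitrary agent function and assembles that object. The `runAgent` callback, its return shape, and the format of the `tool_trace` entries are assumptions; only the five top-level keys come from this README. How run.sh consumes the output is not specified here, so printing to stdout is also an assumption.

```javascript
// Hypothetical adapter: wraps any agent function into the expected output shape.
async function runTask(question, runAgent) {
  const started = Date.now();
  // Assumed contract: runAgent resolves to { answer, usage, toolTrace }.
  const { answer, usage, toolTrace } = await runAgent(question);
  return {
    result: answer,
    input_tokens: usage.input_tokens,
    output_tokens: usage.output_tokens,
    duration_seconds: (Date.now() - started) / 1000,
    tool_trace: toolTrace,
  };
}

// Stub agent for demonstration; a real harness would call the model and tools here.
const stubAgent = async (question) => ({
  answer: `echo: ${question}`,
  usage: { input_tokens: 12, output_tokens: 3 },
  toolTrace: [{ tool: "web_search", query: question }],
});

runTask("What is 2 + 2?", stubAgent).then((out) => {
  // One JSON object per task, printed to stdout.
  console.log(JSON.stringify(out));
});
```

Keeping the adapter separate from the agent logic means that a prompt change, a tool change, or a full rewrite of the loop can all be compared through the same output format.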

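Changing one line of the system prompt can be worth several percentage points, and changing the search engine can be worth ten or more. That is what this framework helps you answer: what changed, how much it gained, and where the remaining gap is.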

Dataset sources

Dataset	Link	License
GAIA	gaia-benchmark/GAIA	CC BY 4.0
FRAMES	google/frames-benchmark	Apache 2.0

Citation

If you find this project useful:

@misc{agent-harness-benchmark,
  title={Agent Harness Benchmark: Fixed Model, Comparing Harnesses},
  year={2026},
  url={https://github.com/piglet12138/agent-research}
}

License

MIT

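Related articles

Article	Description
Agent Harness Engineering White Paper	A systematic breakdown of the harness designs of Claude.ai and Claude Code, proposing an eval-driven iteration methodology
Quantifying the Gap Between Me and Claude.ai with 30 Tasks	The full analysis of this repository's experiment: three-group comparison, per-task attribution, and an improvement roadmap
Lite Claude UI	The implementation of the Group B harness: a lightweight AI agent workbench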
