feat: AutoResearch framework + V2EX test suite (60 tasks, SKILL.md optimization) by jackwener · Pull Request #717 · jackwener/OpenCLI

jackwener · 2026-04-02T20:31:19Z

AutoResearch Framework + V2EX/Zhihu Test Suites + 10 Rounds Iteration

AutoResearch Framework (Karpathy-style)

engine.ts — 8-phase autonomous loop
config.ts + logger.ts — typed config + TSV logging
commands/run.ts, plan.ts, fix.ts, debug.ts
Presets: operate-reliability, skill-quality, v2ex, zhihu, combined
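
The loop can be pictured roughly as below. This is a minimal sketch under stated assumptions, not the code in engine.ts: the phase names are taken from this PR's commit message (which lists seven names for the 8-phase loop), while `IterationState`, `runIteration`, `toTsvRow` and the TSV column layout are hypothetical names chosen for illustration.

```typescript
// Phase names as listed in the PR's commit message; everything else is hypothetical.
const PHASES = ["review", "modify", "commit", "verify", "guard", "decide", "log"] as const;
type Phase = (typeof PHASES)[number];

interface IterationState {
  iteration: number;
  baseline: number; // metric before this iteration, e.g. passing tasks
  metric: number;   // metric after the change was verified
  kept: boolean;    // whether the change survived the "decide" phase
  trace: Phase[];   // phases executed so far, in order
}

type PhaseHandler = (state: IterationState) => IterationState;

// Run every phase once, in order, threading the state through each handler.
function runIteration(
  start: IterationState,
  handlers: Partial<Record<Phase, PhaseHandler>>,
): IterationState {
  let state = start;
  for (const phase of PHASES) {
    const handler = handlers[phase];
    if (handler) state = handler(state);
    state = { ...state, trace: [...state.trace, phase] };
  }
  return state;
}

// One row per iteration for an append-only TSV log (column layout is an assumption).
function toTsvRow(state: IterationState): string {
  return [state.iteration, state.metric, state.kept ? "keep" : "discard"].join("\t");
}

const result = runIteration(
  { iteration: 1, baseline: 177, metric: 177, kept: false, trace: [] },
  {
    verify: (s) => ({ ...s, metric: 178 }),               // pretend one more task passes
    decide: (s) => ({ ...s, kept: s.metric > s.baseline }), // keep only if the metric improved
  },
);
console.log(toTsvRow(result));
```

Keeping each phase as a plain function of the state makes it easy to log every iteration as one TSV row and to discard a change when the metric does not improve.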

Test Suites — 194 total tasks, 100% pass rate

Suite	Tasks	Score
Browse (multi-site)	59	59/59
V2EX (7 layers + edge cases)	70	70/70
Zhihu (8 layers + edge cases)	65	65/65
Combined	194	194/194

Layer 2 E2E Results (Claude Code + SKILL.md)

Task	Site	Turns	Cost
Hot extract	V2EX	4	$0.21
Click + read	V2EX	5	$0.23
Multi-step	V2EX	7	$0.32
Form type	V2EX	9	$0.32
Deep chain	V2EX	7	$0.27
Hot extract	Zhihu	4	$0.24
Question read	Zhihu	8	$0.34
Search extract	Zhihu	6	$0.20
Deep chain	Zhihu	7	$0.25
Multi page	Zhihu	9	$0.27

Average: 6.6 turns, $0.27/task (before SKILL.md optimization: 21 turns)
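
The averages can be re-derived from the ten rows of the table. The snippet below is only a worked check; the `runs` array is copied from the table, and the `mean` helper is a name introduced here.

```typescript
// Rows copied from the Layer 2 E2E results table.
const runs = [
  { task: "Hot extract", site: "V2EX", turns: 4, cost: 0.21 },
  { task: "Click + read", site: "V2EX", turns: 5, cost: 0.23 },
  { task: "Multi-step", site: "V2EX", turns: 7, cost: 0.32 },
  { task: "Form type", site: "V2EX", turns: 9, cost: 0.32 },
  { task: "Deep chain", site: "V2EX", turns: 7, cost: 0.27 },
  { task: "Hot extract", site: "Zhihu", turns: 4, cost: 0.24 },
  { task: "Question read", site: "Zhihu", turns: 8, cost: 0.34 },
  { task: "Search extract", site: "Zhihu", turns: 6, cost: 0.2 },
  { task: "Deep chain", site: "Zhihu", turns: 7, cost: 0.25 },
  { task: "Multi page", site: "Zhihu", turns: 9, cost: 0.27 },
];

// Arithmetic mean of a non-empty list of numbers.
const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;

const avgTurns = mean(runs.map((r) => r.turns)); // 66 turns over 10 tasks
const avgCost = mean(runs.map((r) => r.cost));   // about 2.65 dollars over 10 tasks

console.log(avgTurns.toFixed(1), avgCost.toFixed(3));
```

The turns total 66 across 10 tasks (6.6 per task) and the costs total $2.65 (about $0.265 per task, which rounds to the reported $0.27).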

10 Rounds of AutoResearch Iteration

Round	Action	Result
1	Fix extract-npm-description selector	177→178
2	Fix nav-click-link-example + IMDB selector	178→179
3	Add 10 edge cases (SPA nav, lazy load, timing)	179→189
4-5	Add 5 agent-style tasks (state+click+type)	189→194
6-8	Layer 2 E2E efficiency evaluation	16/16 pass
9-10	Analyze viewport coverage, verify all green	194/194

SKILL.md Optimization

Aggressive chaining rules (open+state, type+type+click)
Minimize-turns guidance (target 3-5 per task)
-67% turns, -59% cost vs baseline
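
The chaining rules amount to issuing several steps in one shell invocation joined with `&&`, so a single agent turn covers what would otherwise take several. A minimal sketch of the idea follows; the `chain` helper is hypothetical and the step strings are placeholders, not actual OpenCLI command syntax.

```typescript
// Join several steps into one shell command so they run in a single agent turn.
// In a POSIX shell, `&&` stops at the first failing step, so later steps
// never run against a page that failed to load or a click that did not land.
function chain(steps: string[]): string {
  const cleaned = steps.map((s) => s.trim()).filter((s) => s.length > 0);
  if (cleaned.length === 0) throw new Error("chain() needs at least one step");
  return cleaned.join(" && ");
}

// Placeholder step strings (NOT real CLI syntax), mirroring the
// "open+state" and "type+type+click" patterns named in the list above.
const openAndState = chain(["<open page>", "<read state>"]);
const fillAndSubmit = chain(["<type field A>", "<type field B>", "<click submit>"]);

console.log(openAndState);
console.log(fillAndSubmit);
```

Sent separately, the five steps above would cost five tool calls; chained, they cost two.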

…timization) (jackwener#717)

* feat: AutoResearch framework + V2EX test suite (40 tasks)

AutoResearch framework (Karpathy-style autonomous iteration):
- engine.ts: 8-phase loop (review → modify → commit → verify → guard → decide → log)
- config.ts: typed config + CLI parser + metric extraction
- logger.ts: TSV append-only results log
- commands/run.ts: main loop spawning Claude Code per iteration
- commands/plan.ts: interactive config wizard
- commands/fix.ts: auto-detect broken state, iteratively fix
- commands/debug.ts: hypothesis-driven debugging for failing tasks

V2EX test suite (5 layers, 40 tasks):
- L1 Atomic (10): open, state, click, scroll, eval, back, wait
- L2 Single Page (10): hot topics, node list, topic meta, pagination
- L3 Multi-Step (10): click-read, navigate-node, tab-then-topic, pagination
- L4 Write Ops (5): reply typing, favorite detection, form detection
- L5 Complex Chain (5): cross-page collect, multi-node compare, full workflow

Presets: operate-reliability, skill-quality, v2ex-reliability

* test: V2EX test suite 60/60 — fix selectors, add harder tasks

- Fix v2ex-collect-hot-authors selector (pathname-based member link detection)
- Fix v2ex-wait-text judge (accept "appeared")
- Fix trailing commas in eval step strings
- Add 20 harder tasks: state+click interaction + long chain workflows
- Baseline: 60/60 across all layers

* docs: optimize SKILL.md for efficiency — aggressive chaining, minimize turns

- Add Rule #7: minimize total tool calls (3-5 per task, not 15-20)
- Strengthen Rule #5: chain aggressively with &&
- Add explicit good/bad chaining examples
- Add click+wait+state chaining pattern
- Add type+verify chaining pattern

Before: 21 turns for complex V2EX reply task
After: 12 turns for same task (-43% turns, -28% cost)
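
The v2ex-collect-hot-authors fix is described as pathname-based member link detection. A minimal sketch of that idea follows, assuming V2EX profile links have a pathname of the form `/member/<username>`; `memberFromHref` and `collectAuthors` are hypothetical names, and this is not the selector used in the PR.

```typescript
// Assumption: a V2EX profile link has a pathname of the form /member/<username>.
// Matching on the parsed pathname ignores query strings, fragments, and
// whether the href is relative or absolute.
function memberFromHref(href: string, base = "https://www.v2ex.com"): string | null {
  let url: URL;
  try {
    url = new URL(href, base); // resolves relative hrefs such as "/member/alice"
  } catch {
    return null; // not a parseable URL
  }
  const match = /^\/member\/([^/]+)\/?$/.exec(url.pathname);
  return match?.[1] ?? null;
}

// Collect unique author names from a list of hrefs, preserving first-seen order.
function collectAuthors(hrefs: string[]): string[] {
  const seen = new Set<string>();
  for (const href of hrefs) {
    const name = memberFromHref(href);
    if (name !== null) seen.add(name);
  }
  return Array.from(seen);
}

const authors = collectAuthors([
  "/member/alice",
  "https://www.v2ex.com/member/bob?tab=replies",
  "/t/123#reply1", // topic link, not a member link
  "/member/alice", // duplicate
]);
console.log(authors);
```

Checking the parsed pathname rather than the raw href string is less brittle than matching on link text or CSS classes, which is presumably why the fix moved in that direction.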

jackwener added 3 commits April 3, 2026 03:13

jackwener merged commit 37f1b46 into main Apr 3, 2026
11 checks passed

jackwener merged 3 commits into main from feat/v2ex-autoresearch
