feat: AutoResearch framework + V2EX test suite (60 tasks, SKILL.md optimization)#717

Merged
jackwener merged 3 commits into main from feat/v2ex-autoresearch
Apr 3, 2026

jackwener (Owner) commented Apr 2, 2026

AutoResearch Framework + V2EX/Zhihu Test Suites + 10 Rounds Iteration

AutoResearch Framework (Karpathy-style)

  • engine.ts — 8-phase autonomous loop
  • config.ts + logger.ts — typed config + TSV logging
  • commands/run.ts, plan.ts, fix.ts, debug.ts
  • Presets: operate-reliability, skill-quality, v2ex, zhihu, combined
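The TSV logging mentioned above could look roughly like the following. This is a minimal sketch, not the actual logger.ts: the column names, the `ResultRow` shape, and the `TsvLog` class are assumptions made for illustration, and the log is kept in memory here rather than appended to a file.

```typescript
// Hypothetical row shape; the real logger.ts columns are not shown in this PR.
interface ResultRow {
  iteration: number;
  metric: number;
  status: "keep" | "discard";
  note: string;
}

const COLUMNS = ["iteration", "metric", "status", "note"] as const;

// Tabs and newlines inside a field would break the TSV layout, so flatten them to spaces.
function sanitize(value: string | number): string {
  return String(value).replace(/[\t\r\n]+/g, " ");
}

// Render one row as a single tab-separated line, in COLUMNS order.
function toTsvLine(row: ResultRow): string {
  return COLUMNS.map((col) => sanitize(row[col])).join("\t");
}

// Append-only in-memory log: rows are only ever added, never rewritten.
class TsvLog {
  private lines: string[] = [COLUMNS.join("\t")];

  append(row: ResultRow): void {
    this.lines.push(toTsvLine(row));
  }

  toString(): string {
    return this.lines.join("\n") + "\n";
  }
}

const tsvLog = new TsvLog();
tsvLog.append({
  iteration: 1,
  metric: 178,
  status: "keep",
  note: "fix extract-npm-description selector",
});
```

An append-only format keeps every iteration visible, including discarded ones, which is useful when reviewing how the metric moved across rounds.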

Test Suites — 194 total tasks, 100% pass rate

| Suite | Tasks | Score |
| --- | --- | --- |
| Browse (multi-site) | 59 | 59/59 |
| V2EX (7 layers + edge cases) | 70 | 70/70 |
| Zhihu (8 layers + edge cases) | 65 | 65/65 |
| Combined | 194 | 194/194 |

Layer 2 E2E Results (Claude Code + SKILL.md)

| Task | Site | Turns | Cost |
| --- | --- | --- | --- |
| Hot extract | V2EX | 4 | $0.21 |
| Click + read | V2EX | 5 | $0.23 |
| Multi-step | V2EX | 7 | $0.32 |
| Form type | V2EX | 9 | $0.32 |
| Deep chain | V2EX | 7 | $0.27 |
| Hot extract | Zhihu | 4 | $0.24 |
| Question read | Zhihu | 8 | $0.34 |
| Search extract | Zhihu | 6 | $0.20 |
| Deep chain | Zhihu | 7 | $0.25 |
| Multi page | Zhihu | 9 | $0.27 |

Average: 6.6 turns, $0.27/task (before SKILL.md optimization: 21 turns)
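The averages can be reproduced from the ten rows of the Layer 2 table. The sketch below is only a worked check of that arithmetic; the variable names are illustrative, and costs are held in cents to avoid floating-point rounding noise.

```typescript
// [turns, cost in cents] for the ten Layer 2 tasks, copied from the table above.
const layer2Runs: Array<[number, number]> = [
  [4, 21], [5, 23], [7, 32], [9, 32], [7, 27], // V2EX
  [4, 24], [8, 34], [6, 20], [7, 25], [9, 27], // Zhihu
];

const totalTurns = layer2Runs.reduce((sum, [turns]) => sum + turns, 0);
const totalCents = layer2Runs.reduce((sum, [, cents]) => sum + cents, 0);

const avgTurns = totalTurns / layer2Runs.length; // 66 / 10 = 6.6
const avgCents = totalCents / layer2Runs.length; // 265 / 10 = 26.5, i.e. $0.265, reported as $0.27
```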

10 Rounds of AutoResearch Iteration

| Round | Action | Result |
| --- | --- | --- |
| 1 | Fix extract-npm-description selector | 177→178 |
| 2 | Fix nav-click-link-example + IMDB selector | 178→179 |
| 3 | Add 10 edge cases (SPA nav, lazy load, timing) | 179→189 |
| 4-5 | Add 5 agent-style tasks (state+click+type) | 189→194 |
| 6-8 | Layer 2 E2E efficiency evaluation | 16/16 pass |
| 9-10 | Analyze viewport coverage, verify all green | 194/194 |

SKILL.md Optimization

  • Aggressive chaining rules (open+state, type+type+click)
  • Minimize-turns guidance (target 3-5 per task)
  • -67% turns, -59% cost vs baseline
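To illustrate the chaining rules, here is a small helper that joins several shell steps with `&&`, so that an agent issues one tool call instead of one per step. It is a sketch only: `browser-cli` and its arguments are placeholders, not the project's real command syntax, and `chain` is a hypothetical helper rather than anything in SKILL.md.

```typescript
// Join shell steps with && so later steps run only if earlier ones succeed,
// and the whole sequence costs a single tool call.
function chain(...steps: string[]): string {
  const cleaned = steps.map((s) => s.trim()).filter((s) => s.length > 0);
  if (cleaned.length === 0) throw new Error("chain() needs at least one step");
  return cleaned.join(" && ");
}

// "browser-cli" is a placeholder command name used only for this example.
const openThenState = chain("browser-cli open <url>", "browser-cli state");
const typeTypeClick = chain(
  "browser-cli type <field-1> <text-1>",
  "browser-cli type <field-2> <text-2>",
  "browser-cli click <submit>",
);
```

The two constants correspond to the open+state and type+type+click patterns named in the list; each collapses what would otherwise be two or three turns into one.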

jackwener merged commit 37f1b46 into main Apr 3, 2026
11 checks passed
just-buer pushed a commit to just-buer/opencli that referenced this pull request Apr 8, 2026
…timization) (jackwener#717)

* feat: AutoResearch framework + V2EX test suite (40 tasks)

AutoResearch framework (Karpathy-style autonomous iteration):
- engine.ts: 8-phase loop (review → modify → commit → verify → guard → decide → log)
- config.ts: typed config + CLI parser + metric extraction
- logger.ts: TSV append-only results log
- commands/run.ts: main loop spawning Claude Code per iteration
- commands/plan.ts: interactive config wizard
- commands/fix.ts: auto-detect broken state, iteratively fix
- commands/debug.ts: hypothesis-driven debugging for failing tasks
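The loop described for engine.ts can be sketched as an ordered list of phases run once per iteration. This is an assumption-laden illustration, not the real engine: the commit message names seven phases for the "8-phase" loop, so only those seven appear here, and `IterationState`, `runIteration`, and the keep-if-improved rule in `decide` are hypothetical.

```typescript
// Phase names copied from the commit message, in the order given there.
const PHASES = ["review", "modify", "commit", "verify", "guard", "decide", "log"] as const;
type Phase = (typeof PHASES)[number];

// Hypothetical per-iteration state; field names are illustrative only.
interface IterationState {
  iteration: number;
  metric: number | null; // set by "verify", e.g. number of passing tasks
  best: number;          // best metric seen before this iteration
  guardOk: boolean;      // set by "guard", e.g. unrelated checks still pass
  kept: boolean;         // set by "decide"
  trace: Phase[];        // phases visited, in order
}

type Handler = (state: IterationState) => void;

// Run every phase in order, calling a handler where one is provided.
function runIteration(
  state: IterationState,
  handlers: Partial<Record<Phase, Handler>>,
): IterationState {
  for (const phase of PHASES) {
    state.trace.push(phase);
    handlers[phase]?.(state);
  }
  return state;
}

// Example: one iteration that raises the metric from 177 to 178.
const loopResult = runIteration(
  { iteration: 1, metric: null, best: 177, guardOk: false, kept: false, trace: [] },
  {
    verify: (s) => { s.metric = 178; },
    guard: (s) => { s.guardOk = true; },
    // Assumed rule: keep the change only if the guard passed and the metric improved.
    decide: (s) => { s.kept = s.guardOk && s.metric !== null && s.metric > s.best; },
  },
);
```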

V2EX test suite (5 layers, 40 tasks):
- L1 Atomic (10): open, state, click, scroll, eval, back, wait
- L2 Single Page (10): hot topics, node list, topic meta, pagination
- L3 Multi-Step (10): click-read, navigate-node, tab-then-topic, pagination
- L4 Write Ops (5): reply typing, favorite detection, form detection
- L5 Complex Chain (5): cross-page collect, multi-node compare, full workflow
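The layer breakdown above maps naturally onto a small typed table. The sketch below assumes nothing about the real suite format; `LayerSpec` and its field names are illustrative, and only the layer names and counts come from the commit message.

```typescript
// Layer names and counts are taken from the commit message; the field names are illustrative.
interface LayerSpec {
  id: "L1" | "L2" | "L3" | "L4" | "L5";
  name: string;
  taskCount: number;
}

const v2exLayers: LayerSpec[] = [
  { id: "L1", name: "Atomic", taskCount: 10 },
  { id: "L2", name: "Single Page", taskCount: 10 },
  { id: "L3", name: "Multi-Step", taskCount: 10 },
  { id: "L4", name: "Write Ops", taskCount: 5 },
  { id: "L5", name: "Complex Chain", taskCount: 5 },
];

// 10 + 10 + 10 + 5 + 5 = 40, matching the "40 tasks" in the heading.
const v2exTotal = v2exLayers.reduce((sum, layer) => sum + layer.taskCount, 0);
```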

Presets: operate-reliability, skill-quality, v2ex-reliability

* test: V2EX test suite 60/60 — fix selectors, add harder tasks

- Fix v2ex-collect-hot-authors selector (pathname-based member link detection)
- Fix v2ex-wait-text judge (accept "appeared")
- Fix trailing commas in eval step strings
- Add 20 harder tasks: state+click interaction + long chain workflows
- Baseline: 60/60 across all layers
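The first fix mentions pathname-based member link detection. One way this could work is sketched below, under the assumption that member profile links have a pathname of the form `/member/<username>`; the function names and the sample usernames are hypothetical and are not taken from the test suite.

```typescript
// Assumption: member profile links have a pathname of the form /member/<username>.
const MEMBER_PATH = /^\/member\/([^/]+)\/?$/;

// Returns the username for a member-profile pathname, or null for any other path.
function memberFromPathname(pathname: string): string | null {
  const match = MEMBER_PATH.exec(pathname);
  return match ? match[1] : null;
}

// Collect unique member names from link pathnames, preserving first-seen order.
function collectAuthors(pathnames: string[]): string[] {
  const seen = new Set<string>();
  for (const p of pathnames) {
    const name = memberFromPathname(p);
    if (name !== null) seen.add(name);
  }
  return Array.from(seen);
}

// Sample pathnames (invented for the example): two member links, one duplicate, two other pages.
const sampleAuthors = collectAuthors([
  "/member/alice",
  "/t/123",
  "/member/bob/",
  "/member/alice",
  "/go/python",
]);
```

Matching on the pathname rather than on link text or CSS classes tends to survive markup changes, which is presumably why the selector was rewritten this way.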

* docs: optimize SKILL.md for efficiency — aggressive chaining, minimize turns

- Add Rule #7: minimize total tool calls (3-5 per task, not 15-20)
- Strengthen Rule #5: chain aggressively with &&
- Add explicit good/bad chaining examples
- Add click+wait+state chaining pattern
- Add type+verify chaining pattern

Before: 21 turns for complex V2EX reply task
After: 12 turns for same task (-43% turns, -28% cost)
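The turn reduction follows directly from the before and after counts. As a worked check (the helper name is illustrative):

```typescript
// Relative change from a baseline, rounded to the nearest whole percent.
function percentChange(before: number, after: number): number {
  return Math.round(((after - before) / before) * 100);
}

// (12 - 21) / 21 = -0.4286, which rounds to -43, matching the reported -43% turns.
const turnChange = percentChange(21, 12);
```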