Add GP earning benchmark (5-loop iterative)#37

Open
classicrob wants to merge 1 commit into MaxBittker:main from classicrob:gp-benchmark

@classicrob

Summary

  • New Harbor benchmark task gp-10k-ticks: agents earn as much gold as possible across 5 iterative loops
  • Each loop spawns a fresh sub-agent with no memory of earlier loops; only learnings.md and gp_results.json carry forward
  • Each loop gets 5 bots (level 50 in all skills, starting in Lumbridge with 0 coins) and a 10,000-tick limit per script
  • Adds gemini-flash (gemini-3-flash-preview), codex53 (gpt-5.3-codex), and kimi models to run.sh
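
The loop structure described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Handoff` shape, the `SubAgent` signature, and `runLoops` are hypothetical names, and the real handoff lives in the files learnings.md and gp_results.json rather than in memory.

```typescript
// Hypothetical sketch of the 5-loop handoff. All names and shapes here are
// assumptions for illustration; the PR's actual harness may differ.
interface LoopResult {
  loop: number;
  gp: number;
}

// Stand-in for the two files that carry forward between loops:
// learnings.md (free text) and gp_results.json (per-loop GP).
interface Handoff {
  learnings: string;
  results: LoopResult[];
}

// A sub-agent sees only the loop index and the handoff, nothing else.
type SubAgent = (
  loop: number,
  handoff: Readonly<Handoff>,
) => { gp: number; note: string };

const LOOPS = 5;

function runLoops(agent: SubAgent): Handoff {
  let handoff: Handoff = { learnings: "", results: [] };
  for (let loop = 1; loop <= LOOPS; loop++) {
    // Each call models a fresh sub-agent with no memory of earlier loops.
    const { gp, note } = agent(loop, handoff);
    // Build a new handoff object so earlier snapshots are never mutated.
    handoff = {
      learnings: handoff.learnings + `Loop ${loop}: ${note}\n`,
      results: [...handoff.results, { loop, gp }],
    };
  }
  return handoff;
}
```

Passing the handoff explicitly makes the "no memory" constraint visible in the signature: anything a later loop knows must have been written into `learnings` or `results` by an earlier one.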

New files

  • benchmark/shared/gp_loop_instruction.md — per-loop instruction for sub-agents
  • benchmark/shared/generate_gp_saves.ts — generates 25 save files (5 bots × 5 loops)
  • benchmark/shared/check_gp.ts — verifier that reads per-loop GP results + verifies inventory
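
To make the 5 bots × 5 loops arithmetic and the verifier's role concrete, here is a small sketch. It is an assumption-laden illustration, not the contents of generate_gp_saves.ts or check_gp.ts: the save naming scheme, the `SaveSpec` and `LoopReport` shapes, and the "credit the smaller of reported and inventory coins" policy are all hypothetical.

```typescript
// Hypothetical sketch; names, shapes, and the crediting policy are assumptions.
const LOOP_COUNT = 5;
const BOTS_PER_LOOP = 5;

interface SaveSpec {
  name: string; // assumed naming scheme, not taken from the PR
  loop: number;
  bot: number;
  level: number; // level 50 in all skills, per the PR summary
  coins: number; // bots start with 0 coins
  location: string;
}

// Enumerate one save per (loop, bot) pair: 5 × 5 = 25 saves.
function planSaves(): SaveSpec[] {
  const specs: SaveSpec[] = [];
  for (let loop = 1; loop <= LOOP_COUNT; loop++) {
    for (let bot = 1; bot <= BOTS_PER_LOOP; bot++) {
      specs.push({
        name: `gp-loop${loop}-bot${bot}`,
        loop,
        bot,
        level: 50,
        coins: 0,
        location: "Lumbridge",
      });
    }
  }
  return specs;
}

interface LoopReport {
  loop: number;
  reportedGp: number; // what the agent wrote to its results file
  inventoryCoins: number; // what is actually found in bot inventories
}

// One plausible verification policy (an assumption): never credit more GP
// than the inventory confirms.
function creditedGp(reports: LoopReport[]): number {
  return reports.reduce(
    (sum, r) => sum + Math.min(r.reportedGp, r.inventoryCoins),
    0,
  );
}
```

Checking reported GP against inventory guards against an agent overstating its earnings in the results file, which is presumably why the verifier reads both.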

Test plan

  • bun benchmark/generate-tasks.ts generates gp-10k-ticks/ with correct instruction, Dockerfile, and verifier
  • Run with benchmark/run.sh -t gp-10k-ticks -m gemini-flash -m codex53

🤖 Generated with Claude Code

New Harbor benchmark task where agents earn as much gold as possible.
5 loops with fresh sub-agents per loop, learnings.md as the handoff
document between loops. Each loop gets 5 bots (level 50 all skills)
and 10,000 game ticks per script. No pickpocketing.

Also adds gemini-flash, gpt-5.3-codex, and kimi models to run.sh.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>