-
Notifications
You must be signed in to change notification settings - Fork 29
Description
Overview
STATUS: ✅ COMPLETE - Technical implementation finished. Research study split to #201.
Implement real Harbor framework integration for the Terminal-Bench eval harness.
Completed: Harbor framework integration fully functional with real Terminal-Bench evaluations
Follow-on: Assessor refinement research study → #201
What Was Implemented ✅
1. Real Harbor Framework Integration
- ✅ Implemented
_real_tbench_result()with Harbor subprocess API - ✅ Harbor 2.0 result.json parsing
- ✅ Environment variable handling (ANTHROPIC_API_KEY, ANTHROPIC_AUTH_TOKEN)
- ✅ MiniMax API override fix (Harbor hardcoded issue)
- ✅ Trajectory file path capture
- ✅ Security fixes (path validation, timeout handling, input validation)
2. CLI Implementation
- ✅
agentready benchmarkcommand - ✅ Terminal-Bench smoketest and full benchmark support
- ✅ Model selection (claude-haiku-4-5, claude-sonnet-4-5)
- ✅ Timeout configuration
- ✅ Verbose output mode
- ✅
--skip-preflightoption for advanced users
3. Preflight Checks
- ✅ Automatic Harbor CLI detection
- ✅ Interactive installation prompts (uv/pip fallback)
- ✅ Terminal-Bench dataset management
- ✅ Graceful error handling
4. Tests
- ✅ TbenchResult model validation
- ✅ Harbor configuration tests
- ✅ Harbor services integration tests
- ✅ Preflight check tests (100% coverage)
5. Documentation
- ✅ Updated CLAUDE.md with Harbor integration details
- ✅ Preflight check documentation
- ✅ Usage examples in benchmark.py docstrings
What Was Split Out 📋
Assessor Refinement Research Study → Issue #201
The empirical research component (running 10-20 diverse repositories through Terminal-Bench to measure assessor impact) has been split into a separate issue. This was done because:
- Technical work complete: All infrastructure needed for the research study is now in place
- Different workflow: Research study is data collection + analysis, not implementation
- Separate timeline: Can be executed independently after Terminal-Bench Eval Harness - Phase 2: Real Harbor Framework Integration #190 merges
See #201 for the complete research study plan.
Success Criteria (Original vs Actual)
| Criteria | Status | Notes |
|---|---|---|
| Real benchmark runs work end-to-end | ✅ YES | Harbor integration fully functional |
| Tests pass | ✅ YES | All tests passing |
| Harbor framework integration | ✅ YES | Complete with security fixes |
| CLI with subset options | ✅ YES | smoketest/full support |
| Documentation updated | ✅ YES | CLAUDE.md updated |
| 📋 #201 | Split to follow-on research issue | |
| 📋 #201 | Split to follow-on research issue | |
docs/tbench/assessor-refinement-results.md |
📋 #201 | Split to follow-on research issue |
Key Implementation Details
Files Changed
Core Implementation:
src/agentready/services/eval_harness/tbench_runner.py- Harbor subprocess integrationsrc/agentready/services/eval_harness/harbor_config.py- Configuration modelsrc/agentready/cli/benchmark.py- CLI commandsrc/agentready/utils/preflight.py- Dependency checking
Tests:
tests/unit/test_harbor_*.py- Harbor integration teststests/unit/utils/test_preflight.py- Preflight check tests (100% coverage)
Documentation:
CLAUDE.md- Updated with Harbor integration, preflight checksdocs/tbench/methodology.md- A/B testing methodology
Usage Examples
# Quick smoketest (1-2 tasks, ~2-5 min)
export ANTHROPIC_API_KEY=your-key-here
agentready benchmark --subset smoketest
# Full Terminal-Bench with Sonnet (~30-40 min)
agentready benchmark --subset full --model claude-sonnet-4-5
# Skip preflight checks (advanced)
agentready benchmark --subset smoketest --skip-preflightEvidence of Completion
Harbor Integration Working:
- Real Harbor result.json files in
jobs/directory - Successful Terminal-Bench task configurations
- Trajectory file capture working
- All tests passing
Commits (17 total):
9e9cc32- feat: display Harbor command with copy/paste format389958f- fix: override Harbor's hardcoded MiniMax API configurationd6f583e- feat: display trajectory file path in benchmark summaryf9f6cfb- fix: set ANTHROPIC_AUTH_TOKEN for Harbor's Claude Code agent97b848f- feat: add automatic Harbor CLI preflight checksb91d11b- fix: correct Harbor results parsing (Harbor 2.0 structure)bdbabb2- fix(security): implement critical security fixes4d5649a- feat: add Harbor framework integration (initial)- ...and 9 more commits
Follow-On Work
Immediate:
- Terminal-Bench Assessor Refinement Research Study (Phase 2b) #201 - Assessor refinement research study (10-20 repos, statistical analysis)
Future:
- Dashboard integration for Terminal-Bench results
- Historical tracking of benchmark performance
- Leaderboard integration (if desired)
Resources
Implementation:
- Harbor Framework: https://harborframework.com/docs
- Terminal-Bench: https://tbench.ai
- Branch:
002-harbor-real-integration
Related Issues:
- Terminal-Bench Assessor Refinement Research Study (Phase 2b) #201 - Assessor Refinement Research Study (follow-on)
- feat: Terminal-Bench eval harness (MVP Phase 1) #178 - Terminal-Bench Eval Harness MVP (Phase 1, completed)
Labels: enhancement, terminal-bench, harbor-integration, completed