Skip to content

Terminal-Bench Eval Harness - Phase 2: Real Harbor Framework Integration #190

@jeremyeder

Description

@jeremyeder

Overview

STATUS: ✅ COMPLETE - Technical implementation finished. Research study split to #201.

Implement real Harbor framework integration for the Terminal-Bench eval harness.

Completed: Harbor framework integration fully functional with real Terminal-Bench evaluations
Follow-on: Assessor refinement research study → #201


What Was Implemented ✅

1. Real Harbor Framework Integration

  • ✅ Implemented _real_tbench_result() with Harbor subprocess API
  • ✅ Harbor 2.0 result.json parsing
  • ✅ Environment variable handling (ANTHROPIC_API_KEY, ANTHROPIC_AUTH_TOKEN)
  • ✅ MiniMax API override fix (Harbor hardcoded issue)
  • ✅ Trajectory file path capture
  • ✅ Security fixes (path validation, timeout handling, input validation)

2. CLI Implementation

  • agentready benchmark command
  • ✅ Terminal-Bench smoketest and full benchmark support
  • ✅ Model selection (claude-haiku-4-5, claude-sonnet-4-5)
  • ✅ Timeout configuration
  • ✅ Verbose output mode
  • --skip-preflight option for advanced users

3. Preflight Checks

  • ✅ Automatic Harbor CLI detection
  • ✅ Interactive installation prompts (uv/pip fallback)
  • ✅ Terminal-Bench dataset management
  • ✅ Graceful error handling

4. Tests

  • ✅ TbenchResult model validation
  • ✅ Harbor configuration tests
  • ✅ Harbor services integration tests
  • ✅ Preflight check tests (100% coverage)

5. Documentation

  • ✅ Updated CLAUDE.md with Harbor integration details
  • ✅ Preflight check documentation
  • ✅ Usage examples in benchmark.py docstrings

What Was Split Out 📋

Assessor Refinement Research Study → Issue #201

The empirical research component (running 10-20 diverse repositories through Terminal-Bench to measure assessor impact) has been split into a separate issue. This was done because:

  1. Technical work complete: All infrastructure needed for the research study is now in place
  2. Different workflow: Research study is data collection + analysis, not implementation
  3. Separate timeline: Can be executed independently after Terminal-Bench Eval Harness - Phase 2: Real Harbor Framework Integration #190 merges

See #201 for the complete research study plan.


Success Criteria (Original vs Actual)

Criteria Status Notes
Real benchmark runs work end-to-end ✅ YES Harbor integration fully functional
Tests pass ✅ YES All tests passing
Harbor framework integration ✅ YES Complete with security fixes
CLI with subset options ✅ YES smoketest/full support
Documentation updated ✅ YES CLAUDE.md updated
10-20 benchmark runs on diverse repos 📋 #201 Split to follow-on research issue
Assessor refinement results 📋 #201 Split to follow-on research issue
docs/tbench/assessor-refinement-results.md 📋 #201 Split to follow-on research issue

Key Implementation Details

Files Changed

Core Implementation:

  • src/agentready/services/eval_harness/tbench_runner.py - Harbor subprocess integration
  • src/agentready/services/eval_harness/harbor_config.py - Configuration model
  • src/agentready/cli/benchmark.py - CLI command
  • src/agentready/utils/preflight.py - Dependency checking

Tests:

  • tests/unit/test_harbor_*.py - Harbor integration tests
  • tests/unit/utils/test_preflight.py - Preflight check tests (100% coverage)

Documentation:

  • CLAUDE.md - Updated with Harbor integration, preflight checks
  • docs/tbench/methodology.md - A/B testing methodology

Usage Examples

# Quick smoketest (1-2 tasks, ~2-5 min)
export ANTHROPIC_API_KEY=your-key-here
agentready benchmark --subset smoketest

# Full Terminal-Bench with Sonnet (~30-40 min)
agentready benchmark --subset full --model claude-sonnet-4-5

# Skip preflight checks (advanced)
agentready benchmark --subset smoketest --skip-preflight

Evidence of Completion

Harbor Integration Working:

  • Real Harbor result.json files in jobs/ directory
  • Successful Terminal-Bench task configurations
  • Trajectory file capture working
  • All tests passing

Commits (17 total):

  • 9e9cc32 - feat: display Harbor command with copy/paste format
  • 389958f - fix: override Harbor's hardcoded MiniMax API configuration
  • d6f583e - feat: display trajectory file path in benchmark summary
  • f9f6cfb - fix: set ANTHROPIC_AUTH_TOKEN for Harbor's Claude Code agent
  • 97b848f - feat: add automatic Harbor CLI preflight checks
  • b91d11b - fix: correct Harbor results parsing (Harbor 2.0 structure)
  • bdbabb2 - fix(security): implement critical security fixes
  • 4d5649a - feat: add Harbor framework integration (initial)
  • ...and 9 more commits

Follow-On Work

Immediate:

Future:

  • Dashboard integration for Terminal-Bench results
  • Historical tracking of benchmark performance
  • Leaderboard integration (if desired)

Resources

Implementation:

Related Issues:


Labels: enhancement, terminal-bench, harbor-integration, completed

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions