Merged
22 commits
cdc89a3
chore: update leaderboard data [skip ci]
github-actions[bot] Dec 5, 2025
3405142
fix: resolve 45 test failures across CLI, services, and assessors (#4)
jeremyeder Dec 8, 2025
f41b0d6
chore(release): 2.10.0 [skip ci]
semantic-release-bot Dec 8, 2025
16132e2
fix: resolve 45 test failures across CLI, services, and assessors (#4)
jeremyeder Dec 8, 2025
1d32bc7
chore(release): 2.10.0 [skip ci]
semantic-release-bot Dec 8, 2025
4d5649a
feat: add Harbor framework integration for real Terminal-Bench evalua…
jeremyeder Dec 9, 2025
5dadcd4
feat: implement blocking test strategy with tiered CI jobs
jeremyeder Dec 9, 2025
bdbabb2
fix(security): implement critical security fixes from code review
jeremyeder Dec 9, 2025
b91d11b
fix: correct Harbor results parsing to match actual Harbor 2.0 JSON s…
jeremyeder Dec 9, 2025
3f0a1c0
chore: save Harbor integration WIP before rebase onto v2.15.0
jeremyeder Dec 9, 2025
901310c
chore: restore version to 2.15.0 after rebase
jeremyeder Dec 9, 2025
a8fecbd
fix: remove duplicate assessor registration for architecture_decision…
jeremyeder Dec 9, 2025
af34fd0
feat: redesign assess command output with detailed results table
jeremyeder Dec 9, 2025
97b848f
fix: validate API key before HarborConfig initialization
jeremyeder Dec 9, 2025
f34c02b
feat: add automatic Harbor CLI preflight checks with dataset management
jeremyeder Dec 9, 2025
d1a99b3
Merge branch 'main' into 002-harbor-real-integration
jeremyeder Dec 9, 2025
f9f6cfb
fix: pass full environment to Harbor subprocess
jeremyeder Dec 9, 2025
d6f583e
fix: set ANTHROPIC_AUTH_TOKEN for Harbor's Claude Code agent
jeremyeder Dec 9, 2025
cebdf67
feat: display trajectory file path in benchmark summary
jeremyeder Dec 10, 2025
389958f
fix: override Harbor's hardcoded MiniMax API configuration
jeremyeder Dec 10, 2025
9e9cc32
feat: display Harbor command with copy/paste ready format
jeremyeder Dec 10, 2025
2655c29
Merge upstream/main into 002-harbor-real-integration
jeremyeder Dec 10, 2025
93 changes: 93 additions & 0 deletions .github/workflows/tests_simplified.yml
@@ -0,0 +1,93 @@
name: Tests (Simplified)

on:
  pull_request:
  push:
    branches: [main, master]
  workflow_dispatch:

jobs:
  # Combined blocking tests and linting in one job to reduce CI runtime
  blocking-checks:
    name: Blocking Tests & Quality Checks
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.12', '3.13']

    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -e ".[dev]"

      # Run code quality checks (only on one Python version to save time)
      - name: Code Quality Checks
        if: matrix.python-version == '3.13'
        run: |
          black --check .
          isort --check .
          ruff check .

      # Run critical tests
      - name: Run Critical Tests
        run: |
          pytest tests/e2e/test_critical_paths.py tests/unit/cli/test_main.py tests/unit/test_models.py \
            -v --no-cov --tb=short
        timeout-minutes: 5

  # Non-blocking comprehensive tests
  comprehensive-tests:
    name: Full Test Suite (Non-blocking)
    runs-on: ubuntu-latest
    continue-on-error: true # Don't fail CI

    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.13'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -e ".[dev]"

      - name: Run all tests with coverage
        run: |
          pytest tests/unit/ --cov=src --cov-report=xml --cov-report=html --cov-report=term
        continue-on-error: true
        timeout-minutes: 20

      - name: Upload coverage
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: coverage-report
          path: htmlcov/
          retention-days: 30

  # Platform testing (simplified to single job)
  platform-test:
    name: macOS Compatibility
    runs-on: macos-latest
    continue-on-error: true

    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.13'

      - name: Install and test
        run: |
          python -m pip install --upgrade pip
          pip install -e ".[dev]"
          pytest tests/e2e/test_critical_paths.py tests/unit/cli/test_main.py \
            -v --no-cov --tb=short || echo "Tests failed but continuing"
        timeout-minutes: 10
5 changes: 5 additions & 0 deletions .gitignore
@@ -56,6 +56,11 @@ coverage.xml
# Planning documents (was .plans/)
plans/
.cache/

# Harbor framework temp directories
**/tbench-results/
**/.harbor-cache/
# Harbor benchmark output directory
jobs/

# Repository lists (generated/temporary)
repos.txt
*-repos.txt
26 changes: 11 additions & 15 deletions CHANGELOG.md
@@ -10,28 +10,24 @@

### Bug Fixes

* resolve all test suite failures - achieve zero failures ([#180](https://github.com/ambient-code/agentready/issues/180)) ([990fa2d](https://github.com/ambient-code/agentready/commit/990fa2d4725842df60af151d1ba058cd43a90d3c)), closes [#148](https://github.com/ambient-code/agentready/issues/148) [#147](https://github.com/ambient-code/agentready/issues/147) [#145](https://github.com/ambient-code/agentready/issues/145)
* resolve YAML syntax error in update-docs workflow and add actionlint ([#173](https://github.com/ambient-code/agentready/issues/173)) ([97b06af](https://github.com/ambient-code/agentready/commit/97b06af1d2adc17ec385d658310f3562f19b1a95))
* disable attestations for Test PyPI to avoid conflict ([#155](https://github.com/jeremyeder/agentready/issues/155)) ([a33e3cd](https://github.com/jeremyeder/agentready/commit/a33e3cd2d86d4a461701e906070ab3eae8ca8082)), closes [pypa/#action-pypi-publish](https://github.com/jeremyeder/agentready/issues/action-pypi-publish)
* leaderboard workflow and SSH URL support ([#147](https://github.com/jeremyeder/agentready/issues/147)) ([de28cd0](https://github.com/jeremyeder/agentready/commit/de28cd0a6037a0951ba370aa73832553c088cfb8))
* resolve 45 test failures across CLI, services, and assessors ([#4](https://github.com/jeremyeder/agentready/issues/4)) ([3405142](https://github.com/jeremyeder/agentready/commit/340514251d40f283afa24d5c3068f294727fd839)), closes [#178](https://github.com/jeremyeder/agentready/issues/178) [#178](https://github.com/jeremyeder/agentready/issues/178)
* resolve broken links and workflow failures ([#160](https://github.com/jeremyeder/agentready/issues/160)) ([fbf5cf7](https://github.com/jeremyeder/agentready/commit/fbf5cf7a1fdcb65ef4d3943a2d84e46aa831d337))
* skip PR comments for external forks to prevent permission errors ([#163](https://github.com/jeremyeder/agentready/issues/163)) ([2a29fb8](https://github.com/jeremyeder/agentready/commit/2a29fb84485a1ac6beff1675131bf50c1b702585))


### Features

* replace markdown-link-check with lychee for link validation ([#177](https://github.com/ambient-code/agentready/issues/177)) ([f1a4545](https://github.com/ambient-code/agentready/commit/f1a4545e4718b735df3e1fa7e0b60eba9ed0173b))
* Terminal-Bench eval harness (MVP Phase 1) ([#178](https://github.com/ambient-code/agentready/issues/178)) ([d06bab4](https://github.com/ambient-code/agentready/commit/d06bab42848847df26d83c7a44e5ee0e84ae0445)), closes [#171](https://github.com/ambient-code/agentready/issues/171)
* add ambient-code/agentready to leaderboard ([#148](https://github.com/jeremyeder/agentready/issues/148)) ([621152e](https://github.com/jeremyeder/agentready/commit/621152e46bd8e9505e3bc1775d2cd61a80af5a62))
* add quay/quay to leaderboard ([#162](https://github.com/jeremyeder/agentready/issues/162)) ([d6e8df0](https://github.com/jeremyeder/agentready/commit/d6e8df0e9d92c4ec82004c5e62c798986feb1000))
* Add weekly research update skill and automation ([#145](https://github.com/jeremyeder/agentready/issues/145)) ([7ba17a6](https://github.com/jeremyeder/agentready/commit/7ba17a6b045251cbc9f26b5c2f4a0ec31d89dd11))
* automate PyPI publishing with trusted publishing (OIDC) ([#154](https://github.com/jeremyeder/agentready/issues/154)) ([71f4632](https://github.com/jeremyeder/agentready/commit/71f4632cb188d8c9db377c9f216c047e20727f99)), closes [pypa/#action-pypi-publish](https://github.com/jeremyeder/agentready/issues/action-pypi-publish)

## [2.14.1](https://github.com/ambient-code/agentready/compare/v2.14.0...v2.14.1) (2025-12-05)

### Performance Improvements

### Bug Fixes

* resolve YAML syntax error in continuous-learning workflow ([#172](https://github.com/ambient-code/agentready/issues/172)) ([3d40fcc](https://github.com/ambient-code/agentready/commit/3d40fcccd4e8d722303d322716454869ca7db9d0))

# [2.14.0](https://github.com/ambient-code/agentready/compare/v2.13.0...v2.14.0) (2025-12-05)


### Features

* container support ([#171](https://github.com/ambient-code/agentready/issues/171)) ([c6874ea](https://github.com/ambient-code/agentready/commit/c6874ea035775ac86ef5012bbfdf52e7b96f556f))
* implement lazy loading for heavy CLI commands ([#151](https://github.com/jeremyeder/agentready/issues/151)) ([6a7cd4e](https://github.com/jeremyeder/agentready/commit/6a7cd4e147ebfdfc95921b86599a5b650db76153))

# [2.13.0](https://github.com/ambient-code/agentready/compare/v2.12.3...v2.13.0) (2025-12-04)

163 changes: 36 additions & 127 deletions CLAUDE.md
@@ -192,133 +192,6 @@ class MyAssessor(BaseAssessor):

---

## Terminal-Bench Eval Harness

**Purpose**: Empirically measure the impact of AgentReady assessors on Terminal-Bench performance through systematic A/B testing.

### Overview

The eval harness tests each assessor independently to measure its specific impact on agentic development benchmarks. This provides evidence-based validation of AgentReady's recommendations.

**Architecture**:
1. **Baseline**: Run Terminal-Bench on unmodified repository (5 iterations)
2. **Per-Assessor Test**: Apply single assessor remediation → measure delta
3. **Aggregate**: Rank assessors by impact, calculate tier statistics
4. **Dashboard**: Generate interactive visualization for GitHub Pages
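
The baseline, delta, and ranking steps above can be sketched in a few lines. This is an illustrative sketch only, not the code in `src/agentready/services/eval_harness/`: the function name `rank_by_impact`, the score lists, and the `readme_structure` assessor ID are hypothetical, and "impact" is assumed here to mean the difference between mean scores.

```python
from statistics import mean


def rank_by_impact(baseline_scores, assessor_scores):
    """Rank assessors by mean score delta against the baseline (largest first).

    baseline_scores: scores from runs on the unmodified repository.
    assessor_scores: dict mapping assessor ID -> scores after its remediation.
    Both shapes are assumptions made for this sketch.
    """
    baseline_mean = mean(baseline_scores)
    deltas = {
        assessor_id: mean(scores) - baseline_mean
        for assessor_id, scores in assessor_scores.items()
    }
    # Sort by delta, highest impact first.
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores, for illustration only.
ranked = rank_by_impact(
    [50, 52, 48],
    {"readme_structure": [50, 51, 49], "claude_md_file": [55, 57, 53]},
)
```

A mean delta alone says nothing about noise between iterations, which is why the harness pairs it with the significance criteria described under Statistical Methods.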

**Components**:
- `src/agentready/services/eval_harness/` - Core services (TbenchRunner, BaselineEstablisher, AssessorTester, ResultsAggregator, DashboardGenerator)
- `src/agentready/models/eval_harness.py` - Data models (TbenchResult, BaselineMetrics, AssessorImpact, EvalSummary)
- `src/agentready/cli/eval_harness.py` - CLI commands (baseline, test-assessor, run-tier, summarize, dashboard)
- `docs/tbench.md` - Interactive dashboard with Chart.js
- `docs/tbench/methodology.md` - Detailed statistical methodology

### Running Evaluations

```bash
# 1. Establish baseline (run Terminal-Bench 5 times on unmodified repo)
agentready eval-harness baseline --repo . --iterations 5

# 2. Test single assessor
agentready eval-harness test-assessor \
--assessor-id claude_md_file \
--iterations 5

# 3. Test all Tier 1 assessors
agentready eval-harness run-tier --tier 1 --iterations 5

# 4. Aggregate results (rank by impact, calculate statistics)
agentready eval-harness summarize --verbose

# 5. Generate dashboard data files for GitHub Pages
agentready eval-harness dashboard --verbose
```

### File Structure

```
.agentready/eval_harness/ # Results storage (gitignored)
├── baseline/
│ ├── run_001.json # Individual tbench runs
│ ├── run_002.json
│ ├── ...
│ └── summary.json # BaselineMetrics
├── assessors/
│ ├── claude_md_file/
│ │ ├── finding.json # Assessment result
│ │ ├── fixes_applied.log # Remediation log
│ │ ├── run_001.json # Post-remediation runs
│ │ ├── ...
│ │ └── impact.json # AssessorImpact metrics
│ └── ...
└── summary.json # EvalSummary (ranked impacts)

docs/_data/tbench/ # Dashboard data (committed)
├── summary.json
├── ranked_assessors.json
├── tier_impacts.json
├── baseline.json
└── stats.json
```

### Statistical Methods

**Significance Criteria** (both required):
- **P-value < 0.05**: 95% confidence (two-sample t-test)
- **|Cohen's d| > 0.2**: Meaningful effect size

**Effect Size Interpretation**:
- **0.2 ≤ |d| < 0.5**: Small effect
- **0.5 ≤ |d| < 0.8**: Medium effect
- **|d| ≥ 0.8**: Large effect
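
As a rough illustration of these thresholds, the sketch below computes Cohen's d with a pooled standard deviation and applies both criteria. It assumes the pooled-variance form of Cohen's d and takes the p-value as an input (the t-test itself is not reproduced here). The function names and the "negligible" label for |d| < 0.2 are assumptions, not taken from the AgentReady code.

```python
import math
import statistics


def cohens_d(baseline, treated):
    """Cohen's d using the pooled sample standard deviation.

    Assumes each group has at least two values and non-zero pooled variance.
    """
    n1, n2 = len(baseline), len(treated)
    var1 = statistics.variance(baseline)  # sample variance (n - 1 denominator)
    var2 = statistics.variance(treated)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (statistics.mean(treated) - statistics.mean(baseline)) / pooled_sd


def effect_label(d):
    """Map |d| to the interpretation bands listed above."""
    magnitude = abs(d)
    if magnitude >= 0.8:
        return "large"
    if magnitude >= 0.5:
        return "medium"
    if magnitude >= 0.2:
        return "small"
    return "negligible"  # label assumed; the source does not name this band


def is_significant(p_value, d):
    """Both criteria must hold: p < 0.05 and |d| > 0.2."""
    return p_value < 0.05 and abs(d) > 0.2
```

With only 5 iterations per condition, a large d can still miss the p-value threshold, so the two checks are kept separate rather than folded into one score.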

### Current Status

**Phase 1 (MVP)**: Mocked Terminal-Bench integration ✅
- All core services implemented and tested
- CLI commands functional
- Dashboard with Chart.js visualizations
- 6 CLI unit tests + 5 integration tests passing

**Phase 2 (Planned)**: Real Terminal-Bench integration
- Harbor framework client
- Actual benchmark submissions
- Leaderboard integration

### Testing

```bash
# Run eval harness tests
pytest tests/unit/test_eval_harness*.py -v
pytest tests/integration/test_eval_harness_e2e.py -v
```

**Test Coverage**:
- Models: 90-95%
- Services: 85-90%
- CLI: 100% (help commands validated)
- Integration: End-to-end workflow tested

### Troubleshooting

**Issue**: `FileNotFoundError: Baseline directory not found`
**Solution**: Run `agentready eval-harness baseline` first

**Issue**: `No assessor results found`
**Solution**: Run `agentready eval-harness test-assessor` or `run-tier` first

**Issue**: Mocked scores seem unrealistic
**Solution**: This is expected in Phase 1 (mocked mode) - real integration coming in Phase 2

### Documentation

- **User Guide**: `docs/eval-harness-guide.md` - Step-by-step tutorials
- **Methodology**: `docs/tbench/methodology.md` - Statistical methods explained
- **Dashboard**: `docs/tbench.md` - Interactive results visualization
- **Plan**: `.claude/plans/quirky-squishing-plum.md` - Implementation roadmap

---

## Project Structure

```
@@ -352,6 +225,34 @@ agentready/
- **Black** - Code formatter
- **isort** - Import sorter
- **Ruff** - Fast Python linter
- **Harbor** - Evaluation framework (optional, for benchmarks)

---

## Preflight Checks

AgentReady validates dependencies before running benchmarks:

- **Harbor CLI**: Checked automatically before Terminal-Bench runs
- **Interactive installation**: Prompts user with `uv tool install harbor` (or `pip install harbor` fallback)
- **Opt-out**: Advanced users can pass the `--skip-preflight` flag to bypass the checks
- **Package manager fallback**: Prefers `uv`, falls back to `pip` if `uv` not available
- **Security**: Uses `safe_subprocess_run()` with 5-minute timeout

**Implementation**:
- Module: `src/agentready/utils/preflight.py`
- Tests: `tests/unit/utils/test_preflight.py` (100% coverage)
- Integration: `src/agentready/cli/benchmark.py`
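
The `uv`-then-`pip` fallback described above might look roughly like the following. This is a hedged sketch, not the contents of `src/agentready/utils/preflight.py`: the function name `harbor_install_command` and its injectable `which` parameter are hypothetical, and the real module runs the install through `safe_subprocess_run()`, which is not reproduced here.

```python
import shutil


def harbor_install_command(which=shutil.which):
    """Return None if Harbor is already on PATH, else the command to suggest.

    `which` is injectable so the lookup can be faked in tests; by default it
    is shutil.which, which returns the executable path or None.
    """
    if which("harbor"):
        return None  # Harbor CLI found, nothing to install
    if which("uv"):
        return ["uv", "tool", "install", "harbor"]  # preferred installer
    return ["pip", "install", "harbor"]  # fallback when uv is unavailable
```

Returning the command as a list rather than executing it keeps the decision logic separate from the prompt and the subprocess call, which makes each piece easier to test.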

**Usage Examples**:

```bash
# Normal usage (preflight check runs automatically)
agentready benchmark --subset smoketest

# Skip preflight (advanced users)
agentready benchmark --subset smoketest --skip-preflight
```

---

@@ -520,3 +421,11 @@ Use the @agent-github-pages-docs to [action] based on:
**Last Updated**: 2025-12-10 by Jeremy Eder
**AgentReady Version**: 2.16.0
**Self-Assessment**: 80.0/100 (Gold) ✨

## Active Technologies
- Python 3.11+ (AgentReady standard, aligns with "N and N-1" policy) (002-harbor-real-integration)
- File-based (Harbor outputs to `--jobs-dir`, JSON results parsed from filesystem) (002-harbor-real-integration)

## Recent Changes
- 002-harbor-real-integration: Added Python 3.11+ (AgentReady standard, aligns with "N and N-1" policy)
- Build generic interfaces first, then build consumers of those interfaces. This approach forces our interfaces to be more generic, pluggable, and simple to extend.
21 changes: 21 additions & 0 deletions README.md
@@ -90,6 +90,27 @@ After installing globally:
agentready assess .
```

### Harbor CLI (for Benchmarks)

Harbor is required for running Terminal-Bench evaluations:

```bash
# AgentReady prompts to install Harbor automatically; to install it manually:
uv tool install harbor

# Alternative: Use pip if uv is not available
pip install harbor

# Verify installation
harbor --version
```

**Skip automatic checks**: Advanced users who prefer to skip the automatic Harbor check can pass `--skip-preflight`:

```bash
agentready benchmark --skip-preflight --subset smoketest
```

### Assessment Only

For one-time analysis without infrastructure changes: