ambient-code · jeremyeder · Dec 10, 2025 · Dec 5, 2025 · Dec 8, 2025 · Dec 8, 2025
diff --git a/.github/workflows/tests_simplified.yml b/.github/workflows/tests_simplified.yml
@@ -0,0 +1,93 @@
+name: Tests (Simplified)
+
+on:
+  pull_request:
+  push:
+    branches: [main, master]
+  workflow_dispatch:
+
+jobs:
+  # Combined blocking tests and linting in one job to reduce CI runtime
+  blocking-checks:
+    name: Blocking Tests & Quality Checks
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ['3.12', '3.13']
+
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -e ".[dev]"
+
+      # Run code quality checks (only on one Python version to save time)
+      - name: Code Quality Checks
+        if: matrix.python-version == '3.13'
+        run: |
+          black --check .
+          isort --check .
+          ruff check .
+
+      # Run critical tests
+      - name: Run Critical Tests
+        run: |
+          pytest tests/e2e/test_critical_paths.py tests/unit/cli/test_main.py tests/unit/test_models.py \
+            -v --no-cov --tb=short
+        timeout-minutes: 5
+
+  # Non-blocking comprehensive tests
+  comprehensive-tests:
+    name: Full Test Suite (Non-blocking)
+    runs-on: ubuntu-latest
+    continue-on-error: true  # Don't fail CI
+
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: '3.13'
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -e ".[dev]"
+
+      - name: Run all tests with coverage
+        run: |
+          pytest tests/unit/ --cov=src --cov-report=xml --cov-report=html --cov-report=term
+        continue-on-error: true
+        timeout-minutes: 20
+
+      - name: Upload coverage
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: coverage-report
+          path: htmlcov/
+          retention-days: 30
+
+  # Platform testing (simplified to single job)
+  platform-test:
+    name: macOS Compatibility
+    runs-on: macos-latest
+    continue-on-error: true
+
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: '3.13'
+
+      - name: Install and test
+        run: |
+          python -m pip install --upgrade pip
+          pip install -e ".[dev]"
+          pytest tests/e2e/test_critical_paths.py tests/unit/cli/test_main.py \
+            -v --no-cov --tb=short || echo "Tests failed but continuing"
+        timeout-minutes: 10
diff --git a/.gitignore b/.gitignore
@@ -56,6 +56,11 @@ coverage.xml
 plans/  # Planning documents (was .plans/)
 .cache/
 
+# Harbor framework temp directories and benchmark output (jobs/)
+**/tbench-results/
+**/.harbor-cache/
+jobs/
+
 # Repository lists (generated/temporary)
 repos.txt
 *-repos.txt

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -10,28 +10,24 @@
 
 ### Bug Fixes
 
-* resolve all test suite failures - achieve zero failures ([#180](https://github.com/ambient-code/agentready/issues/180)) ([990fa2d](https://github.com/ambient-code/agentready/commit/990fa2d4725842df60af151d1ba058cd43a90d3c)), closes [#148](https://github.com/ambient-code/agentready/issues/148) [#147](https://github.com/ambient-code/agentready/issues/147) [#145](https://github.com/ambient-code/agentready/issues/145)
-* resolve YAML syntax error in update-docs workflow and add actionlint ([#173](https://github.com/ambient-code/agentready/issues/173)) ([97b06af](https://github.com/ambient-code/agentready/commit/97b06af1d2adc17ec385d658310f3562f19b1a95))
+* disable attestations for Test PyPI to avoid conflict ([#155](https://github.com/jeremyeder/agentready/issues/155)) ([a33e3cd](https://github.com/jeremyeder/agentready/commit/a33e3cd2d86d4a461701e906070ab3eae8ca8082)), closes [pypa/#action-pypi-publish](https://github.com/jeremyeder/agentready/issues/action-pypi-publish)
+* leaderboard workflow and SSH URL support ([#147](https://github.com/jeremyeder/agentready/issues/147)) ([de28cd0](https://github.com/jeremyeder/agentready/commit/de28cd0a6037a0951ba370aa73832553c088cfb8))
+* resolve 45 test failures across CLI, services, and assessors ([#4](https://github.com/jeremyeder/agentready/issues/4)) ([3405142](https://github.com/jeremyeder/agentready/commit/340514251d40f283afa24d5c3068f294727fd839)), closes [#178](https://github.com/jeremyeder/agentready/issues/178)
+* resolve broken links and workflow failures ([#160](https://github.com/jeremyeder/agentready/issues/160)) ([fbf5cf7](https://github.com/jeremyeder/agentready/commit/fbf5cf7a1fdcb65ef4d3943a2d84e46aa831d337))
+* skip PR comments for external forks to prevent permission errors ([#163](https://github.com/jeremyeder/agentready/issues/163)) ([2a29fb8](https://github.com/jeremyeder/agentready/commit/2a29fb84485a1ac6beff1675131bf50c1b702585))
 
 
 ### Features
 
-* replace markdown-link-check with lychee for link validation ([#177](https://github.com/ambient-code/agentready/issues/177)) ([f1a4545](https://github.com/ambient-code/agentready/commit/f1a4545e4718b735df3e1fa7e0b60eba9ed0173b))
-* Terminal-Bench eval harness (MVP Phase 1) ([#178](https://github.com/ambient-code/agentready/issues/178)) ([d06bab4](https://github.com/ambient-code/agentready/commit/d06bab42848847df26d83c7a44e5ee0e84ae0445)), closes [#171](https://github.com/ambient-code/agentready/issues/171)
+* add ambient-code/agentready to leaderboard ([#148](https://github.com/jeremyeder/agentready/issues/148)) ([621152e](https://github.com/jeremyeder/agentready/commit/621152e46bd8e9505e3bc1775d2cd61a80af5a62))
+* add quay/quay to leaderboard ([#162](https://github.com/jeremyeder/agentready/issues/162)) ([d6e8df0](https://github.com/jeremyeder/agentready/commit/d6e8df0e9d92c4ec82004c5e62c798986feb1000))
+* Add weekly research update skill and automation ([#145](https://github.com/jeremyeder/agentready/issues/145)) ([7ba17a6](https://github.com/jeremyeder/agentready/commit/7ba17a6b045251cbc9f26b5c2f4a0ec31d89dd11))
+* automate PyPI publishing with trusted publishing (OIDC) ([#154](https://github.com/jeremyeder/agentready/issues/154)) ([71f4632](https://github.com/jeremyeder/agentready/commit/71f4632cb188d8c9db377c9f216c047e20727f99)), closes [pypa/#action-pypi-publish](https://github.com/jeremyeder/agentready/issues/action-pypi-publish)
 
-## [2.14.1](https://github.com/ambient-code/agentready/compare/v2.14.0...v2.14.1) (2025-12-05)
 
+### Performance Improvements
 
-### Bug Fixes
-
-* resolve YAML syntax error in continuous-learning workflow ([#172](https://github.com/ambient-code/agentready/issues/172)) ([3d40fcc](https://github.com/ambient-code/agentready/commit/3d40fcccd4e8d722303d322716454869ca7db9d0))
-
-# [2.14.0](https://github.com/ambient-code/agentready/compare/v2.13.0...v2.14.0) (2025-12-05)
-
-
-### Features
-
-* container support ([#171](https://github.com/ambient-code/agentready/issues/171)) ([c6874ea](https://github.com/ambient-code/agentready/commit/c6874ea035775ac86ef5012bbfdf52e7b96f556f))
+* implement lazy loading for heavy CLI commands ([#151](https://github.com/jeremyeder/agentready/issues/151)) ([6a7cd4e](https://github.com/jeremyeder/agentready/commit/6a7cd4e147ebfdfc95921b86599a5b650db76153))
 
 # [2.13.0](https://github.com/ambient-code/agentready/compare/v2.12.3...v2.13.0) (2025-12-04)
 

diff --git a/CLAUDE.md b/CLAUDE.md
@@ -192,133 +192,6 @@ class MyAssessor(BaseAssessor):
 
 ---
 
-## Terminal-Bench Eval Harness
-
-**Purpose**: Empirically measure the impact of AgentReady assessors on Terminal-Bench performance through systematic A/B testing.
-
-### Overview
-
-The eval harness tests each assessor independently to measure its specific impact on agentic development benchmarks. This provides evidence-based validation of AgentReady's recommendations.
-
-**Architecture**:
-1. **Baseline**: Run Terminal-Bench on unmodified repository (5 iterations)
-2. **Per-Assessor Test**: Apply single assessor remediation → measure delta
-3. **Aggregate**: Rank assessors by impact, calculate tier statistics
-4. **Dashboard**: Generate interactive visualization for GitHub Pages
-
-**Components**:
-- `src/agentready/services/eval_harness/` - Core services (TbenchRunner, BaselineEstablisher, AssessorTester, ResultsAggregator, DashboardGenerator)
-- `src/agentready/models/eval_harness.py` - Data models (TbenchResult, BaselineMetrics, AssessorImpact, EvalSummary)
-- `src/agentready/cli/eval_harness.py` - CLI commands (baseline, test-assessor, run-tier, summarize, dashboard)
-- `docs/tbench.md` - Interactive dashboard with Chart.js
-- `docs/tbench/methodology.md` - Detailed statistical methodology
-
-### Running Evaluations
-
-```bash
-# 1. Establish baseline (run Terminal-Bench 5 times on unmodified repo)
-agentready eval-harness baseline --repo . --iterations 5
-
-# 2. Test single assessor
-agentready eval-harness test-assessor \
-  --assessor-id claude_md_file \
-  --iterations 5
-
-# 3. Test all Tier 1 assessors
-agentready eval-harness run-tier --tier 1 --iterations 5
-
-# 4. Aggregate results (rank by impact, calculate statistics)
-agentready eval-harness summarize --verbose
-
-# 5. Generate dashboard data files for GitHub Pages
-agentready eval-harness dashboard --verbose
-```
-
-### File Structure
-
-```
-.agentready/eval_harness/          # Results storage (gitignored)
-├── baseline/
-│   ├── run_001.json              # Individual tbench runs
-│   ├── run_002.json
-│   ├── ...
-│   └── summary.json              # BaselineMetrics
-├── assessors/
-│   ├── claude_md_file/
-│   │   ├── finding.json          # Assessment result
-│   │   ├── fixes_applied.log     # Remediation log
-│   │   ├── run_001.json          # Post-remediation runs
-│   │   ├── ...
-│   │   └── impact.json           # AssessorImpact metrics
-│   └── ...
-└── summary.json                   # EvalSummary (ranked impacts)
-
-docs/_data/tbench/                 # Dashboard data (committed)
-├── summary.json
-├── ranked_assessors.json
-├── tier_impacts.json
-├── baseline.json
-└── stats.json
-```
-
-### Statistical Methods
-
-**Significance Criteria** (both required):
-- **P-value < 0.05**: 95% confidence (two-sample t-test)
-- **|Cohen's d| > 0.2**: Meaningful effect size
-
-**Effect Size Interpretation**:
-- **0.2 ≤ |d| < 0.5**: Small effect
-- **0.5 ≤ |d| < 0.8**: Medium effect
-- **|d| ≥ 0.8**: Large effect
-
-### Current Status
-
-**Phase 1 (MVP)**: Mocked Terminal-Bench integration ✅
-- All core services implemented and tested
-- CLI commands functional
-- Dashboard with Chart.js visualizations
-- 6 CLI unit tests + 5 integration tests passing
-
-**Phase 2 (Planned)**: Real Terminal-Bench integration
-- Harbor framework client
-- Actual benchmark submissions
-- Leaderboard integration
-
-### Testing
-
-```bash
-# Run eval harness tests
-pytest tests/unit/test_eval_harness*.py -v
-pytest tests/integration/test_eval_harness_e2e.py -v
-```
-
-**Test Coverage**:
-- Models: 90-95%
-- Services: 85-90%
-- CLI: 100% (help commands validated)
-- Integration: End-to-end workflow tested
-
-### Troubleshooting
-
-**Issue**: `FileNotFoundError: Baseline directory not found`
-**Solution**: Run `agentready eval-harness baseline` first
-
-**Issue**: `No assessor results found`
-**Solution**: Run `agentready eval-harness test-assessor` or `run-tier` first
-
-**Issue**: Mocked scores seem unrealistic
-**Solution**: This is expected in Phase 1 (mocked mode) - real integration coming in Phase 2
-
-### Documentation
-
-- **User Guide**: `docs/eval-harness-guide.md` - Step-by-step tutorials
-- **Methodology**: `docs/tbench/methodology.md` - Statistical methods explained
-- **Dashboard**: `docs/tbench.md` - Interactive results visualization
-- **Plan**: `.claude/plans/quirky-squishing-plum.md` - Implementation roadmap
-
----
-
 ## Project Structure
 
 ```
@@ -352,6 +225,34 @@ agentready/
 - **Black** - Code formatter
 - **isort** - Import sorter
 - **Ruff** - Fast Python linter
+- **Harbor** - Evaluation framework (optional, for benchmarks)
+
+---
+
+## Preflight Checks
+
+AgentReady validates dependencies before running benchmarks:
+
+- **Harbor CLI**: Checked automatically before Terminal-Bench runs
+- **Interactive installation**: Prompts user with `uv tool install harbor` (or `pip install harbor` fallback)
+- **Opt-out**: Advanced users can pass the `--skip-preflight` flag to bypass the checks
+- **Package manager fallback**: Prefers `uv`, falls back to `pip` if `uv` is not available
+- **Security**: Uses `safe_subprocess_run()` with 5-minute timeout
+
+**Implementation**:
+- Module: `src/agentready/utils/preflight.py`
+- Tests: `tests/unit/utils/test_preflight.py` (100% coverage)
+- Integration: `src/agentready/cli/benchmark.py`
+
+**Usage Examples**:
+
+```bash
+# Normal usage (preflight check runs automatically)
+agentready benchmark --subset smoketest
+
+# Skip preflight (advanced users)
+agentready benchmark --subset smoketest --skip-preflight
+```
 
 ---
 
@@ -520,3 +421,11 @@ Use the @agent-github-pages-docs to [action] based on:
 **Last Updated**: 2025-12-10 by Jeremy Eder
 **AgentReady Version**: 2.16.0
 **Self-Assessment**: 80.0/100 (Gold) ✨
+
+## Active Technologies
+- Python 3.11+ (AgentReady standard, aligns with "N and N-1" policy) (002-harbor-real-integration)
+- File-based (Harbor outputs to `--jobs-dir`, JSON results parsed from filesystem) (002-harbor-real-integration)
+
+## Recent Changes
+- 002-harbor-real-integration: Added Python 3.11+ (AgentReady standard, aligns with "N and N-1" policy)
+- Build generic interfaces first, then build consumers of those interfaces. This approach forces our interfaces to be more generic, pluggable, and simple to extend.
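
The Preflight Checks section added to CLAUDE.md describes the flow: look for the Harbor CLI, offer to install it with `uv` (falling back to `pip`), and honor `--skip-preflight`. A minimal sketch of that flow follows. It is not the contents of `src/agentready/utils/preflight.py`: the function names `harbor_install_command` and `check_harbor` are hypothetical, and plain `subprocess.run` stands in for the project's `safe_subprocess_run()`, whose signature is not shown in this diff.

```python
import shutil
import subprocess
from typing import Callable, List, Optional


def harbor_install_command(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Pick the install command: prefer uv, fall back to pip (hypothetical helper)."""
    if which("uv"):
        return ["uv", "tool", "install", "harbor"]
    return ["pip", "install", "harbor"]


def check_harbor(
    skip_preflight: bool = False,
    which: Callable[[str], Optional[str]] = shutil.which,
    confirm: Callable[[str], bool] = lambda prompt: input(prompt).strip().lower()
    in ("y", "yes"),
    run: Callable[..., "subprocess.CompletedProcess"] = subprocess.run,
) -> bool:
    """Return True if the benchmark may proceed (hypothetical helper)."""
    # --skip-preflight: trust the caller and do no checks at all.
    if skip_preflight:
        return True
    # Harbor is already on PATH, so there is nothing to install.
    if which("harbor"):
        return True
    cmd = harbor_install_command(which)
    # Interactive installation: ask before changing the user's environment.
    if not confirm(f"Harbor CLI not found. Install with `{' '.join(cmd)}`? [y/N] "):
        return False
    # 5-minute timeout, matching the limit stated in CLAUDE.md.
    # subprocess.run is an assumption standing in for safe_subprocess_run().
    result = run(cmd, timeout=300, check=False)
    return result.returncode == 0
```

Passing `which`, `confirm`, and `run` as parameters is one way to keep the check testable without touching the real PATH or prompting a user, which would make full coverage of a module like this practical.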
diff --git a/README.md b/README.md
@@ -90,6 +90,27 @@ After installing globally:
 agentready assess .
 ```
 
+### Harbor CLI (for Benchmarks)
+
+Harbor is required for running Terminal-Bench evaluations:
+
+```bash
+# AgentReady prompts to install Harbor automatically; to install it manually:
+uv tool install harbor
+
+# Alternative: Use pip if uv is not available
+pip install harbor
+
+# Verify installation
+harbor --version
+```
+
+**Skip automatic checks**: Advanced users can bypass the automatic Harbor check with `--skip-preflight`:
+
+```bash
+agentready benchmark --skip-preflight --subset smoketest
+```
+
 ### Assessment Only
 
 For one-time analysis without infrastructure changes: