Add benchmark harness and baseline results by jeremy · Pull Request #1 · basecamp/basecamp-cli

jeremy · 2026-01-16T20:59:03Z

Summary

Benchmark framework for comparing bcq CLI vs raw curl+jq API approaches.

Infrastructure

harness/ with matrix.sh, run.sh, triage.sh (VERSION=1)
12 task definitions (canonical: Task 12 overdue sweep)
inject-proxy.sh for deterministic 429/401 testing
Neutral validation via validate.sh (no bcq in validation path)

Skills for benchmark conditions

.claude-plugin/skills/bcq-basecamp/ (uses bcq CLI)
.claude-plugin/skills/raw-api-basecamp/ (curl + jq only)

Canonical results (baseline_soft_anchor_env_today, N=5)

Reliability

Model	bcq	raw
claude-sonnet	5/5 (100%)	5/5 (100%)
claude-haiku	5/5 (100%)	5/5 (100%)
gpt-5-mini	5/5 (100%)	0/5 (0%)
gpt-5-nano	5/5 (100%)	0/5 (0%)

Efficiency

Model	bcq turns	raw turns	bcq $/success	raw $/success
claude-sonnet	2.0	10.0	$0.016	$0.26
claude-haiku	3.0	24.6	$0.008	$0.56
gpt-5-mini	3.0	—	$0.005	—
gpt-5-nano	2.8	—	$0.001	—

Per successful run, bcq is about 16× cheaper than raw for Sonnet ($0.016 vs $0.26) and 70× cheaper for Haiku ($0.008 vs $0.56).
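
The multipliers follow directly from the $/success columns of the efficiency table. A quick check, as a sketch (`cost_ratio` is a made-up helper for illustration, not part of the harness):

```shell
#!/usr/bin/env bash
# Ratio of raw to bcq cost per successful run, rounded to the nearest integer.
# cost_ratio is a hypothetical helper, not something the harness ships.
cost_ratio() {
  awk -v raw="$1" -v bcq="$2" 'BEGIN { printf "%.0f\n", raw / bcq }'
}

cost_ratio 0.26 0.016   # claude-sonnet: 16.25, prints 16
cost_ratio 0.56 0.008   # claude-haiku: prints 70
```

No ratio can be computed for the GPT models, since raw never succeeded there and has no $/success value.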

Policy

BENCHMARKING.md defines quality gates (Smoke/Regression/Refresh)
reports/baseline.json is machine-readable with audit metadata
results/ is gitignored (ephemeral)

bcq library changes

Add pagination support via api_get_all
Add bcq todos sweep for bulk overdue processing
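
The PR does not show how `api_get_all` is implemented. As a rough sketch of the general idea, assuming the API advertises the next page through a `Link` header with `rel="next"`, a paginating loop in curl might look like the following. The names `next_link`, `get_all_pages`, and `BASECAMP_TOKEN` are hypothetical and chosen only for this example.

```shell
#!/usr/bin/env bash
# Sketch only: assumes Link-header pagination (rel="next").
# next_link, get_all_pages, and BASECAMP_TOKEN are hypothetical names.

# Print the rel="next" URL from a Link header value, or nothing if absent.
# Naive: splits on commas, so it would break on URLs that contain commas.
next_link() {
  printf '%s\n' "$1" | tr ',' '\n' |
    sed -n 's/^[[:space:]]*<\([^>]*\)>;[[:space:]]*rel="next".*$/\1/p' |
    head -n 1
}

# Fetch every page starting at $1, writing each page's JSON body to stdout.
get_all_pages() {
  local url="$1" headers link
  while [ -n "$url" ]; do
    headers="$(mktemp)"
    # -D saves the response headers so the Link header can be inspected.
    curl -fsS -D "$headers" -H "Authorization: Bearer $BASECAMP_TOKEN" "$url" ||
      { rm -f "$headers"; return 1; }
    link="$(grep -i '^link:' "$headers" | sed 's/^[^:]*:[[:space:]]*//' | tr -d '\r')"
    rm -f "$headers"
    url="$(next_link "$link")"   # empty when there is no next page, ending the loop
  done
}
```

If each page is a JSON array, piping the result through `jq -s 'add'` would concatenate the pages into one array.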

Test plan

./test/run.sh passes (315 tests)
Benchmark harness runs against live Basecamp instance
Triage classifies runs correctly

…cli module (#192)

* Add github.com/basecamp/cli shared module dependency
  Pin to v0.0.0-20260305004813-bc5ad283b855 (bc5ad28, main HEAD after PR #1 merge). Provides output, credstore, pkce, and oauthcallback packages extracted from this repo.
* Migrate internal/output to consume shared cli/output package
  Re-export exit codes, error codes, ExitCodeFor, NormalizeData, TruncationNotice, and TruncationNoticeWithTotal from the shared module. Type-alias Error for zero-cost compatibility with errors.As. ErrAuth and ErrForbiddenScope stay local (app-specific hint strings). Deletes ~330 lines of duplicated code including NormalizeData helpers, unmarshalPreservingNumbers, normalizeUnmarshaled, and the corresponding BenchmarkNormalizeUnmarshaled (now covered by shared module tests).
* Migrate internal/auth to consume shared credstore, pkce, oauthcallback
  Replace keyring.go implementation (~230 lines of keyring probing, file I/O, atomic writes, Windows workarounds) with a typed wrapper around credstore.Store (~70 lines). Credentials struct and Store API unchanged. Replace PKCE helpers (generateCodeVerifier/Challenge/State) with pkce.GenerateVerifier/Challenge/State. Replace waitForCallback with inline listener creation + oauthcallback.WaitForCallback. Delete tests for removed unexported functions (TestGenerateCodeVerifier, TestGenerateCodeChallenge, TestGenerateState, TestKeyFunction). Update remaining tests to construct Store via newTestStore helper.
* Pin github.com/basecamp/cli to v0.1.0 release tag
  Replaces pseudo-version v0.0.0-20260305004813-bc5ad283b855. Same code (bc5ad28), proper semver tag.
* Use wrapper functions instead of var for re-exported symbols
  Mutable vars allow accidental reassignment from other packages in the module. Thin wrapper functions preserve immutability while delegating to the shared module.

* Document output modes and CLI introspection in SKILL.md
  Rewrite Agent Invariants #1 and #5 to guide agents toward --md for human-facing output and --json for parsing. Replace the flat output modes code block with a goal-oriented table and add a CLI Introspection section documenting --agent --help for command discovery.
* Add --md flag to root help output
  Surface the Markdown output flag in the curated FLAGS section of basecamp --help, alongside --json and --quiet.
* Address PR review feedback on SKILL.md
  - Narrow invariant #5: only messages/comments convert Markdown to HTML; todos, documents, and cards send --content as-is
  - Fix --agent/--quiet description: errors still emit {ok:false,...} object
  - Remove misleading "default when piped" claim; advise explicit --json/--md
  - Add long, default, and usage fields to --agent --help JSON example

jeremy force-pushed the api-cli-bench branch from 304fa4f to a3ea5c0 on January 16, 2026 22:25

Remove hardcoded API docs from raw-api skill

01210c5

Give WebFetch access to consult official docs instead of pre-documenting all endpoints. Fairer benchmark comparison.

jeremy force-pushed the api-cli-bench branch from df4cad4 to 01210c5 on January 16, 2026 22:30

jeremy added 4 commits January 16, 2026 14:40

Fetch API docs from public URL with local cache

f91a6e8

- Default to https://raw.githubusercontent.com/basecamp/bc3-api/...
- Cache to ~/.cache/bcq/api-docs (BCQ_API_DOCS_CACHE_DIR override)
- Prefer local clone if present for dev
- Soften efficiency contract to anchor (correctness > speed)

Add api-docs.sh helper for raw-api skill

a8a1730

Outputs README path: local clone if present, else cached from remote. Supports BCQ_API_DOCS_URL and BCQ_API_DOCS_CACHE_DIR overrides.
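
The resolution order described here (local clone first, then cache, then remote fetch) could be sketched roughly as below. This is not the actual scripts/api-docs.sh: the function name `resolve_api_docs`, its clone-directory argument, and the `README.md` file name are assumptions. Only the two `BCQ_*` variables and the default cache path come from the commit messages, and because the default URL is elided in the source, the sketch requires `BCQ_API_DOCS_URL` to be set before it will fetch.

```shell
#!/usr/bin/env bash
# Sketch only: resolve_api_docs, its argument, and the README.md file name are
# hypothetical; the real scripts/api-docs.sh may differ.
resolve_api_docs() {
  local clone_dir="${1:-}"
  local cache_dir="${BCQ_API_DOCS_CACHE_DIR:-$HOME/.cache/bcq/api-docs}"
  local readme="$cache_dir/README.md"

  # 1. Prefer a local clone when one is present (dev workflow).
  if [ -n "$clone_dir" ] && [ -f "$clone_dir/README.md" ]; then
    printf '%s\n' "$clone_dir/README.md"
    return 0
  fi

  # 2. Otherwise use the cache, fetching from the remote only on a miss.
  if [ ! -f "$readme" ]; then
    # The default URL is elided in the source, so the override is required here.
    [ -n "${BCQ_API_DOCS_URL:-}" ] || { echo "BCQ_API_DOCS_URL is not set" >&2; return 1; }
    mkdir -p "$cache_dir" || return 1
    curl -fsSL "$BCQ_API_DOCS_URL" -o "$readme" || return 1
  fi
  printf '%s\n' "$readme"
}
```

Printing a path rather than the file contents lets the caller search the docs with its own tools instead of loading the whole README.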

Add basecamp-api-reference skill for docs lookup

50c9f56

Standalone skill for 'what endpoint?' questions. Uses scripts/api-docs.sh, doesn't require execution. raw-api-basecamp remains self-sufficient.

Slim down basecamp-api-reference skill

390b298

- Remove hardcoded Documentation Structure table
- Use ripgrep to find sections dynamically
- Remove Common Questions cheat-sheet
- Keep skill focused on how to fetch/navigate docs

jeremy merged commit 79cd2ff into main Jan 16, 2026

robzolkos mentioned this pull request Feb 1, 2026

Skill Stress Test 2 - CLI Command Syntax Errors During "Waiting on Response" Query #119

Closed

jeremy added a commit that referenced this pull request Feb 19, 2026

Merge pull request #1 from basecamp/api-cli-bench

a40248c

Add benchmark harness and baseline results

jeremy mentioned this pull request Mar 9, 2026

Document output modes for agents and surface --md in help #221

Merged

4 tasks

cubic-dev-ai bot mentioned this pull request Mar 15, 2026

Add deterministic mention syntax and harden fuzzy resolution #297

Merged

10 tasks

jeremy mentioned this pull request Mar 24, 2026

Upgrade to basecamp-sdk v0.7.1 — gauges, assignments, notifications, accounts #373

Merged

9 tasks

jeremy merged 6 commits into main from api-cli-bench
