Add benchmark harness and baseline results#1

Merged
jeremy merged 6 commits into main from api-cli-bench on Jan 16, 2026

@jeremy jeremy commented Jan 16, 2026

Summary

Benchmark framework for comparing bcq CLI vs raw curl+jq API approaches.

Infrastructure

  • harness/ with matrix.sh, run.sh, triage.sh (VERSION=1)
  • 12 task definitions (canonical: Task 12 overdue sweep)
  • inject-proxy.sh for deterministic 429/401 testing
  • Neutral validation via validate.sh (no bcq in validation path)
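
One way to picture "deterministic" fault injection is a fixed schedule that maps request numbers to the status codes to inject, so every run sees the same 429/401 at the same point. The sketch below is illustrative only: `FaultSchedule` and `next_status` are hypothetical names, and nothing here is taken from inject-proxy.sh itself.

```python
from typing import Dict, Optional


class FaultSchedule:
    """Hypothetical helper: maps 1-based request numbers to statuses to inject."""

    def __init__(self, faults: Dict[int, int]) -> None:
        self.faults = dict(faults)
        self.seen = 0  # number of requests observed so far

    def next_status(self) -> Optional[int]:
        """Return the status to inject for this request, or None to pass through."""
        self.seen += 1
        return self.faults.get(self.seen)


# Inject a 429 on the first request and a 401 on the third; others pass through.
schedule = FaultSchedule({1: 429, 3: 401})
print([schedule.next_status() for _ in range(4)])
```

Because the schedule depends only on the request count, repeated runs hit the same failures in the same order, which is what makes pass/fail comparisons between conditions meaningful.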

Skills for benchmark conditions

  • .claude-plugin/skills/bcq-basecamp/ (uses bcq CLI)
  • .claude-plugin/skills/raw-api-basecamp/ (curl + jq only)

Canonical results (baseline_soft_anchor_env_today, N=5)

Reliability

| Model | bcq | raw |
|---|---|---|
| claude-sonnet | 5/5 (100%) | 5/5 (100%) |
| claude-haiku | 5/5 (100%) | 5/5 (100%) |
| gpt-5-mini | 5/5 (100%) | 0/5 (0%) |
| gpt-5-nano | 5/5 (100%) | 0/5 (0%) |

Efficiency

| Model | bcq turns | raw turns | bcq $/success | raw $/success |
|---|---|---|---|---|
| claude-sonnet | 2.0 | 10.0 | $0.016 | $0.26 |
| claude-haiku | 3.0 | 24.6 | $0.008 | $0.56 |
| gpt-5-mini | 3.0 | | $0.005 | |
| gpt-5-nano | 2.8 | | $0.001 | |

bcq is 16× cheaper for Sonnet, 70× cheaper for Haiku.
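
The multipliers follow directly from the $/success figures in the Efficiency table. A minimal check, with the numbers copied from that table (the `cost_ratio` helper is just an illustrative name):

```python
# USD per successful run, copied from the Efficiency table.
COST_PER_SUCCESS = {
    "claude-sonnet": {"bcq": 0.016, "raw": 0.26},
    "claude-haiku": {"bcq": 0.008, "raw": 0.56},
}


def cost_ratio(model: str) -> float:
    """Return how many times more a raw success costs than a bcq success."""
    costs = COST_PER_SUCCESS[model]
    return costs["raw"] / costs["bcq"]


for model in COST_PER_SUCCESS:
    print(f"{model}: {cost_ratio(model):.0f}x")
```

The GPT models are left out because raw had no successful runs, so a raw cost per success is undefined for them.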

Policy

  • BENCHMARKING.md defines quality gates (Smoke/Regression/Refresh)
  • reports/baseline.json is machine-readable with audit metadata
  • results/ is gitignored (ephemeral)

bcq library changes

  • Add pagination support via api_get_all
  • Add bcq todos sweep for bulk overdue processing
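
For readers unfamiliar with the pattern, here is a rough sketch of what a "get all pages" helper generally does. It assumes pagination is signalled by a `Link` header with `rel="next"`, which this PR does not spell out. `get_all`, `next_url`, and the injected `fetch` callable are hypothetical names, not the actual `api_get_all` implementation.

```python
import re
from typing import Callable, List, Optional, Tuple

# A fetcher returns (items, link_header) for a URL. It is injected so the
# sketch runs without network access; a real one would call the HTTP API.
Fetcher = Callable[[str], Tuple[list, Optional[str]]]

_NEXT_LINK = re.compile(r'<([^>]+)>\s*;\s*rel="next"')


def next_url(link_header: Optional[str]) -> Optional[str]:
    """Extract the rel="next" URL from a Link header, if there is one."""
    if not link_header:
        return None
    match = _NEXT_LINK.search(link_header)
    return match.group(1) if match else None


def get_all(url: str, fetch: Fetcher, max_pages: int = 100) -> List:
    """Follow next links and concatenate items, stopping after max_pages."""
    items: List = []
    current: Optional[str] = url
    for _ in range(max_pages):
        if current is None:
            break
        page, link = fetch(current)
        items.extend(page)
        current = next_url(link)
    return items
```

The `max_pages` cap is a safety choice in this sketch so that a server returning a circular next link cannot loop forever.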

Test plan

  • ./test/run.sh passes (315 tests)
  • Benchmark harness runs against live Basecamp instance
  • Triage classifies runs correctly

Benchmark framework for comparing bcq vs raw-API approaches:

Infrastructure:
- harness/ with matrix.sh, run.sh, triage.sh (VERSION=1)
- 12 task definitions (canonical: Task 12 overdue sweep)
- inject-proxy.sh for deterministic 429/401 testing
- Neutral validation via validate.sh (no bcq in validation path)

Skills for benchmark conditions:
- .claude-plugin/skills/bcq-basecamp/ (uses bcq CLI)
- .claude-plugin/skills/raw-api-basecamp/ (curl + jq only)

Canonical results (baseline_soft_anchor_env_today, N=5):
- bcq: 100% pass rate across all models
- raw: 100% for Claude models, 0% for GPT models
- bcq is 16× cheaper for Sonnet, 70× cheaper for Haiku

Policy:
- BENCHMARKING.md defines quality gates (Smoke/Regression/Refresh)
- reports/baseline.json is machine-readable with audit metadata
- results/ is gitignored (ephemeral)

bcq library changes:
- Add pagination support via api_get_all
- Add `bcq todos sweep` for bulk overdue processing

Give WebFetch access to consult official docs instead of
pre-documenting all endpoints. Fairer benchmark comparison.
- Default to https://raw.githubusercontent.com/basecamp/bc3-api/...
- Cache to ~/.cache/bcq/api-docs (BCQ_API_DOCS_CACHE_DIR override)
- Prefer local clone if present for dev
- Soften efficiency contract to anchor (correctness > speed)

Outputs README path: local clone if present, else cached from remote.
Supports BCQ_API_DOCS_URL and BCQ_API_DOCS_CACHE_DIR overrides.

Standalone skill for 'what endpoint?' questions.
Uses scripts/api-docs.sh, doesn't require execution.
raw-api-basecamp remains self-sufficient.

- Remove hardcoded Documentation Structure table
- Use ripgrep to find sections dynamically
- Remove Common Questions cheat-sheet
- Keep skill focused on how to fetch/navigate docs
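
The docs lookup order described in the commit notes above (local clone if present, otherwise a cache directory that `BCQ_API_DOCS_CACHE_DIR` can override) could be sketched roughly as follows. This is not the contents of scripts/api-docs.sh: `resolve_readme` is a hypothetical name, the `README.md` filename is an assumption, and the download step is omitted.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Default cache location named in the commit notes.
DEFAULT_CACHE_DIR = "~/.cache/bcq/api-docs"


def resolve_readme(env: Mapping[str, str], local_clone: Optional[Path]) -> Path:
    """Prefer a local clone's README; otherwise point into the cache directory."""
    # README.md is an assumed filename; the source only says "README path".
    if local_clone is not None and (local_clone / "README.md").is_file():
        return local_clone / "README.md"
    cache_dir = Path(env.get("BCQ_API_DOCS_CACHE_DIR", DEFAULT_CACHE_DIR)).expanduser()
    return cache_dir / "README.md"


# With no local clone, the result depends only on the environment override.
print(resolve_readme(os.environ, None))
```

Passing the environment in as a mapping rather than reading `os.environ` inside the function keeps the lookup easy to exercise with different overrides.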
@jeremy jeremy merged commit 79cd2ff into main Jan 16, 2026
jeremy added a commit that referenced this pull request Feb 19, 2026
Add benchmark harness and baseline results
jeremy added a commit that referenced this pull request Mar 5, 2026
Pin to v0.0.0-20260305004813-bc5ad283b855 (bc5ad28, main HEAD after
PR #1 merge). Provides output, credstore, pkce, and oauthcallback
packages extracted from this repo.
jeremy added a commit that referenced this pull request Mar 5, 2026
…cli module (#192)

* Add github.com/basecamp/cli shared module dependency

Pin to v0.0.0-20260305004813-bc5ad283b855 (bc5ad28, main HEAD after
PR #1 merge). Provides output, credstore, pkce, and oauthcallback
packages extracted from this repo.

* Migrate internal/output to consume shared cli/output package

Re-export exit codes, error codes, ExitCodeFor, NormalizeData,
TruncationNotice, and TruncationNoticeWithTotal from the shared module.
Type-alias Error for zero-cost compatibility with errors.As.

ErrAuth and ErrForbiddenScope stay local (app-specific hint strings).
Deletes ~330 lines of duplicated code including NormalizeData helpers,
unmarshalPreservingNumbers, normalizeUnmarshaled, and the corresponding
BenchmarkNormalizeUnmarshaled (now covered by shared module tests).

* Migrate internal/auth to consume shared credstore, pkce, oauthcallback

Replace keyring.go implementation (~230 lines of keyring probing, file
I/O, atomic writes, Windows workarounds) with a typed wrapper around
credstore.Store (~70 lines). Credentials struct and Store API unchanged.

Replace PKCE helpers (generateCodeVerifier/Challenge/State) with
pkce.GenerateVerifier/Challenge/State. Replace waitForCallback with
inline listener creation + oauthcallback.WaitForCallback.

Delete tests for removed unexported functions (TestGenerateCodeVerifier,
TestGenerateCodeChallenge, TestGenerateState, TestKeyFunction). Update
remaining tests to construct Store via newTestStore helper.

* Pin github.com/basecamp/cli to v0.1.0 release tag

Replaces pseudo-version v0.0.0-20260305004813-bc5ad283b855.
Same code (bc5ad28), proper semver tag.

* Use wrapper functions instead of var for re-exported symbols

Mutable vars allow accidental reassignment from other packages in
the module. Thin wrapper functions preserve immutability while
delegating to the shared module.
jeremy added a commit that referenced this pull request Mar 9, 2026
Rewrite Agent Invariants #1 and #5 to guide agents toward --md for
human-facing output and --json for parsing. Replace the flat output
modes code block with a goal-oriented table and add a CLI Introspection
section documenting --agent --help for command discovery.
jeremy added a commit that referenced this pull request Mar 9, 2026
* Document output modes and CLI introspection in SKILL.md

Rewrite Agent Invariants #1 and #5 to guide agents toward --md for
human-facing output and --json for parsing. Replace the flat output
modes code block with a goal-oriented table and add a CLI Introspection
section documenting --agent --help for command discovery.

* Add --md flag to root help output

Surface the Markdown output flag in the curated FLAGS section of
basecamp --help, alongside --json and --quiet.

* Address PR review feedback on SKILL.md

- Narrow invariant #5: only messages/comments convert Markdown to HTML;
  todos, documents, and cards send --content as-is
- Fix --agent/--quiet description: errors still emit {ok:false,...} object
- Remove misleading "default when piped" claim; advise explicit --json/--md
- Add long, default, and usage fields to --agent --help JSON example