perf: latency/token experiment — plan + P0 harness#143

Draft
ivanmkc wants to merge 7 commits into master from perf/reduce-latency

ivanmkc (Owner) commented Jun 7, 2026

Summary

Foundation for the latency + token-reduction experiment. Tracks #142.

  • Plan: docs/plans/2026-06-07-latency-token-experimentation-plan.md
  • P0 harness (scripts/experiments/): agent_run.py (Tier B orchestrator, --dry-run safe), metrics.py (RunRecord + median/p95/bootstrap CI + proxy-log parse), config.py (runners/conditions/tasks), aggregate.py, pilot task suite, LiteLLM proxy config + spend logger, GCE vm/startup.sh + vm/provision.sh.
  • Tier A: corpus_run.py --metrics-out emits per-render JSONL.
  • Tests: 20 pytest cases (dry-run, no network) green.
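
For readers unfamiliar with the statistics named for metrics.py, here is a minimal sketch of median, p95, and a percentile-bootstrap confidence interval. It is not the code in this PR: the function names `percentile` and `bootstrap_ci`, the nearest-rank percentile rule, and the resample count are all assumptions chosen for illustration.

```python
import math
import random
import statistics


def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty sequence."""
    ordered = sorted(values)
    # Nearest-rank rule: smallest value with at least q% of the data at or below it.
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def bootstrap_ci(values, stat=statistics.median, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for `stat` (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so repeated runs give the same interval
    n = len(values)
    # Resample with replacement, recompute the statistic, and sort the results.
    stats = sorted(stat(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Example: five latency samples (seconds) with one slow outlier.
latencies = [1.0, 2.0, 3.0, 4.0, 100.0]
summary = {
    "median": statistics.median(latencies),
    "p95": percentile(latencies, 95),
    "ci95": bootstrap_ci(latencies),
}
```

With small per-cell sample counts the interval will be wide, which is one reason a pilot pass before the full ablation is a sensible order.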

Not yet

  • No code-path changes to the published CLI/viewer/plugin — measurement scaffolding only.
  • Real runner invocation + rubric judge land in P1; AGY headless is the flagged risk.

Decisions baked in

  • One shared model across all runners.
  • Pilot-first ({baseline, c1}) before the full ablation.

Experiment to reduce diagram-generation latency and token usage across three
runners (Claude Code, AGY, OpenCode), measured on VMs by comparing a baseline
build against a combined-fixes build. Decisions baked in: one shared model
across runners, and a pilot pass before the full ablation.

- corpus_run.py: --metrics-out emits per-render JSONL (Tier A spine) + test
- scripts/experiments: agent_run (Tier B orchestrator, --dry-run safe),
  metrics (RunRecord, median/p95/bootstrap CI, proxy-log parse), config
  (runners/conditions/tasks), aggregate (pool per-VM streams)
- pilot task suite, LiteLLM proxy config + spend logger callback
- GCE vm/startup.sh + vm/provision.sh (env-parameterized)
- pytest suite (20 tests, dry-run, no network)
…al slice)

- podman/: Containerfile (node+python+claude-code+opencode+litellm),
  run_local.sh (proxy container + per-cell containers + aggregate),
  entrypoint_cell.sh (build termchart per condition, run runner headless as
  non-root node, emit RunRecord)
- proxy: route shared-model -> Vertex Gemini 2.5 Flash via ADC; clean spend-log
  usage; Claude is a one-line EXPERIMENT_MODEL flip once Model Garden is enabled
- metrics: spend-log slice correlation (parse_proxy_log_slice, count_log_lines)
- cell_record.py: per-cell RunRecord from runner output + proxy spend slice
- proven: Claude Code in-container draws an ER diagram via termchart end-to-end
  on Vertex Gemini (57.7k in / 3.0k out tokens captured)
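
The spend-log slice correlation can be pictured as: count the log lines before a cell starts, run the cell, then sum usage over only the lines appended since. The sketch below reuses the names `count_log_lines` and `parse_proxy_log_slice` from the commit message, but the signatures, the JSONL layout, and the `prompt_tokens` / `completion_tokens` field names are assumptions, not the actual implementation or the real LiteLLM log schema.

```python
import json
from pathlib import Path


def count_log_lines(path):
    """Number of lines currently in the spend log (0 if it does not exist yet)."""
    p = Path(path)
    if not p.exists():
        return 0
    with p.open("r", encoding="utf-8") as fh:
        return sum(1 for _ in fh)


def parse_proxy_log_slice(path, start, end=None):
    """Sum token usage over log lines [start, end) written during one cell.

    Assumes one JSON object per line with hypothetical keys
    'prompt_tokens' and 'completion_tokens'.
    """
    totals = {"calls": 0, "prompt_tokens": 0, "completion_tokens": 0}
    with Path(path).open("r", encoding="utf-8") as fh:
        for i, line in enumerate(fh):
            if i < start or (end is not None and i >= end):
                continue  # outside this cell's slice
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entry = json.loads(line)
            totals["calls"] += 1
            totals["prompt_tokens"] += int(entry.get("prompt_tokens", 0))
            totals["completion_tokens"] += int(entry.get("completion_tokens", 0))
    return totals
```

Slicing by line offset only attributes usage correctly if a single cell writes to the proxy log at a time; concurrent cells would need a per-request tag instead.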

Refs #142

When TERMCHART_VIEWER_URL/TOKEN are unset, push and status return
EXIT_NO_VIEWER=4 (packages/cli/src/viewer-detect.ts:15), and the message is
'…are not set: no termchart viewer configured.' AGENTS.md claimed exit 3 with a
non-matching hint, which can mislead an agent into a wrong retry path.
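
To illustrate why the documented exit code matters, here is a minimal sketch of how a wrapper might branch on it. Only the value `EXIT_NO_VIEWER = 4` comes from the source; `classify_exit`, `run_cli`, and the state labels are hypothetical names for illustration.

```python
import subprocess

EXIT_NO_VIEWER = 4  # from the source: push/status return this when no viewer is configured


def classify_exit(returncode):
    """Map a CLI exit code to a coarse state (labels are illustrative)."""
    if returncode == 0:
        return "ok"
    if returncode == EXIT_NO_VIEWER:
        # Retrying cannot help here; the viewer URL/token must be configured first.
        return "no_viewer"
    return "error"


def run_cli(argv):
    """Run a command and return (state, stderr text)."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return classify_exit(proc.returncode), proc.stderr.strip()
```

An agent told to expect exit 3 would land in the generic "error" branch for a missing viewer and might retry, which is the wrong path described above.
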

The diagram-recipes examples are loaded verbatim into agent context when an
example is adapted. Pretty-printed, they were ~298 KB (the two *-matrix trees
alone ~89 KB across ~2,800 lines). Minifying to compact JSON keeps the parsed
data identical while cutting ~45% of the bytes (305,190 -> 167,886), so an agent
spends fewer tokens loading an example. The files are still valid JSON;
flow-geometry.test.ts JSON.parses them, so it is unaffected.

Fix T1 from the latency/token experiment plan. Refs #142.
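
The minification described above can be sketched as follows. This is an illustration, not the script used for the commit; `minify_json` and `savings_pct` are hypothetical names.

```python
import json


def minify_json(text):
    """Re-serialize JSON text with no insignificant whitespace."""
    data = json.loads(text)
    # separators=(",", ":") drops the spaces json.dumps adds by default.
    compact = json.dumps(data, separators=(",", ":"), ensure_ascii=False)
    # Round-trip check: the compact form must parse back to the same data.
    if json.loads(compact) != data:
        raise ValueError("minified JSON does not round-trip")
    return compact


def savings_pct(before_bytes, after_bytes):
    """Percentage of bytes removed."""
    return 100.0 * (1 - after_bytes / before_bytes)


# Byte counts quoted in the commit message.
reported = savings_pct(305_190, 167_886)
```

Because only whitespace changes, any consumer that parses the files rather than reading them as text sees no difference.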
… gate

- entrypoint_cell.sh: OpenCode provider config (openai baseURL->proxy,
  --model openai/shared-model); capture runner exit code
- cell_record.py: success = clean exit + >=1 model call (runner-agnostic; OpenCode
  emits text not Claude JSON)
- README: runner status table (Claude Code + OpenCode working; AGY deferred -
  no custom base-URL to share the proxy/model) + matrix command
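
The runner-agnostic success rule (clean exit plus at least one model call) is simple enough to state as a predicate. A minimal sketch, assuming a hypothetical `CellOutcome` record rather than the real RunRecord fields:

```python
from dataclasses import dataclass


@dataclass
class CellOutcome:
    """Hypothetical per-cell summary; field names are illustrative."""
    exit_code: int    # runner process exit code
    model_calls: int  # calls observed in the proxy spend-log slice


def is_success(outcome):
    """Success = clean exit and at least one model call through the proxy."""
    return outcome.exit_code == 0 and outcome.model_calls >= 1
```

Requiring a model call guards against a runner that exits 0 without ever reaching the proxy, and it avoids depending on any one runner's output format.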

Refs #142