perf: latency/token experiment — plan + P0 harness#143
Draft
ivanmkc wants to merge 7 commits into
Experiment to reduce diagram-generation latency and token usage across three runners (Claude Code, AGY, OpenCode), measured on VMs by comparing a baseline build against a combined-fixes build. Decisions baked in: one shared model across all runners, and a pilot-first first pass.
- corpus_run.py: --metrics-out emits per-render JSONL (Tier A spine) + test
- scripts/experiments: agent_run (Tier B orchestrator, --dry-run safe), metrics (RunRecord, median/p95/bootstrap CI, proxy-log parse), config (runners/conditions/tasks), aggregate (pool per-VM streams)
- pilot task suite, LiteLLM proxy config + spend logger callback
- GCE vm/startup.sh + vm/provision.sh (env-parameterized)
- pytest suite (20 tests, dry-run, no network)
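The summary statistics named above (median, p95, bootstrap CI) can be sketched as follows. This is a minimal illustration, not the contents of the PR's metrics module: the function names, the nearest-rank percentile definition, and the resample count are all assumptions.

```python
import math
import random
import statistics


def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def bootstrap_ci(values, stat=statistics.median, n_resamples=2000,
                 alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for `stat` over `values`.

    Resamples with replacement, recomputes the statistic each time, and
    returns the (alpha/2, 1 - alpha/2) quantiles of those estimates.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    n = len(values)
    estimates = sorted(
        stat(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical per-render latencies in seconds (not measured data).
latencies = [1.0, 2.0, 3.0, 4.0, 100.0]
median = statistics.median(latencies)
p95 = percentile(latencies, 95)
ci_low, ci_high = bootstrap_ci(latencies)
```

With small pilot samples the interval will be wide, which is one reason to report it alongside the point estimate rather than the median alone.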
…al slice)
- podman/: Containerfile (node+python+claude-code+opencode+litellm), run_local.sh (proxy container + per-cell containers + aggregate), entrypoint_cell.sh (build termchart per condition, run runner headless as non-root node, emit RunRecord)
- proxy: route shared-model -> Vertex Gemini 2.5 Flash via ADC; clean spend-log usage; Claude is a one-line EXPERIMENT_MODEL flip once Model Garden is enabled
- metrics: spend-log slice correlation (parse_proxy_log_slice, count_log_lines)
- cell_record.py: per-cell RunRecord from runner output + proxy spend slice
- proven: Claude Code in-container draws an ER diagram via termchart end-to-end on Vertex Gemini (57.7k in / 3.0k out tokens captured)
Refs #142
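One way to read "spend-log slice correlation" is: record the log's line count before a cell runs, then attribute only the lines appended afterwards to that cell. The sketch below assumes this interpretation. The function names come from the commit message, but their signatures, the JSONL layout, and the `prompt_tokens` / `completion_tokens` field names are assumptions made for illustration.

```python
import json
from pathlib import Path


def count_log_lines(path):
    """Return the number of lines currently in the log (0 if it is missing)."""
    p = Path(path)
    if not p.exists():
        return 0
    with p.open() as fh:
        return sum(1 for _ in fh)


def parse_proxy_log_slice(path, start, end=None):
    """Sum token usage over log lines in [start, end), one JSON object per line.

    Field names are hypothetical; adapt them to the real spend-log schema.
    """
    totals = {"calls": 0, "prompt_tokens": 0, "completion_tokens": 0}
    with open(path) as fh:
        for index, line in enumerate(fh):
            if index < start or (end is not None and index >= end):
                continue
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            totals["calls"] += 1
            totals["prompt_tokens"] += int(record.get("prompt_tokens", 0))
            totals["completion_tokens"] += int(record.get("completion_tokens", 0))
    return totals
```

Usage would be `start = count_log_lines(log)` before launching the runner and `parse_proxy_log_slice(log, start)` after it exits. This only attributes usage correctly if cells sharing one proxy log run one at a time.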
When TERMCHART_VIEWER_URL/TOKEN are unset, push and status return EXIT_NO_VIEWER=4 (packages/cli/src/viewer-detect.ts:15), and the message is '…are not set: no termchart viewer configured.' AGENTS.md claimed exit 3 with a non-matching hint, which can mislead an agent into a wrong retry path.
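To illustrate why the documented exit code matters, here is a minimal sketch of how a calling harness might branch on it. Only the value `EXIT_NO_VIEWER = 4` comes from the text above; the helper names and the three outcome labels are hypothetical, and the command to run is passed in rather than assumed.

```python
import subprocess
import sys

EXIT_NO_VIEWER = 4  # value stated above (packages/cli/src/viewer-detect.ts:15)


def classify_exit(returncode):
    """Map a CLI exit code to a coarse outcome label (labels are hypothetical)."""
    if returncode == 0:
        return "ok"
    if returncode == EXIT_NO_VIEWER:
        return "no_viewer"  # configuration problem: retrying will not help
    return "error"


def run_and_classify(argv):
    """Run a command given as an argument list and classify its exit code."""
    completed = subprocess.run(argv, capture_output=True, text=True)
    return classify_exit(completed.returncode)


# Stand-in child process that exits with code 4, in place of the real CLI.
outcome = run_and_classify([sys.executable, "-c", "import sys; sys.exit(4)"])
```

An agent that had been told to expect exit 3 would fall through to the generic error branch here, which is the wrong-retry-path risk the fix addresses.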
The diagram-recipes examples are loaded verbatim into agent context when an example is adapted. Pretty-printed, they were ~298 KB (the two *-matrix trees alone ~89 KB across ~2,800 lines). Minifying to compact JSON keeps the data identical (only insignificant whitespace is removed) while using ~45% fewer bytes (305,190 -> 167,886), cutting the tokens an agent spends to load an example. The files are still valid JSON; flow-geometry.test.ts JSON.parses them, so it is unaffected. Fix T1 from the latency/token experiment plan. Refs #142.
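The minification step and the quoted saving can be sketched as below. This is an illustration using Python's standard `json` module, not the script used in the PR; the helper names are assumptions. Note that a parse-and-dump round trip may renormalize number formatting or drop duplicate keys, so the equality check is worth keeping.

```python
import json


def minify_json(text):
    """Re-serialize JSON text with no insignificant whitespace."""
    return json.dumps(json.loads(text), separators=(",", ":"), ensure_ascii=False)


def byte_saving(before, after):
    """Fraction of bytes removed, e.g. 0.45 for a 45% reduction."""
    return 1 - after / before


# A tiny pretty-printed document standing in for a recipe example.
pretty = json.dumps({"a": [1, 2, {"b": "x"}]}, indent=2)
compact = minify_json(pretty)

# The parsed data must be unchanged by minification.
same_data = json.loads(compact) == json.loads(pretty)

# Byte counts quoted in the commit message above.
saving = byte_saving(305_190, 167_886)
```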
… gate
- entrypoint_cell.sh: OpenCode provider config (openai baseURL->proxy, --model openai/shared-model); capture runner exit code
- cell_record.py: success = clean exit + >=1 model call (runner-agnostic; OpenCode emits text not Claude JSON)
- README: runner status table (Claude Code + OpenCode working; AGY deferred - no custom base-URL to share the proxy/model) + matrix command
Refs #142
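The runner-agnostic success rule described above (clean exit plus at least one model call) can be expressed without parsing any runner-specific output. The sketch below is an assumption about shape only: the `CellOutcome` type and its field names are hypothetical and are not taken from cell_record.py.

```python
from dataclasses import dataclass


@dataclass
class CellOutcome:
    """Hypothetical per-cell summary: runner exit code and proxy call count."""
    exit_code: int
    model_calls: int


def is_success(outcome):
    """Success = the runner exited cleanly AND the proxy saw >= 1 model call."""
    return outcome.exit_code == 0 and outcome.model_calls >= 1


results = [
    is_success(CellOutcome(exit_code=0, model_calls=3)),  # normal run
    is_success(CellOutcome(exit_code=0, model_calls=0)),  # exited without calling the model
    is_success(CellOutcome(exit_code=1, model_calls=5)),  # runner crashed mid-run
]
```

Requiring at least one model call guards against a runner that exits 0 without doing any work, for example because of a misconfigured provider.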
Summary
Foundation for the latency + token-reduction experiment. Tracks #142.
- docs/plans/2026-06-07-latency-token-experimentation-plan.md
- scripts/experiments/: agent_run.py (Tier B orchestrator, --dry-run safe), metrics.py (RunRecord + median/p95/bootstrap CI + proxy-log parse), config.py (runners/conditions/tasks), aggregate.py, pilot task suite, LiteLLM proxy config + spend logger, GCE vm/startup.sh + vm/provision.sh.
- corpus_run.py --metrics-out emits per-render JSONL.
Not yet
Decisions baked in