@bradleyshep bradleyshep commented Oct 25, 2025

Description of Changes

Introduce a new LLM benchmarking app and supporting code.

  • CLI: llm with subcommands run, routes list, diff, ci-check.
  • Runner: executes globally numbered tasks; filters by --lang, --categories, --tasks, --providers, --models.
  • Providers/clients: route layer (provider:model) with HTTP LLM Vendor clients; env-driven keys/base URLs.
  • Evaluation: deterministic scorers (hash/equality, JSON shape/count, light schema/reducer parity) with clear failure messages.
  • Results: stable JSON schema; single-file HTML viewer to inspect/filter/export CSV.
  • Build & guards: build script for compile-time setup.
  • Docs: DEVELOP.md includes cargo llm … usage.

This PR is the initial addition of the app and its modules (runner, config, routes, prompt/segmentation, scorers, schema/types, defaults/constants/paths/hashing/combine, publishers, spacetime guard, HTML stats viewer).

How it works

  1. Pick what to run

    • Choose tasks (--tasks 0,7,12), or a language (--lang rust|csharp), or categories (--categories basics,schema).
    • Optionally limit vendors/models (--providers …, --models …).
  2. Resolve routes

    • Read env (API keys + base URLs) and build the active set (e.g., openai:gpt-5).
  3. Build context

    • Start Spacetime
    • Publish golden answer modules
    • Prepare prompts and send them to the LLM model
    • Attempt to publish LLM module
  4. Execute calls

    • Run each selected task against the selected models and languages.
  5. Score outputs

    • Apply deterministic scorers (hash/equality, JSON shape/count, simple schema/reducer checks).
    • Record the score and any short failure reason.
  6. Update results file

    • Write/update the single results JSON with task/route outcomes, timings, and summaries.
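The deterministic scoring in step 5 could be sketched roughly as below. This is an illustrative sketch only: the `Score` enum, `score_by_hash` function, and hashing choice are hypothetical names, not the PR's actual API.

```rust
// Sketch of a deterministic equality/hash scorer with a short failure
// reason, as the PR describes. Names and types are illustrative.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Outcome of scoring one task output against its golden answer.
#[derive(Debug)]
pub enum Score {
    Pass,
    /// Short, human-readable failure reason recorded in the results file.
    Fail(String),
}

fn hash_str(s: &str) -> u64 {
    let mut h = DefaultHasher::new();
    s.hash(&mut h);
    h.finish()
}

/// Compare a candidate output to the golden answer by content hash.
pub fn score_by_hash(golden: &str, candidate: &str) -> Score {
    let (g, c) = (hash_str(golden), hash_str(candidate));
    if g == c {
        Score::Pass
    } else {
        Score::Fail(format!("hash mismatch: expected {g:x}, got {c:x}"))
    }
}
```

Hashing both sides keeps the comparison deterministic and the failure message compact, at the cost of not showing a diff; the JSON shape/count and schema checks would follow the same pass/fail-with-reason shape.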

API and ABI breaking changes

None. New application and modules; no existing public APIs/ABIs altered.

Expected complexity level and risk

4/5. New CLI, routing, evaluation, and artifact format.

  • External model APIs may rate-limit/timeout; concurrency tunable via LLM_BENCH_CONCURRENCY / LLM_BENCH_ROUTE_CONCURRENCY.
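Reading those concurrency knobs might look like the following sketch. The env var names come from the PR; the helper function, its default, and the parsing behavior are assumptions for illustration.

```rust
// Sketch: resolve a concurrency limit from an env var with a fallback.
// Variable names match the PR; the parsing policy here is an assumption.
use std::env;

fn concurrency_from_env(var: &str, default: usize) -> usize {
    env::var(var)
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .filter(|&n| n > 0) // ignore "0" or garbage, fall back to default
        .unwrap_or(default)
}
```

For example, `concurrency_from_env("LLM_BENCH_CONCURRENCY", 8)` would cap overall parallelism, and `concurrency_from_env("LLM_BENCH_ROUTE_CONCURRENCY", 2)` could cap per-route parallelism (the defaults shown are hypothetical).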

Testing

I ran the full test matrix and generated results for every task against every vendor, model, and language (Rust and C#). I also tested the CI check locally using act.

Please verify

  • llm run --tasks 0,1,2 (explicit run)
  • llm run --lang rust --categories basics (filters)
  • llm run --categories basics,schema (multiple categories)
  • llm run --lang csharp (language switch)
  • llm run --providers openai,anthropic --models "openai:gpt-5 anthropic:claude-sonnet-4-5" (provider/model limits)
  • llm run --hash-only (hash/integrity check only)
  • llm run --goldens-only (test goldens only)
  • llm run --force (skip hash check)
  • llm ci-check
  • Stats viewer loads the JSON; filtering and CSV export work
  • CI works as intended

@bfops bfops added the release-any To be landed in any release window label Oct 27, 2025
bradleyshep and others added 6 commits November 3, 2025 13:11
…ain permissions

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Signed-off-by: bradleyshep <148254416+bradleyshep@users.noreply.github.com>
Add retry logic for signal-killed processes (SIGSEGV) with up to 2 retries
and 500ms delay between attempts. Also reduce C# build concurrency from 8
to 4 by default to prevent resource contention in dotnet/WASI SDK builds.

The C# concurrency can be configured via LLM_BENCH_CSHARP_CONCURRENCY env var.
Set MSBUILDDISABLENODEREUSE=1 and DOTNET_CLI_USE_MSBUILD_SERVER=0 to
prevent resource contention when running multiple dotnet publish commands
in parallel on GitHub Actions runners.

See: dotnet/msbuild#6657
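The retry policy described in that commit (re-run a child process killed by a signal, up to 2 extra attempts with a 500 ms delay) could be sketched as below; `run_with_retry` and its signature are hypothetical, not the actual code.

```rust
// Sketch: retry a child process that was terminated by a signal
// (e.g. SIGSEGV). On Unix, ExitStatus::code() is None in that case.
use std::process::{Command, ExitStatus};
use std::thread::sleep;
use std::time::Duration;

fn run_with_retry(
    mut make: impl FnMut() -> Command, // rebuild the Command per attempt
    retries: u32,                      // extra attempts, e.g. 2
) -> std::io::Result<ExitStatus> {
    let mut attempt = 0;
    loop {
        let status = make().status()?;
        // A normal exit (even a failing one) has an exit code; only a
        // signal-killed process is retried.
        if status.code().is_some() || attempt >= retries {
            return Ok(status);
        }
        attempt += 1;
        sleep(Duration::from_millis(500));
    }
}
```

Taking a closure rather than a `Command` sidesteps the fact that a `Command` is consumed awkwardly across attempts, and keeps each retry starting from a fresh process description.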
@clockwork-labs-bot

LLM Benchmark Results (ci-quickfix)

| Language | Mode | Category | Tests Passed | Pass % | Task Pass % |
|----------|------|----------|--------------|--------|-------------|
| Rust | rustdoc_json | basics | 20/27 | 74.1% | 75.0% |
| Rust | rustdoc_json | schema | 23/34 | 67.6% | 60.0% |
| Rust | rustdoc_json | total | 43/61 | 70.5% | 68.2% |
| C# | docs | basics | 27/27 | 100.0% | 100.0% |
| C# | docs | schema | 31/34 | 91.2% | 90.0% |
| C# | docs | total | 58/61 | 95.1% | 95.5% |

Generated at: 2026-01-06T00:39:43.087Z

@cloutiertyler

I think we're okay to merge this now that /update-llm-benchmark is able to run the fix automatically on GitHub.

@jdetter jdetter force-pushed the bradley/llm-benchmark branch from e51b4e2 to 04eb91a Compare January 6, 2026 16:39
@jdetter jdetter mentioned this pull request Jan 6, 2026
@cloutiertyler cloutiertyler added this pull request to the merge queue Jan 6, 2026
Merged via the queue into master with commit b75bf6d Jan 6, 2026
44 of 47 checks passed