A benchmark harness for evaluating and comparing coding agents (LLM-powered CLI tools) across a suite of synthetic software engineering tasks. Measures correctness and speed.
The harness runs each configured agent against a set of isolated coding tasks, then executes hidden test suites to score the results. Agents work in temporary workspaces and never see the tests, ensuring a fair evaluation.
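Conceptually, each evaluation copies a task's `repo/` into a throwaway workspace, lets the agent modify it, and then runs the hidden tests against the result. The sketch below illustrates that flow under assumed names and commands; the real orchestration lives in `harness/run.py` and `harness/runner.py` and may differ.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Conceptual per-task flow (a sketch, not the harness's actual code): the agent
# works in a throwaway copy of repo/, while tests/ stays outside the workspace.
def evaluate_task(agent_cmd: list[str], task_dir: Path) -> subprocess.CompletedProcess:
    workspace = Path(tempfile.mkdtemp(prefix=task_dir.name + "-"))
    shutil.copytree(task_dir / "repo", workspace, dirs_exist_ok=True)

    prompt = (task_dir / "prompt.md").read_text()
    # Invoke the agent CLI inside the isolated workspace (command shape is illustrative).
    subprocess.run(agent_cmd + [prompt], cwd=workspace, timeout=300, check=False)

    # Score by running the hidden test suite against the modified workspace.
    return subprocess.run(
        ["python3", "-m", "pytest", str(task_dir / "tests"), "-v"],
        cwd=workspace,
        capture_output=True,
        text=True,
    )
```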
Default agents: Claude Code (claude-opus-4-6) and Codex CLI (gpt-5.3-codex)
Metrics collected per task:
- Test pass rate (correctness)
- Wall-clock time
11 tasks spanning multiple languages and categories:
| # | Task | Language | Category |
|---|---|---|---|
| 00 | smoke-test | Python | bugfix |
| 01 | python-bugfix-csv | Python | bugfix |
| 02 | c-bugfix-linkedlist | C | bugfix |
| 03 | typescript-feature-table-filter | TypeScript | feature |
| 04 | python-feature-pagination | Python | feature |
| 05 | typescript-scratch-task-queue | TypeScript | scratch |
| 06 | cpp-scratch-lru-cache | C++ | scratch |
| 07 | python-refactor-monolith | Python | refactor |
| 08 | typescript-refactor-callbacks | TypeScript | refactor |
| 09 | c-multifile-segfault | C/C++ | multifile |
| 10 | fullstack-angular-python | Python/Angular | multifile |
Each task contains:
```
tasks/<name>/
  metadata.json   # language, category, timeout
  prompt.md       # instruction given to the agent
  repo/           # source code (broken or incomplete)
  tests/          # hidden test suite
```
- Python 3.11+
- Agent CLIs installed and on PATH (e.g. `claude`, `codex`)
- Language toolchains for the tasks you want to run (Python, Node.js, GCC/G++, Make)
```
pip install -r requirements.txt
```

```
# Full benchmark (all agents, all tasks)
python3 -m harness.run

# Dry run -- test the repos as-is without invoking agents
python3 -m harness.run --dry-run

# Custom config or tasks directory
python3 -m harness.run --config path/to/config.yaml --tasks-dir path/to/tasks
```

| Flag | Description |
|---|---|
| `--dry-run` | Skip agent invocation, run tests on unmodified repos |
| `--config PATH` | Path to config YAML (default: `harness/config.yaml`) |
| `--tasks-dir PATH` | Path to tasks directory (default: `tasks/`) |
Agents, test runners, and the output directory are defined in `harness/config.yaml`.
```
agents:
  my-agent:
    command: "my-cli"
    args: ["--prompt", "{prompt}", "--model", "{model}"]
    model: "my-model-name"
    timeout_seconds: 300
```

`{prompt}`, `{model}`, and `{workspace}` are replaced at runtime.
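As a rough sketch of how that substitution might work (illustrative only; the actual expansion in `harness/runner.py` may differ):

```python
# Illustrative placeholder expansion for an agent command line; the real logic
# in harness/runner.py may differ.
def build_agent_command(command: str, args: list[str], prompt: str,
                        model: str, workspace: str) -> list[str]:
    values = {"prompt": prompt, "model": model, "workspace": workspace}
    return [command] + [arg.format(**values) for arg in args]


cmd = build_agent_command(
    "my-cli",
    ["--prompt", "{prompt}", "--model", "{model}"],
    prompt="Fix the failing CSV parser",
    model="my-model-name",
    workspace="/tmp/task-workspace",
)
# -> ["my-cli", "--prompt", "Fix the failing CSV parser", "--model", "my-model-name"]
```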
Each language needs a test runner entry:
```
test_runners:
  python:
    command: "python3 -m pytest {test_dir} -v --tb=short"
    pattern: "tests/"
  typescript:
    command: "cd {workspace} && npm install --quiet 2>/dev/null && npx jest {test_dir} --verbose"
    pattern: "tests/"
```
Results are saved to `results/<run-id>/`:

```
results/2026-02-18-154523/
  summary.json            # per-agent aggregate metrics
  claude-code/
    00-smoke-test.json    # per-task result
    ...
  codex/
    00-smoke-test.json
    ...
```
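For programmatic post-processing, a small, schema-agnostic way to inspect the latest run's output files (paths follow the layout above):

```python
import json
from pathlib import Path

# Inspect the most recent run; no particular field layout inside the JSON files is assumed.
latest = sorted(Path("results").iterdir())[-1]

summary = json.loads((latest / "summary.json").read_text())
print(json.dumps(summary, indent=2))

for agent_dir in sorted(p for p in latest.iterdir() if p.is_dir()):
    task_results = sorted(agent_dir.glob("*.json"))
    print(f"{agent_dir.name}: {len(task_results)} task result files")
```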
The terminal report shows a comparison table:
```
==============================================================
BENCHMARK RESULTS
==============================================================

TASK RESULTS
--------------------------------------------------------------
Task                           claude-code       codex
--------------------------------------------------------------
00-smoke-test                  PASS 2/2 20s      PASS 2/2 29s
01-python-bugfix-csv           PASS 8/8 29s      PASS 8/8 49s
02-c-bugfix-linkedlist         PASS 8/8 18s      PASS 8/8 57s
03-typescript-feature-table-f  PASS 13/13 18s    PASS 13/13 37s
04-python-feature-pagination   PASS 10/10 19s    PASS 10/10 59s
05-typescript-scratch-task-qu  PASS 8/8 56s      PASS 8/8 60s
06-cpp-scratch-lru-cache       PASS 9/9 56s      PASS 9/9 65s
07-python-refactor-monolith    PASS 12/12 20s    PASS 12/12 43s
08-typescript-refactor-callba  PASS 10/10 35s    PASS 10/10 59s
09-c-multifile-segfault        PASS 8/8 69s      PASS 8/8 66s
10-fullstack-angular-python    PASS 10/10 22s    PASS 10/10 44s

SUMMARY
---------------------------------------------------------
Metric                         claude-code       codex
---------------------------------------------------------
Tasks fully passed             11/11             11/11
Avg correctness                1.0               1.0
Avg speed (s)                  32.8              51.62
```
```
harness/
  config.py          # Configuration loading (AgentConfig, TestRunnerConfig)
  config.yaml        # Default agent and test runner definitions
  run.py             # Main benchmark orchestrator and CLI entry point
  runner.py          # Agent execution and workspace isolation
  task_loader.py     # Task discovery and metadata parsing
  test_executor.py   # Test execution and output parsing (pytest, jest, make)
  scoring.py         # Score aggregation and summary computation
  report.py          # Terminal report formatting
tasks/               # Benchmark task definitions
tests/               # Unit tests for the harness
docs/plans/          # Design and implementation documentation
results/             # Benchmark run output (gitignored)
```
To run the harness's own unit tests:

```
python3 -m pytest tests/ -v
```

To add a new task:

- Create a directory under `tasks/` (prefix with a number for ordering)
- Add `metadata.json`, e.g. `{ "language": "python", "category": "bugfix", "timeout_seconds": 120 }`. For multi-language tasks, add `"test_languages": ["python", "typescript"]`.
- Add `prompt.md` with the task description
- Add `repo/` with the source code
- Add `tests/` with the hidden test suite (see the sketch after this list)
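For illustration only, a hidden test file for a new Python task might look like the sketch below; the module and function names (`pagination`, `paginate`) are hypothetical, not taken from any existing task.

```python
# tests/test_pagination.py -- hypothetical hidden test for a new Python task.
# The module and function names are illustrative, not from this repository.
from pagination import paginate


def test_returns_requested_page():
    items = list(range(10))
    # Assuming 0-indexed pages: page 1 with a page size of 3 holds items 3, 4, 5.
    assert paginate(items, page=1, page_size=3) == [3, 4, 5]


def test_empty_input_yields_empty_page():
    assert paginate([], page=0, page_size=5) == []
```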
MIT