Skip to content

Commit aad5ec5

Browse files
heck yeah did a bunch of the inspect-ai for dafnybench code
1 parent 710fadf commit aad5ec5

File tree

14 files changed

+972
-18
lines changed

14 files changed

+972
-18
lines changed

.gitignore

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -20,3 +20,6 @@ wheels/
2020

2121
# Virtual environments
2222
.venv
23+
24+
# Environment variables
25+
.env

CLAUDE.md

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
66

77
This is a formal verification cookbook that bridges formal verification and agentic language systems. It contains:
88
- **book**: Jupyter Book documentation about formal verification agents (Dafny, Lean, RL)
9-
- **evals**: Evaluation framework using pydantic-ai
9+
- **evals**: Evaluation code to be imported/viewed in the ebook but also, given an API key, work e2e.
1010

1111
The project uses a uv workspace with two members (book, evals).
1212

@@ -63,10 +63,10 @@ uv sync
6363

6464
### Evals Package
6565

66-
- Entry point: `evals:main` defined in project.scripts
67-
- Uses pydantic-ai for LLM-based evaluations
68-
- Uses datasets library for data handling
69-
- Build backend: uv_build
66+
- in `evals.dafnybench`, we're working with `wendy-sun/DafnyBench` on huggingface.
67+
- `evals.dafnybench.inspect_ai` will do it with the `inspect_ai` framework
68+
- `evals.dafnybench.rawdog` will do it with pure python and the anthropic SDK.
69+
- in `evals.fvapps`, we're working with `quinn-dougherty/fvapps` on huggingface, which we'll do with `pydantic_ai`
7070

7171
### Book Architecture
7272

README.md

Lines changed: 11 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,16 @@
11
# Agent and eval recipes for the formal methodsititian
22

3-
### Building the Book
3+
## Running the code (`evals`)
4+
5+
I think the book is a valuable read if you're not literally poking at and running the code. The code exists to be excerpted in the book, but I also want it to run e2e for real.
6+
7+
Need `ANTHROPIC_API_KEY` to be set. Defaults are to run sonnet on not that many samples, so toying around with the ebook code should only cost like 5 bucks tops.
8+
9+
To run DafnyBench (the `inspect-ai` and `rawdog` samples), must have `dafny` installed. To run FVAPPS (the `pydantic-ai` samples), must have `elan` installed and I think run the initial toolchain configuration command.
10+
11+
## Building the book (`book`)
12+
13+
I don't know why you'd do this. It's on `recipes.for-all.dev`.
414

515
```bash
616
# Navigate to the book directory

book/01-introduction/00-ch0.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -25,11 +25,11 @@ So we will be working with the Lean prover, but not with math. This is an ideolo
2525

2626
### Who is it for
2727

28-
It's primarily for formal verification experts who want a piece of the evals and RL environments scene. People who already know about evals or RL envs are less likely to find new stuff here, as I'm mainly doing what I perceive to be standard wisdoms in the agentdev space.
28+
It's primarily for formal verification experts who want a piece of the evals and RL environments scene. People who already know about evals or RL envs are less likely to find new stuff here, as I'm mainly doing what I perceive to be standard wisdoms in the agentdev space. That's not entirely true, I'll be opinionated a bunch in what follows, but its mostly about stupid stuff like "`typer` is the correct way to make CLIs" and "`pydantic` types everywhere" and "Eun Sun Kim has the boldest rubato choices of any Parsifal conductor".
2929

3030
### Written mostly end of 2025
3131

32-
The world is moving quickly enough that I expect this book will be useful for 6 months tops, before the ground shifts under its feet.
32+
The world is moving quickly enough that I expect this book will be useful for 6 months, before the ground shifts under its feet.
3333

3434
### Who is the questgiver
3535

book/03-lean/01-ch1.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,2 @@
1+
# FVAPPS
2+

book/03-lean/02-ch2.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,2 @@
1+
# Pydantic AI
2+

book/03-lean/03-ch3.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,2 @@
1+
# With and without MCP
2+

evals/README.md

2.7 KB
Binary file not shown.

evals/pyproject.toml

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,9 @@ dependencies = [
1111
"pydantic>=2.12.4",
1212
"pydantic-ai>=1.19.0",
1313
"datasets>=4.4.1",
14+
"inspect-ai>=0.3.146",
15+
"typer>=0.20.0",
16+
"anthropic>=0.73.0",
1417
]
1518

1619
[project.scripts]

evals/src/evals/__init__.py

Lines changed: 97 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,2 +1,98 @@
1+
"""
2+
Evals CLI for formal verification benchmarks.
3+
"""
4+
5+
import typer
6+
from typing_extensions import Annotated
7+
8+
app = typer.Typer(help="Evaluation tools for formal verification benchmarks")
9+
10+
11+
@app.command()
12+
def solve(
13+
benchmark: Annotated[
14+
str,
15+
typer.Argument(help="Benchmark name (e.g., 'dafnybench')"),
16+
],
17+
framework: Annotated[
18+
str,
19+
typer.Option(
20+
"--framework",
21+
"-f",
22+
help="Framework to use: 'inspect' or 'rawdog'",
23+
),
24+
] = "inspect",
25+
max_attempts: Annotated[
26+
int | None,
27+
typer.Option(
28+
"--max-attempts",
29+
"-n",
30+
help="Maximum verification attempts with error feedback (default: let Inspect AI decide)",
31+
),
32+
] = None,
33+
model: Annotated[
34+
str | None,
35+
typer.Option(
36+
"--model",
37+
"-m",
38+
help="Model to evaluate (e.g., 'anthropic/claude-3-5-sonnet-20241022')",
39+
),
40+
] = None,
41+
limit: Annotated[
42+
int,
43+
typer.Option(
44+
"--limit",
45+
"-l",
46+
help="Limit number of samples to evaluate (default: 10, use -1 for all 782 samples)",
47+
),
48+
] = 10,
49+
) -> None:
50+
"""
51+
Run evaluation on a formal verification benchmark.
52+
53+
Examples:
54+
55+
# Run DafnyBench with Inspect AI (default: 10 samples, natural iteration)
56+
uv run solve dafnybench --framework inspect
57+
58+
# Run with all 782 samples
59+
uv run solve dafnybench --framework inspect --limit -1
60+
61+
# Run with explicit max attempts
62+
uv run solve dafnybench --framework inspect --max-attempts 3
63+
64+
# Run with specific model
65+
uv run solve dafnybench -f inspect -m anthropic/claude-3-5-sonnet-20241022
66+
"""
67+
if benchmark.lower() == "dafnybench":
68+
if framework.lower() == "inspect":
69+
from evals.dafnybench.inspect_ai import run_dafnybench_eval
70+
71+
# Convert limit=-1 to None (all samples)
72+
eval_limit = None if limit == -1 else limit
73+
74+
if max_attempts is not None:
75+
typer.echo(f"Running DafnyBench with Inspect AI (max_attempts={max_attempts}, limit={limit if limit != -1 else 'all'})...")
76+
else:
77+
typer.echo(f"Running DafnyBench with Inspect AI (natural iteration, limit={limit if limit != -1 else 'all'})...")
78+
run_dafnybench_eval(
79+
max_attempts=max_attempts,
80+
model=model,
81+
limit=eval_limit,
82+
)
83+
elif framework.lower() == "rawdog":
84+
typer.echo("rawdog framework not yet implemented", err=True)
85+
raise typer.Exit(code=1)
86+
else:
87+
typer.echo(f"Unknown framework: {framework}", err=True)
88+
typer.echo("Available frameworks: inspect, rawdog", err=True)
89+
raise typer.Exit(code=1)
90+
else:
91+
typer.echo(f"Unknown benchmark: {benchmark}", err=True)
92+
typer.echo("Available benchmarks: dafnybench", err=True)
93+
raise typer.Exit(code=1)
94+
95+
196
def main() -> None:
2-
print("Hello from evals!")
97+
"""Entry point for the CLI."""
98+
app()

0 commit comments

Comments
 (0)