for-all-dev
diff --git a/.gitignore b/.gitignore
Lines changed: 3 additions & 0 deletions
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 5 additions & 5 deletions
diff --git a/README.md b/README.md
Lines changed: 11 additions & 1 deletion
diff --git a/book/01-introduction/00-ch0.md b/book/01-introduction/00-ch0.md
Lines changed: 2 additions & 2 deletions
diff --git a/book/03-lean/01-ch1.md b/book/03-lean/01-ch1.md
Lines changed: 2 additions & 0 deletions
diff --git a/book/03-lean/02-ch2.md b/book/03-lean/02-ch2.md
Lines changed: 2 additions & 0 deletions
diff --git a/book/03-lean/03-ch3.md b/book/03-lean/03-ch3.md
Lines changed: 2 additions & 0 deletions
diff --git a/evals/README.md b/evals/README.md
2.7 KB
diff --git a/evals/pyproject.toml b/evals/pyproject.toml
Lines changed: 3 additions & 0 deletions
diff --git a/evals/src/evals/__init__.py b/evals/src/evals/__init__.py
Lines changed: 97 additions & 1 deletion
@@ -20,3 +20,6 @@ wheels/
 
 # Virtual environments
 .venv
+
+# Environment variables
+.env
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 This is a formal verification cookbook that bridges formal verification and agentic language systems. It contains:
 - **book**: Jupyter Book documentation about formal verification agents (Dafny, Lean, RL)
-- **evals**: Evaluation framework using pydantic-ai
+- **evals**: Evaluation code to be imported/viewed in the ebook that also, given an API key, works e2e.
 
 The project uses a uv workspace with two members (book, evals).
 
@@ -63,10 +63,10 @@ uv sync
 
 ### Evals Package
 
-- Entry point: `evals:main` defined in project.scripts
-- Uses pydantic-ai for LLM-based evaluations
-- Uses datasets library for data handling
-- Build backend: uv_build
+- in `evals.dafnybench`, we're working with `wendy-sun/DafnyBench` on huggingface.
+  - `evals.dafnybench.inspect_ai` will do it with the `inspect_ai` framework
+  - `evals.dafnybench.rawdog` will do it with pure python and the anthropic SDK.
+- in `evals.fvapps`, we're working with `quinn-dougherty/fvapps` on huggingface, which we'll do with `pydantic_ai`
 
 ### Book Architecture
 
 
@@ -1,6 +1,16 @@
 # Agent and eval recipes for the formal methodsititian
 
-### Building the Book
+## Running the code (`evals`)
+
+I think the book is a valuable read even if you're not literally poking at and running the code. The code exists to be excerpted in the book, but I also want it to run e2e for real.
+
+You need `ANTHROPIC_API_KEY` to be set. The defaults run sonnet on not that many samples, so toying around with the ebook code should only cost like 5 bucks tops.
+
+To run DafnyBench (the `inspect-ai` and `rawdog` samples), you must have `dafny` installed. To run FVAPPS (the `pydantic-ai` samples), you must have `elan` installed and, I think, have run the initial toolchain configuration command.
+
+## Building the book (`book`)
+
+I don't know why you'd do this. It's on `recipes.for-all.dev`.
 
 ```bash
 # Navigate to the book directory
 
@@ -25,11 +25,11 @@ So we will be working with the Lean prover, but not with math. This is an ideolo
 
 ### Who is it for
 
-It's primarily for formal verification experts who want a piece of the evals and RL environments scene. People who already know about evals or RL envs are less likely to find new stuff here, as I'm mainly doing what I perceive to be standard wisdoms in the agentdev space. 
+It's primarily for formal verification experts who want a piece of the evals and RL environments scene. People who already know about evals or RL envs are less likely to find new stuff here, as I'm mainly doing what I perceive to be standard wisdoms in the agentdev space. That's not entirely true: I'll be opinionated a bunch in what follows, but it's mostly about stupid stuff like "`typer` is the correct way to make CLIs" and "`pydantic` types everywhere" and "Eun Sun Kim has the boldest rubato choices of any Parsifal conductor".
 
 ### Written mostly end of 2025
 
-The world is moving quickly enough that I expect this book will be useful for 6 months tops, before the ground shifts under its feet. 
+The world is moving quickly enough that I expect this book will be useful for 6 months, before the ground shifts under its feet. 
 
 ### Who is the questgiver
 
 
@@ -0,0 +1,2 @@
+# FVAPPS
+
@@ -0,0 +1,2 @@
+# Pydantic AI
+
@@ -0,0 +1,2 @@
+# With and without MCP
+
@@ -11,6 +11,9 @@ dependencies = [
     "pydantic>=2.12.4",
     "pydantic-ai>=1.19.0",
     "datasets>=4.4.1",
+    "inspect-ai>=0.3.146",
+    "typer>=0.20.0",
+    "anthropic>=0.73.0",
 ]
 
 [project.scripts]
 
@@ -1,2 +1,98 @@
+"""
+Evals CLI for formal verification benchmarks.
+"""
+
+import typer
+from typing_extensions import Annotated
+
+app = typer.Typer(help="Evaluation tools for formal verification benchmarks")
+
+
+@app.command()
+def solve(
+    benchmark: Annotated[
+        str,
+        typer.Argument(help="Benchmark name (e.g., 'dafnybench')"),
+    ],
+    framework: Annotated[
+        str,
+        typer.Option(
+            "--framework",
+            "-f",
+            help="Framework to use: 'inspect' or 'rawdog'",
+        ),
+    ] = "inspect",
+    max_attempts: Annotated[
+        int | None,
+        typer.Option(
+            "--max-attempts",
+            "-n",
+            help="Maximum verification attempts with error feedback (default: let Inspect AI decide)",
+        ),
+    ] = None,
+    model: Annotated[
+        str | None,
+        typer.Option(
+            "--model",
+            "-m",
+            help="Model to evaluate (e.g., 'anthropic/claude-3-5-sonnet-20241022')",
+        ),
+    ] = None,
+    limit: Annotated[
+        int,
+        typer.Option(
+            "--limit",
+            "-l",
+            help="Limit number of samples to evaluate (default: 10, use -1 for all 782 samples)",
+        ),
+    ] = 10,
+) -> None:
+    """
+    Run evaluation on a formal verification benchmark.
+
+    Examples:
+
+        # Run DafnyBench with Inspect AI (default: 10 samples, natural iteration)
+        uv run solve dafnybench --framework inspect
+
+        # Run with all 782 samples
+        uv run solve dafnybench --framework inspect --limit -1
+
+        # Run with explicit max attempts
+        uv run solve dafnybench --framework inspect --max-attempts 3
+
+        # Run with specific model
+        uv run solve dafnybench -f inspect -m anthropic/claude-3-5-sonnet-20241022
+    """
+    if benchmark.lower() == "dafnybench":
+        if framework.lower() == "inspect":
+            from evals.dafnybench.inspect_ai import run_dafnybench_eval
+
+            # Convert limit=-1 to None (all samples)
+            eval_limit = None if limit == -1 else limit
+
+            if max_attempts is not None:
+                typer.echo(f"Running DafnyBench with Inspect AI (max_attempts={max_attempts}, limit={limit if limit != -1 else 'all'})...")
+            else:
+                typer.echo(f"Running DafnyBench with Inspect AI (natural iteration, limit={limit if limit != -1 else 'all'})...")
+            run_dafnybench_eval(
+                max_attempts=max_attempts,
+                model=model,
+                limit=eval_limit,
+            )
+        elif framework.lower() == "rawdog":
+            typer.echo("rawdog framework not yet implemented", err=True)
+            raise typer.Exit(code=1)
+        else:
+            typer.echo(f"Unknown framework: {framework}", err=True)
+            typer.echo("Available frameworks: inspect, rawdog", err=True)
+            raise typer.Exit(code=1)
+    else:
+        typer.echo(f"Unknown benchmark: {benchmark}", err=True)
+        typer.echo("Available benchmarks: dafnybench", err=True)
+        raise typer.Exit(code=1)
+
+
 def main() -> None:
-    print("Hello from evals!")
+    """Entry point for the CLI."""
+    app()
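
The README hunk in this PR lists the prerequisites for running the evals: `ANTHROPIC_API_KEY` must be set, `dafny` is needed for DafnyBench, and `elan` is needed for FVAPPS. A preflight check that `solve` could call before dispatching might look like the sketch below. It is not part of this PR: the function name `missing_prerequisites` and its `env` and `which` parameters are hypothetical, introduced only so the check can be exercised without touching the real environment.

```python
import os
import shutil


def missing_prerequisites(benchmark: str, env=None, which=shutil.which) -> list[str]:
    """Return human-readable descriptions of missing prerequisites.

    Hypothetical helper (not in the PR). `env` and `which` are injectable
    so the check can be tested without a real API key or installed tools.
    """
    env = os.environ if env is None else env
    problems = []

    # Per the README, the evals need an Anthropic API key.
    if not env.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY is not set")

    # Tool per benchmark, as described in the README:
    # dafny for DafnyBench, elan for FVAPPS.
    required_tool = {"dafnybench": "dafny", "fvapps": "elan"}.get(benchmark.lower())
    if required_tool is not None and which(required_tool) is None:
        problems.append(f"`{required_tool}` was not found on PATH")

    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites("dafnybench"):
        print(f"Missing prerequisite: {problem}")
```

Returning a list rather than raising keeps the helper independent of `typer`; the command could echo each entry with `err=True` and exit with a nonzero code, the same way it already handles an unknown benchmark.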