You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Add comprehensive explanation of extraction strategy v2 in
book/02-dafny-and-inspect/06-ch6.md to address the 'celebratory message'
gotcha shown in fig-celebratory.
The section covers:
- Problem statement: LLM celebrates success without code block
- V1 buggy approach: naive last-code-block extraction from final completion
- V2 solution: backtracking through message history to find most recent code
- Usage: how to configure extraction_strategy parameter
- General pattern: when backtracking is useful in tool-calling loops
Uses literalinclude directives to pull code directly from:
- evals/src/evals/dafnybench/inspect_ai/utils.py (extract_code_v1, extract_code_v2)
- evals/src/evals/dafnybench/inspect_ai/__init__.py (task configuration)
This ensures code examples stay in sync with implementation and matches
the documentation pattern used throughout chapter 5.
Co-Authored-By: Claude <noreply@anthropic.com>
Copy file name to clipboardExpand all lines: book/02-dafny-and-inspect/05-ch5.md
+2Lines changed: 2 additions & 0 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -32,6 +32,8 @@ In the parlance, some non-LLM process that forms a sensor-actuator pair for an L
32
32
:linenos:
33
33
```
34
34
35
+
One important thing to notice, which we'll discuss more in the next chapter, is use of the `raise` keyword. Here, we're using the notion of exception that means an error is _when things go as planned_, since we do not expect the agent to win on the first try, which may be offensive to some error handling purists. Inspect's tool call abstraction simply requires tools to be in error state whenever TODO .
36
+
35
37
`{:verify false}` is an escape hatch in Dafny, so we'd consider it cheating if it did that (line 28). We can execute `dafny` on a single file, and you want to use `with tempfile.NamedTemporaryFile` to do this so it cleans up later. Finally, we read off the exit code to see if the LLM was successful. By convention, exit code 0 is happiness, no error message, and nonzero exit code is unhappiness (though if you're the kind of person who's happy when the world gives you helpful pointers about how to fix your mistakes, you might not view this as unhappiness)[^1].
36
38
37
39
[^1]: Notice that we don't necessarily know whether the error message will go to stdout or stderr. Lots of tools aren't completely intuitive about this, where they'll put error messages in stdout and not use stderr for anything. It feels nice to make your subprocess handler agnostic and flexible, sometimes, even if for any _one_ tool it doesn't need to be, it might help with code reuse later.
When you're plumbing, you sometimes run into silly things
4
4
@@ -11,9 +11,9 @@ A language model is so excited that it solved the task that it passes its celebr
11
11
12
12
## Extraction Strategy v2
13
13
14
-
### [] TODO: audit this, as an LLM wrote it
14
+
After the agent successfully verifies code, it sometimes generates a celebratory message like "Perfect! The code now verifies successfully!" without a code block. Our initial extraction logic (`extract_code_v1`) simply grabbed the last code block from the final completion, which worked great until there was no code block in that final message.
15
15
16
-
The figure above shows a real problem we encountered: after the agent successfully verifies code, it sometimes generates a celebratory message like "Perfect! The code now verifies successfully!" without a code block. Our initial extraction logic (`extract_code_v1`) simply grabbed the last code block from the final completion—which worked great until there was no code block in that final message.
16
+
### TODO: what we really want this chapter to be about is increased gracefulness of extracting code from responses. how to link it to your intuitions about error handling.
17
17
18
18
The naive approach looked like this:
19
19
@@ -25,7 +25,7 @@ The naive approach looked like this:
25
25
:end-before: def extract_code_v2
26
26
```
27
27
28
-
When the agent outputs "Great! It worked!" without code, `extract_code_v1` fails to find any code blocks. The scorer then tries to verify this celebration text as Dafny code, which obviously fails—turning a successful verification into a recorded failure.
28
+
When the agent outputs "Great! It worked!" without code, `extract_code_v1` fails to find any code blocks. The scorer then tries to verify this touchdown dance as Dafny code, which obviously fails—turning a successful verification into a recorded failure.
29
29
30
30
The fix is **backtracking through message history**. Instead of only looking at the final completion, we walk backwards through all assistant messages until we find one containing code:
31
31
@@ -37,19 +37,13 @@ The fix is **backtracking through message history**. Instead of only looking at
37
37
:end-before: defextract_code(
38
38
```
39
39
40
-
This handles the celebration problemelegantly: if the current message has no code, we skip it and check the previous message. We eventually find the most recent code the agent actually generated.
40
+
This handles the celebration problem: if the current message has no code, we skip it and check the previous message. We eventually find the most recent code the agent actually generated.
41
41
42
42
You can switch between strategies using the `extraction_strategy` parameter in the task definition:
:caption: Configuring extraction strategy in task definition
47
-
:linenos:
48
-
:start-at: def dafnybench_task(
49
-
:end-at: extraction_strategy: ExtractionStrategy
50
44
```
51
-
52
-
We keep v1 available for pedagogical purposes—it demonstrates a real bug you might encounter when building verification agents. The unified interface lives in`utils.py:82-120`, dispatching to the appropriate implementation based on the `ExtractionStrategy` enum.
45
+
uv run agent dafnybench inspect --extraction-strategy v2
46
+
```
53
47
54
48
This pattern—searching backwards through conversation history—is useful whenever you have tool-calling loops where the final message might not contain the information you need. The verifier acts as a sensor providing feedback, and the agent's response to good news ("it worked!") canomittheactualartifactyou're evaluating.
0 commit comments