Conversation

@alexkeizer (Collaborator) commented Sep 25, 2025

This PR adds an #evaluation in $cmd command, similar to #guard_msgs in $cmd, which collects all messages generated by an arbitrary command and serializes them into a JSON object. The idea is that this becomes a building block for any kind of evaluation we'd like to run: existing tactics can emit JSONL without having to be modified.
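To sketch the intended shape of usage (the example goal below is illustrative, and the exact JSON schema is not final, so the field names hinted at here should not be taken literally):

-- hypothetical usage: any command may follow `in`; here an example closed by bv_decide
#evaluation in
example (x : BitVec 8) : x &&& x = x := by
  bv_decide
-- every message produced while elaborating the example is collected and
-- serialized into a single JSON object (one JSONL line per evaluated command);
-- the precise fields (severity, message text, position, ...) may differ from this sketch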

I was originally planning to use this to convert the existing LLVM evaluation to JSONL, but it seems that evaluation has fundamentally changed. @bollu, is the LLVM evaluation superseded by your evaluation?

TODO:

  • add a CI job which builds & tests EvaluationHarness

  • add the options suggested by @bollu, such as omitting message details and including walltime, plus some other options to make it easier to run tests (sketched below with placeholder names)
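For concreteness, invoking the command with such options might look roughly like this; the option identifiers below are placeholders, not the names used in the PR:

-- placeholder option names, purely illustrative
set_option evaluation.omitMessageDetails true in
set_option evaluation.includeWalltime true in
#evaluation in
example (x : BitVec 8) : x ||| x = x := by
  bv_decide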
@alexkeizer requested a review from bollu September 25, 2025 09:25
@alexkeizer marked this pull request as draft September 25, 2025 09:25
@alexkeizer (Collaborator, Author)

@bollu I added the options you mentioned; could you have a look and see whether this is usable in its current form, or whether we need more?


bv_decide solved 0 theorems.
bitwuzla solved 0 theorems.
bv_decide found 0 counterexamples.
bitwuzla found 0 counterexamples.
bv_decide only failed on 0 problems.
bitwuzla only failed on 0 problems.
both bitwuzla and bv_decide failed on 0 problems.
In total, bitwuzla saw 0 problems.
In total, bv_decide saw 0 problems.
ran rg 'LeanSAT provided a counter' | wc -l, this file found 0, rg found 0, SUCCESS
ran rg 'Bitwuzla provided a counter' | wc -l, this file found 0, rg found 0, SUCCESS
ran rg 'LeanSAT proved' | wc -l, this file found 0, rg found 0, SUCCESS
ran rg 'Bitwuzla proved' | wc -l, this file found 0, rg found 0, SUCCESS
The InstCombine benchmark contains 4525 theorems in total.
Saved dataframe at: /home/runner/work/lean-mlir/lean-mlir/bv-evaluation/raw-data/InstCombine/instcombine_ceg_data.csv
all_files_solved_bitwuzla_times_stddev avg: nan | stddev: nan
all_files_solved_bv_decide_times_stddev avg: nan | stddev: nan
all_files_solved_bv_decide_rw_times_stddev avg: nan | stddev: nan
all_files_solved_bv_decide_bb_times_stddev avg: nan | stddev: nan
all_files_solved_bv_decide_sat_times_stddev avg: nan | stddev: nan
all_files_solved_bv_decide_lratt_times_stddev avg: nan | stddev: nan
all_files_solved_bv_decide_lratc_times_stddev avg: nan | stddev: nan
mean of percentage stddev/av: nan%


@bollu (Collaborator) commented Sep 25, 2025

@alexkeizer No, the LLVM evaluation has not been changed by me. What's in the repo is the "latest". I attempted to improve it using Snakemake in #1549, but that covers only the Hacker's Delight portion, not the InstCombine portion.

@alexkeizer (Collaborator, Author)

> @alexkeizer No, the LLVM evaluation has not been changed by me. What's in the repo is the "latest". I attempted to improve it using Snakemake in #1549, but that covers only the Hacker's Delight portion, not the InstCombine portion.

So when I open, say, gzext_proof.lean, I see the following:

theorem test_sext_zext_thm (e : IntW 16) : sext 64 (zext 32 e) ⊑ zext 64 e := by
    simp_alive_undef
    simp_alive_ops
    simp_alive_case_bash
    simp_alive_split
    extract_goals
    all_goals sorry

I assumed the extract_goals there was your doing, and I wonder whether that is what broke the evaluation CI; notice how it now just reports 0 everywhere:

bv_decide solved 0 theorems.
bitwuzla solved 0 theorems.
bv_decide found 0 counterexamples.
bitwuzla found 0 counterexamples.
bv_decide only failed on 0 problems.
bitwuzla only failed on 0 problems.
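
If I understand the evaluation setup correctly (an assumption on my part, not a confirmed diagnosis), those counters only register when the proofs actually run the solvers, i.e. when a theorem ends in something like the following rather than extract_goals followed by all_goals sorry:

theorem test_sext_zext_thm (e : IntW 16) : sext 64 (zext 32 e) ⊑ zext 64 e := by
    simp_alive_undef
    simp_alive_ops
    simp_alive_case_bash
    simp_alive_split
    all_goals bv_decide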
