Conversation

@alexkeizer (Collaborator) commented Sep 25, 2025

This PR adds an #evaluation in $cmd command, similar to #guard_msgs in $cmd, which collects all messages generated by an arbitrary command and serializes them into a JSON object. The idea is that this becomes a building block for any kind of evaluation we'd like to run: existing tactics can emit JSONL without having to be modified.
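To sketch the intended shape of usage (the example goal below is illustrative, and the exact JSON schema is not final, so the field names hinted at here should not be taken literally):

-- hypothetical usage: any command may follow `in`; here an example closed by bv_decide
#evaluation in
example (x : BitVec 8) : x &&& x = x := by
  bv_decide
-- every message produced while elaborating the example is collected and
-- serialized into a single JSON object (one JSONL line per evaluated command);
-- the precise fields (severity, message text, position, ...) may differ from this sketch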

I was originally planning to use this to convert the existing LLVM evaluation to JSONL, but it seems that evaluation has fundamentally changed. @bollu, is the LLVM evaluation superseded by your evaluation?

TODO:

  • add a CI job which builds & tests EvaluationHarness

  • add the options suggested by @bollu, such as omitting message details and including walltime, plus some other options to make it easier to run tests (sketched below with placeholder names)
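For concreteness, invoking the command with such options might look roughly like this; the option identifiers below are placeholders, not the names used in the PR:

-- placeholder option names, purely illustrative
set_option evaluation.omitMessageDetails true in
set_option evaluation.includeWalltime true in
#evaluation in
example (x : BitVec 8) : x ||| x = x := by
  bv_decide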
@alexkeizer requested a review from bollu September 25, 2025 09:25
@alexkeizer marked this pull request as draft September 25, 2025 09:25
@alexkeizer (Collaborator, Author)

@bollu I added the options you mentioned; could you have a look and see whether this is usable in its current form, or whether we need more?


bv_decide solved 0 theorems.
bitwuzla solved 0 theorems.
bv_decide found 0 counterexamples.
bitwuzla found 0 counterexamples.
bv_decide only failed on 0 problems.
bitwuzla only failed on 0 problems.
both bitwuzla and bv_decide failed on 0 problems.
In total, bitwuzla saw 0 problems.
In total, bv_decide saw 0 problems.
ran rg 'LeanSAT provided a counter' | wc -l, this file found 0, rg found 0, SUCCESS
ran rg 'Bitwuzla provided a counter' | wc -l, this file found 0, rg found 0, SUCCESS
ran rg 'LeanSAT proved' | wc -l, this file found 0, rg found 0, SUCCESS
ran rg 'Bitwuzla proved' | wc -l, this file found 0, rg found 0, SUCCESS
The InstCombine benchmark contains 4525 theorems in total.
Saved dataframe at: /home/runner/work/lean-mlir/lean-mlir/bv-evaluation/raw-data/InstCombine/instcombine_ceg_data.csv
all_files_solved_bitwuzla_times_stddev avg: nan | stddev: nan
all_files_solved_bv_decide_times_stddev avg: nan | stddev: nan
all_files_solved_bv_decide_rw_times_stddev avg: nan | stddev: nan
all_files_solved_bv_decide_bb_times_stddev avg: nan | stddev: nan
all_files_solved_bv_decide_sat_times_stddev avg: nan | stddev: nan
all_files_solved_bv_decide_lratt_times_stddev avg: nan | stddev: nan
all_files_solved_bv_decide_lratc_times_stddev avg: nan | stddev: nan
mean of percentage stddev/av: nan%


@bollu (Collaborator) commented Sep 25, 2025

@alexkeizer No, the LLVM evaluation has not been changed by me. What's in the repo is the "latest". I attempted to improve it using Snakemake in #1549, but that covers only the Hacker's Delight portion, not the InstCombine portion.

@alexkeizer (Collaborator, Author)

> @alexkeizer No, the LLVM evaluation has not been changed by me. What's in the repo is the "latest". I attempted to improve it using Snakemake in #1549, but that covers only the Hacker's Delight portion, not the InstCombine portion.

So when I open, say, gzext_proof.lean, I see the following:

theorem test_sext_zext_thm (e : IntW 16) : sext 64 (zext 32 e) ⊑ zext 64 e := by
    simp_alive_undef
    simp_alive_ops
    simp_alive_case_bash
    simp_alive_split
    extract_goals
    all_goals sorry

I assumed the extract_goals there was your doing, and I wonder whether that is what broke the evaluation CI; notice how it now just reports 0 everywhere:

bv_decide solved 0 theorems.
bitwuzla solved 0 theorems.
bv_decide found 0 counterexamples.
bitwuzla found 0 counterexamples.
bv_decide only failed on 0 problems.
bitwuzla only failed on 0 problems.
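
If I understand the evaluation setup correctly (an assumption on my part, not a confirmed diagnosis), those counters only register when the proofs actually run the solvers, i.e. when a theorem ends in something like the following rather than extract_goals followed by all_goals sorry:

theorem test_sext_zext_thm (e : IntW 16) : sext 64 (zext 32 e) ⊑ zext 64 e := by
    simp_alive_undef
    simp_alive_ops
    simp_alive_case_bash
    simp_alive_split
    all_goals bv_decide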
