git clone https://github.com/Anjiang-Wei/SATBench.git
cd SATBench

conda create -y -n SATBench python=3.12
conda activate SATBench
pip install -r requirements.txt

export OPENAI_API_KEY=<your_openai_api_key>
export ANTHROPIC_API_KEY=<your_anthropic_api_key>
export TOGETHER_API_KEY=<your_together_api_key>

python step1.1_sat_problem_generation.py \
--output sat_problems.json \
--num_per_config 10 \
--max_clause_len 3 \
--sat_ratio 0.5 \
--dimensions "[[3,2],[3,3,2]]" \
--clauses "[4,5,6,7,8,9,10]"Key Arguments:
- --output: Output file for the generated SAT problems
- --num_per_config: Number of examples per configuration
- --dimensions: Variable dimensionality
- --clauses: Clause counts used to vary difficulty
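
The generation script itself is not reproduced here, but the underlying idea, sampling random CNF clauses and labeling each formula SAT or UNSAT, can be sketched in a few lines of Python. The snippet below is an illustrative stand-in rather than the repository's implementation; it brute-forces satisfiability, which is only practical for the small variable counts these configurations use.

```python
import itertools
import random

def random_cnf(num_vars, num_clauses, max_clause_len=3, seed=0):
    """Sample a random CNF formula as a list of clauses.

    A clause is a list of non-zero ints: +v means variable v appears
    positively, -v means it appears negated (DIMACS-style encoding).
    """
    rng = random.Random(seed)
    clauses = []
    for _ in range(num_clauses):
        k = rng.randint(1, max_clause_len)
        chosen = rng.sample(range(1, num_vars + 1), k)
        clauses.append([v if rng.random() < 0.5 else -v for v in chosen])
    return clauses

def is_satisfiable(num_vars, clauses):
    """Brute-force SAT check: try every assignment (fine for small num_vars)."""
    for bits in itertools.product([False, True], repeat=num_vars):
        assignment = dict(enumerate(bits, start=1))
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False

cnf = random_cnf(num_vars=6, num_clauses=8)
print(cnf, "SAT" if is_satisfiable(6, cnf) else "UNSAT")
```

Sweeping --clauses changes how constrained the formulas are, which is how the benchmark varies difficulty.
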
python step1.2_scenario_mapping_generation.py \
--input sat_problems.json \
--output scenario_and_mapping.jsonl

Key Arguments:
- --input: Input CNF problems
- --output: Output JSONL file with generated scenarios and variable meanings
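
Step 1.2 relies on an LLM to invent a scenario and give each boolean variable a natural-language meaning. A minimal sketch of that kind of call, using the OpenAI Python client, is shown below; the prompt text and the gpt-4o model choice are assumptions for illustration, not the repository's actual prompt.

```python
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

def describe_variables(num_vars):
    """Illustrative prompt: ask the model for a scenario and per-variable meanings."""
    prompt = (
        f"Invent a short everyday scenario involving {num_vars} boolean facts. "
        "Return one line per variable in the form 'x<i>: <meaning>'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption; any chat model would do for this sketch
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(describe_variables(3))
```
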
python step1.3_puzzle_generation.py \
--input scenario_and_mapping.jsonl \
--output puzzle_problems.jsonl \
--target_per_clause 5 \
--clause_values "[4,5,6]"

Key Arguments:
- --input: Scenario and mapping file
- --output: Final puzzle output
- --target_per_clause: Number of examples to generate per clause count
- --clause_values: Which clause counts to include in the final benchmark
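
Conceptually, puzzle generation turns each CNF clause into a natural-language condition using the variable meanings from the previous step. The helper below is a simplified, hypothetical rendering (the meanings mapping is invented for the example); in the pipeline the phrasing is produced automatically.

```python
def clause_to_condition(clause, meanings):
    """Render one CNF clause as an 'at least one of ...' condition.

    `meanings` maps variable index -> natural-language statement, e.g.
    {1: "Alice attends the meeting", 2: "the meeting is on Friday"}.
    """
    parts = []
    for lit in clause:
        text = meanings[abs(lit)]
        parts.append(text if lit > 0 else f"it is not the case that {text}")
    return "At least one of the following holds: " + "; or ".join(parts) + "."

meanings = {1: "Alice attends the meeting", 2: "the meeting is on Friday"}
print(clause_to_condition([1, -2], meanings))
```
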
You can also access the released version on Hugging Face:
👉 https://huggingface.co/datasets/LLM4Code/SATBench
Each puzzle includes:
- A natural language scenario + variable mapping
- A list of logical conditions
- A final question
- Ground-truth SAT/UNSAT label
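
For quick inspection, the released data can be loaded with the datasets library. The field names probed below (scenario, conditions, question, label) mirror the list above but are assumptions; check them against the actual dataset schema.

```python
from datasets import load_dataset

# Pull the released benchmark from the Hugging Face Hub.
ds = load_dataset("LLM4Code/SATBench")
print(ds)  # inspect the available splits and columns

split = next(iter(ds))   # first split name; actual split names may differ
example = ds[split][0]   # first example as a dict
for field in ("scenario", "conditions", "question", "label"):  # assumed field names
    print(field, "->", example.get(field))
```
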
We provide scripts to evaluate LLMs on SATBench directly from the Hugging Face dataset.
python step2_evaluation.py \
--mode sat \
--eval_model openai_gpt-4o \
--limit 10

python step2_evaluation.py \
--mode trace \
--eval_model openai_gpt-4o

Modes:
- --mode sat: Run SAT/UNSAT prediction for each puzzle.
- --mode trace: Judge reasoning trace quality (after SAT/UNSAT evaluation).
Arguments:
- --limit: Max number of examples to evaluate (optional).
- --eval_model: Model used to answer puzzles (e.g., openai_gpt-4o, together_Qwen3-14B).
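
One detail worth noting for --mode sat is that a model's free-form answer has to be reduced to a SAT/UNSAT verdict. The parser below is a hedged sketch of one common approach (take the last explicit mention in the reply), not the logic used by step2_evaluation.py.

```python
import re

def parse_verdict(response_text):
    """Return 'SAT' or 'UNSAT' based on the last explicit mention, else None.

    Illustrative parser only; the repository's answer extraction may differ.
    """
    matches = re.findall(r"\bUNSAT(?:ISFIABLE)?\b|\bSAT(?:ISFIABLE)?\b",
                         response_text, flags=re.IGNORECASE)
    if not matches:
        return None
    return "UNSAT" if matches[-1].upper().startswith("UNSAT") else "SAT"

print(parse_verdict("The constraints conflict, so the answer is UNSATISFIABLE."))
```
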
After evaluation, generate per-bucket statistics:
python step3_stats.py

Or for specific models:
python step3_stats.py --models openai_gpt-4o

This prints accuracy broken down by SAT/UNSAT label and by difficulty (easy/medium/hard).
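
The bucketing itself is straightforward; the sketch below shows one way to compute it, assuming per-example result records with label, difficulty, and prediction fields (these field names are assumptions about the saved output, not guaranteed by step3_stats.py).

```python
from collections import defaultdict

def bucket_accuracy(records):
    """Accuracy per (ground-truth label, difficulty) bucket.

    Each record is assumed to look like:
    {"label": "SAT", "difficulty": "easy", "prediction": "SAT"}.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        key = (r["label"], r["difficulty"])
        total[key] += 1
        correct[key] += int(r["prediction"] == r["label"])
    return {key: correct[key] / total[key] for key in total}

records = [
    {"label": "SAT", "difficulty": "easy", "prediction": "SAT"},
    {"label": "UNSAT", "difficulty": "hard", "prediction": "SAT"},
]
print(bucket_accuracy(records))
```
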
SATBench supports evaluation for a wide range of models via the OpenAI, Anthropic, and Together APIs:
- openai_gpt-4o
- openai_gpt-4o-mini
- openai_o4-mini
- together_deepseek-ai_DeepSeek-R1
- together_deepseek-ai_DeepSeek-V3
- anthropic_claude-3-7-sonnet-20250219
- together_Qwen_Qwen3-235B-A22B-fp8-tput
- together_Qwen_QwQ-32B
- together_deepseek-ai_DeepSeek-R1-Distill-Qwen-14B
- together_meta-llama_Llama-4-Scout-17B-16E-Instruct
- together_meta-llama_Llama-4-Maverick-17B-128E-Instruct-FP8
- together_meta-llama_Llama-3.3-70B-Instruct-Turbo
- together_meta-llama_Meta-Llama-3.1-70B-Instruct-Turbo
- together_meta-llama_Meta-Llama-3.1-8B-Instruct-Turbo
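
These identifiers follow a provider-prefixed convention (openai_, anthropic_, together_), which makes routing to the right API client straightforward. The splitter below is an illustrative sketch of that convention; in particular, the assumption that underscores in Together names stand in for the / between organization and model is inferred from the list above, not taken from the repository's code.

```python
def split_model_id(model_id):
    """Split an identifier like 'together_deepseek-ai_DeepSeek-R1' into
    (provider, model name). Assumes '_' in the remainder stands in for '/'.
    """
    provider, _, rest = model_id.partition("_")
    if provider not in {"openai", "anthropic", "together"}:
        raise ValueError(f"Unknown provider prefix: {provider}")
    # Together model names carry an org prefix, e.g. 'meta-llama/...'.
    model = rest.replace("_", "/") if provider == "together" else rest
    return provider, model

print(split_model_id("openai_gpt-4o"))
print(split_model_id("together_deepseek-ai_DeepSeek-R1"))
```
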
If you use this benchmark in your research, please cite:
@article{wei2025satbench,
title={SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas},
author={Wei, Anjiang and Wu, Yuheng and Wan, Yingjia and Suresh, Tarun and Tan, Huanmi and Zhou, Zhanke and Koyejo, Sanmi and Wang, Ke and Aiken, Alex},
journal={arXiv preprint arXiv:2505.14615},
year={2025}
}