DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation
Authors: Wenhao Hu, Jinhao Duan, Chunchen Wei, Li Zhang, Yue Zhang, and Kaidi Xu
- [2025-06-29]: All code and datasets have been officially released!
- [2025-05-16]: Our paper has been accepted to Findings of ACL 2025.
- [2025-03-13]: Our paper has been released. Read it here
This repo contains code for our paper: DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation. DynaCode is a dynamic, complexity-aware code generation benchmark that evaluates LLMs across up to 189 million diverse problems by incorporating code complexity and call-graph structures, revealing how performance degrades as complexity increases and offering insights into LLMs' reasoning over nested code.
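For intuition, each DynaCode problem composes simple code units into a call graph, so solving it requires reasoning over nested calls. The sketch below is purely illustrative (the function names and task format are ours, not the benchmark's exact format):

# Illustrative only: a two-node call graph where the root unit calls a leaf unit.
def filter_even(nums):
    """Leaf unit: keep the even numbers."""
    return [n for n in nums if n % 2 == 0]

def sum_of_even(nums):
    """Root unit: nested call into the leaf unit, then aggregate."""
    return sum(filter_even(nums))

assert sum_of_even([1, 2, 3, 4]) == 6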
Dependencies can be installed by running:
conda create -n DynaCode python=3.9
conda activate DynaCode
pip install -r requirements.txt
DynaCode supports API-based LLM inference: any endpoint reachable through an API key and base URL (as passed to generate_code.py) can be used.
You can also easily modify the code to integrate your own LLM inference backend.
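As a rough sketch of what such a backend can look like, assuming an OpenAI-compatible endpoint (the function name generate_one is a placeholder, not the repository's actual interface):

from openai import OpenAI

def generate_one(prompt, model, api_key, base_url):
    # Placeholder backend: one chat-completion call against an OpenAI-compatible API.
    client = OpenAI(api_key=api_key, base_url=base_url)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content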
To run evaluation on the default benchmark:
cd Evaluate
Generate code solutions using:
python generate_code.py \
--model "LLM_Model_Name" \
--api_key "API" \
--base_url "URL" \
--benchmark_root ./Benchmark \
--output_root ./LLM_code \
--max_workers 50 \
--units 1 2 3 4 \
--graphs 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
This will create an ./LLM_code directory containing the generated code solutions.
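The --max_workers flag sets how many generation requests run in parallel. Conceptually, the dispatch looks something like the simplified sketch below (not the script's actual code; generate_solution stands in for a single model call):

from concurrent.futures import ThreadPoolExecutor, as_completed

def generate_solution(prompt):
    # Stand-in for one API call (see the backend sketch above).
    return "# solution for: " + prompt

def generate_all(prompts, max_workers=50):
    # Dispatch up to max_workers requests at a time and keep results in input order.
    results = [None] * len(prompts)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(generate_solution, p): i for i, p in enumerate(prompts)}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results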
To evaluate these solutions:
./eval/eval.sh
The script will run for approximately 5–30 minutes.
Upon completion, the evaluation results (Pass@1) will be saved to pass1_summary.csv.
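Pass@1 is the fraction of problems whose generated solution passes all unit tests. For reference, the standard unbiased pass@k estimator of Chen et al. (2021) reduces to this plain pass rate when a single sample is drawn per problem; the snippet below is a reference sketch, not the evaluation code used by eval.sh:

from math import comb

def pass_at_k(n, c, k):
    # n: samples per problem, c: samples that pass, k: evaluation budget.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=1, c=1, k=1))  # 1.0 -- with one sample, pass@1 is just the pass rate.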
To create a new benchmark, run the benchmark generation script:
cd Generate_benchmark
Then execute the script below (approximate time: 10–60 minutes, depending on size):
./run_all_units.sh
To generate a different benchmark set, modify the random seed in generate_original_benchmark.py. This allows you to produce diverse sets of tasks with the same configuration.
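For example (a hypothetical snippet; the variable names do not match generate_original_benchmark.py), changing the value passed to the random seed yields a different sample of tasks under otherwise identical settings:

import random

SEED = 42  # edit this value to produce a different benchmark set
random.seed(SEED)

# Hypothetical illustration: the seed determines which candidate tasks are sampled.
candidate_units = ["unit_task_%d" % i for i in range(1000)]
selected = random.sample(candidate_units, k=250)  # matches the default --size 250
print(selected[:3])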
#!/bin/bash
# Define the list of unit complexities to process
units=(1 2 3 4)
# Step 1: Generate the original benchmark data with specified size (default: 250)
echo "📌 Step 1: Generating original benchmark data"
python generate_original_benchmark.py --size 250
# Step 2: Filter each unit in parallel
echo "📌 Step 2: Filtering benchmark data for each unit..."
for unit in "${units[@]}"; do
  (
    cd "./filtered_benchmark/filter_unit${unit}" || { echo "❌ Failed to cd into filter_unit${unit}"; exit 1; }
    echo "🚀 Running filter_unit${unit}.py ..."
    python -u "filter_unit${unit}.py" > "../log_filter_unit${unit}.out" 2>&1
  ) &
done
# Wait for all filtering jobs to finish
wait
# Step 3: Save the final benchmark set after filtering (default: 100 tasks per graph)
echo "📌 Step 3: Saving the final filtered benchmark..."
python ./filtered_benchmark/save_benchmark.py --size 100
echo "✅ All scripts completed successfully."
Once complete, replace the benchmark in the Evaluate folder with your new version:
mv ./final_benchmark ./Evaluate/Benchmark
Now you're ready to run inference and evaluation on your custom benchmark 🎯
If this repo is helpful, please cite our work as:
@inproceedings{hu2025dynacode,
  title={DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation},
  author={Hu, Wenhao and Duan, Jinhao and Wei, Chunchen and Zhang, Li and Zhang, Yue and Xu, Kaidi},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2025},
  year={2025}
}