
DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation

Authors: Wenhao Hu, Jinhao Duan, Chunchen Wei, Li Zhang, Yue Zhang, and Kaidi Xu

(Figure: overview of the DynaCode framework)

📚 Table of Contents

  • 📢 News
  • 📙 Overview
  • 🔥 Quick Start
  • 📜 Citation
  • 🙏 Acknowledgement

📢 News

  • [2025-06-29]: All code and datasets have been officially released!
  • [2025-05-16]: Our paper has been accepted to Findings of ACL 2025.
  • [2025-03-13]: Our paper has been released. Read it here

📙 Overview

This repo contains the code for our paper "DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation". DynaCode is a dynamic, complexity-aware code generation benchmark that combines code complexity with call-graph structures to build 189 million diverse problems. Evaluating LLMs on it reveals clear performance degradation as complexity increases and offers insight into how models reason over nested code.
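As a toy illustration of the nested, call-graph-style tasks DynaCode targets (this example is ours, not drawn from the benchmark), consider two simple unit problems composed so that one function's output feeds another:

# Toy illustration (not an actual DynaCode task): two unit problems
# composed via a simple call graph, where one function's output
# feeds the next.

def count_vowels(text: str) -> int:
    """Unit problem 1: count the vowels in a string."""
    return sum(ch in "aeiouAEIOU" for ch in text)

def is_prime(n: int) -> bool:
    """Unit problem 2: check whether an integer is prime."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def composed_task(text: str) -> bool:
    """Nested task: is the vowel count of the input a prime number?"""
    return is_prime(count_vowels(text))

print(composed_task("benchmark"))  # 2 vowels -> prime -> True

Solving the composed task requires reasoning across the call graph, not just solving each unit in isolation.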

🔥 Quick Start

🛠️ Environment Setup

Dependencies can be installed by running:

conda create -n DynaCode python=3.9
conda activate DynaCode
pip install -r requirements.txt

🚀 LLM Inference

DynaCode supports API-based LLM inference; the model name, API key, and base URL are passed on the command line (see below).

You can also easily modify the code to integrate your own LLM inference backend.
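For reference, here is a minimal sketch of what such a backend call might look like, assuming an OpenAI-compatible endpoint and the openai Python client; the function name and prompt handling are illustrative, not the exact interface used in generate_code.py:

# Minimal sketch of an API-based inference call.
# Assumptions: an OpenAI-compatible endpoint and the `openai` Python client;
# the function name and prompt handling are illustrative only.
from openai import OpenAI

def generate_solution(prompt: str, model: str, api_key: str, base_url: str) -> str:
    client = OpenAI(api_key=api_key, base_url=base_url)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content

# Example usage (placeholder values, matching the CLI flags below):
# code = generate_solution("Write a Python function ...", "LLM_Model_Name", "API", "URL")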

🧪 Evaluate LLM Performance

To run evaluation on the default benchmark:

cd Evaluate

Generate code solutions using:

python generate_code.py \
  --model "LLM_Model_Name" \
  --api_key "API" \
  --base_url "URL" \
  --benchmark_root ./Benchmark \
  --output_root ./LLM_code \
  --max_workers 50 \
  --units 1 2 3 4 \
  --graphs 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

This will create an ./LLM_code directory with generated code solutions.
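The --units and --graphs flags select which unit-complexity levels (1–4) and call-graph structures (1–16) to include; passing a subset evaluates only that slice of the benchmark.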

To evaluate these solutions:

./eval/eval.sh

The script will run for approximately 5–30 minutes. Upon completion, the evaluation results (Pass@1) will be saved to pass1_summary.csv.
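For reference, Pass@1 here is the fraction of problems whose generated solution passes all of its test cases. A minimal sketch of the computation, assuming a hypothetical per-task pass/fail mapping (the actual output format of eval.sh may differ):

# Hypothetical sketch of the Pass@1 computation: the fraction of tasks
# whose generated solution passes all unit tests. The `results` mapping
# below is assumed for illustration; eval.sh may store results differently.
results = {
    "task_001": True,   # solution passed all tests
    "task_002": False,  # solution failed at least one test
    "task_003": True,
}

pass_at_1 = sum(results.values()) / len(results)
print(f"Pass@1 = {pass_at_1:.2%}")  # Pass@1 = 66.67%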

🧬 Generate a New Benchmark

To create a new benchmark, run the benchmark generation script:

cd Generate_benchmark

Then execute the script below (approximate time: 10–60 minutes, depending on size):

./run_all_units.sh 

To generate a different benchmark set, modify the random seed in generate_original_benchmark.py; this lets you produce diverse sets of tasks from the same configuration (a minimal sketch of seed handling follows the listing below). For reference, run_all_units.sh performs the following steps:

#!/bin/bash

# Define the list of unit complexities to process
units=(1 2 3 4)

# Step 1: Generate the original benchmark data with specified size (default: 250)
echo "📌 Step 1: Generating original benchmark data"
python generate_orignal_benchmark.py --size 250

# Step 2: Filter each unit in parallel
echo "📌 Step 2: Filtering benchmark data for each unit..."
for unit in "${units[@]}"; do
  (
    cd "./filtered_benchmark/filter_unit${unit}" || { echo "❌ Failed to cd into filter_unit${unit}"; exit 1; }
    echo "🚀 Running filter_unit${unit}.py ..."
    python -u "filter_unit${unit}.py" > "../log_filter_unit${unit}.out" 2>&1
  ) &
done

# Wait for all filtering jobs to finish
wait

# Step 3: Save the final benchmark set after filtering (default: 100 tasks per graph)
echo "📌 Step 3: Saving the final filtered benchmark..."
python ./filtered_benchmark/save_benchmark.py --size 100

echo "✅ All scripts completed successfully."

Once complete, replace the benchmark in the Evaluate folder with your new version (remove or back up the existing ./Evaluate/Benchmark first, otherwise mv will nest the new folder inside it):

mv ./final_benchmark ./Evaluate/Benchmark

Now you're ready to run inference and evaluation on your custom benchmark 🎯

📜 Citation

If this repo is helpful, please cite our work as:

@inproceedings{hu2025dynacode,
  title={{DynaCode}: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation},
  author={Hu, Wenhao and Duan, Jinhao and Wei, Chunchen and Zhang, Li and Zhang, Yue and Xu, Kaidi},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2025},
  year={2025}
}

🙏 Acknowledgement
