DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation
Authors: Wenhao Hu, Jinhao Duan, Chunchen Wei, Li Zhang, Yue Zhang, and Kaidi Xu
- [2025-06-29]: All code and datasets have been officially released!
- [2025-05-16]: Our paper has been accepted to Findings of ACL 2025.
- [2025-03-13]: Our paper has been released. Read it here
This repo contains code for our paper: DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation. DynaCode is a dynamic, complexity-aware code generation benchmark that evaluates LLMs across up to 189 million diverse problems by incorporating code complexity and call-graph structures, revealing how performance degrades as complexity increases and offering insights into LLMs' reasoning over nested code.
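For intuition, each DynaCode problem composes simple code units into a call graph, so solving it requires reasoning over nested calls. The sketch below is purely illustrative (the function names and task format are ours, not the benchmark's exact format):

# Illustrative only: a two-node call graph where the root unit calls a leaf unit.
def filter_even(nums):
    """Leaf unit: keep the even numbers."""
    return [n for n in nums if n % 2 == 0]

def sum_of_even(nums):
    """Root unit: nested call into the leaf unit, then aggregate."""
    return sum(filter_even(nums))

assert sum_of_even([1, 2, 3, 4]) == 6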
Dependencies can be installed by running:
conda create -n DynaCode python=3.9
conda activate DynaCode
pip install -r requirements.txt
DynaCode supports API-based LLM inference: any endpoint reachable through an API key and base URL (as passed to generate_code.py) can be used.
You can also easily modify the code to integrate your own LLM inference backend.
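As a rough sketch of what such a backend can look like, assuming an OpenAI-compatible endpoint (the function name generate_one is a placeholder, not the repository's actual interface):

from openai import OpenAI

def generate_one(prompt, model, api_key, base_url):
    # Placeholder backend: one chat-completion call against an OpenAI-compatible API.
    client = OpenAI(api_key=api_key, base_url=base_url)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content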
To run evaluation on the default benchmark:
cd Evaluate
Generate code solutions using:
python generate_code.py \
--model "LLM_Model_Name" \
--api_key "API" \
--base_url "URL" \
--benchmark_root ./Benchmark \
--output_root ./LLM_code \
--max_workers 50 \
--units 1 2 3 4 \
--graphs 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
This will create an ./LLM_code directory containing the generated code solutions.
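The --max_workers flag sets how many generation requests run in parallel. Conceptually, the dispatch looks something like the simplified sketch below (not the script's actual code; generate_solution stands in for a single model call):

from concurrent.futures import ThreadPoolExecutor, as_completed

def generate_solution(prompt):
    # Stand-in for one API call (see the backend sketch above).
    return "# solution for: " + prompt

def generate_all(prompts, max_workers=50):
    # Dispatch up to max_workers requests at a time and keep results in input order.
    results = [None] * len(prompts)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(generate_solution, p): i for i, p in enumerate(prompts)}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results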
To evaluate these solutions:
./eval/eval.sh
The script will run for approximately 5–30 minutes.
Upon completion, the evaluation results (Pass@1) will be saved to pass1_summary.csv.
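Pass@1 is the fraction of problems whose generated solution passes all unit tests. For reference, the standard unbiased pass@k estimator of Chen et al. (2021) reduces to this plain pass rate when a single sample is drawn per problem; the snippet below is a reference sketch, not the evaluation code used by eval.sh:

from math import comb

def pass_at_k(n, c, k):
    # n: samples per problem, c: samples that pass, k: evaluation budget.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=1, c=1, k=1))  # 1.0 -- with one sample, pass@1 is just the pass rate.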
To create a new benchmark, run the benchmark generation script:
cd Generate_benchmark
Then execute the script below (approximate time: 10–60 minutes, depending on size):
./run_all_units.sh
To generate a different benchmark set, modify the random seed in generate_original_benchmark.py. This allows you to produce diverse sets of tasks with the same configuration.
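For example (a hypothetical snippet; the variable names do not match generate_original_benchmark.py), changing the value passed to the random seed yields a different sample of tasks under otherwise identical settings:

import random

SEED = 42  # edit this value to produce a different benchmark set
random.seed(SEED)

# Hypothetical illustration: the seed determines which candidate tasks are sampled.
candidate_units = ["unit_task_%d" % i for i in range(1000)]
selected = random.sample(candidate_units, k=250)  # matches the default --size 250
print(selected[:3])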
#!/bin/bash
# Define the list of unit complexities to process
units=(1 2 3 4)
# Step 1: Generate the original benchmark data with specified size (default: 250)
echo "📌 Step 1: Generating original benchmark data"
python generate_original_benchmark.py --size 250
# Step 2: Filter each unit in parallel
echo "📌 Step 2: Filtering benchmark data for each unit..."
for unit in "${units[@]}"; do
  (
    cd "./filtered_benchmark/filter_unit${unit}" || { echo "❌ Failed to cd into filter_unit${unit}"; exit 1; }
    echo "🚀 Running filter_unit${unit}.py ..."
    python -u "filter_unit${unit}.py" > "../log_filter_unit${unit}.out" 2>&1
  ) &
done
# Wait for all filtering jobs to finish
wait
# Step 3: Save the final benchmark set after filtering (default: 100 tasks per graph)
echo "📌 Step 3: Saving the final filtered benchmark..."
python ./filtered_benchmark/save_benchmark.py --size 100
echo "✅ All scripts completed successfully."
Once complete, replace the benchmark in the Evaluate folder with your new version:
mv ./final_benchmark ./Evaluate/Benchmark
Now you're ready to run inference and evaluation on your custom benchmark 🎯
If this repo is helpful, please cite our work as:
@inproceedings{hu2025dynacode,
  title={DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation},
  author={Hu, Wenhao and Duan, Jinhao and Wei, Chunchen and Zhang, Li and Zhang, Yue and Xu, Kaidi},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2025},
  year={2025}
}