Automated Generation of Issue-Reproducing Tests by Combining LLMs and Search-Based Testing

This repository contains the code for the paper
Automated Generation of Issue-Reproducing Tests by Combining LLMs and Search-Based Testing
accepted to ASE '25.

Overview Figure

Setup

This repository is tested on and recommended for Ubuntu 22.04 and macOS (15.6 or newer) with Python 3.12. Docker is required because the tests are run in an isolated Docker environment, so ensure the daemon is running before proceeding. Detailed instructions on how to install Docker can be found here.

On Ubuntu, start by installing the following system packages:

sudo apt update
sudo apt install -y python3.12-dev build-essential git pkg-config

Create a Python 3.12 virtual environment and install the required packages:

python3.12 -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt

This repo also uses LLMs: the default gpt-4o requires an OpenAI key, while llama/deepseek require a Groq key. You can set your keys like this:

export OPENAI_API_KEY="sk-XXXX"   # required
export GROQ_API_KEY="gsk-YYYY"    # optional (for ablation studies with LLaMA/DeepSeek)

A less secure but more convenient alternative is to save the keys in keys.json. The code first looks for each key in the environment variables and, if it is not found there, falls back to keys.json.
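
A rough sketch of this lookup order (not the repository's exact code; the keys.json layout shown in the comment is an assumption):

import json
import os

def get_api_key(name: str, keys_file: str = "keys.json") -> str | None:
    """Return the key from the environment, falling back to keys.json."""
    key = os.environ.get(name)
    if key:
        return key
    try:
        with open(keys_file) as f:
            # assumed layout: {"OPENAI_API_KEY": "sk-XXXX", "GROQ_API_KEY": "gsk-YYYY"}
            return json.load(f).get(name)
    except FileNotFoundError:
        return None

# Example: get_api_key("OPENAI_API_KEY")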

Running BLAST

BLAST is implemented end-to-end in the script ./run_blast.sh, which also contains information about the available CLI options. Make the script executable by running:

chmod +x run_blast.sh

Then run BLAST:

./run_blast.sh <mylabel>

For debug mode (e.g., quick start, local development), use: sh ./run_blast.sh <mylabel> 1. To run the LLM component or the SBST component separately, see Ablations.

Running the Baselines

To run the two baselines of Table I, i.e., ZeroShot and AutoTDD, follow the instructions in baselines/README.md.

Ablations

The script run_blast.sh is customizable, allowing for easy ablation studies such as the ones reported in the paper (and more).

Table I

To run BLAST with models other than the default gpt-4o, simply add the flag --model llama-3.3-70b-versatile or --model deepseek-r1-distill-llama-70b to the first two script calls in run_blast.sh.

Table II

To run only the LLM component of BLAST, simply comment out the first step of run_blast.sh.

The input combinations C1-C7 of Table II can be reproduced by switching the boolean inputs of the build_prompt() function in run_llm_component.py. For example, to reproduce combination C1, set include_issue_description=True and include_predicted_test_file=True, with the remaining parameters set to False.
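
A minimal sketch of the C1 combination, using only the two parameter names listed above (the remaining boolean inputs of build_prompt() are indicated by a comment rather than by name):

# Sketch only: the C1 input combination of Table II as build_prompt() keyword arguments.
# Only the two names below are documented in this README; every other boolean input of
# build_prompt() in run_llm_component.py is set to False for C1.
c1_flags = {
    "include_issue_description": True,
    "include_predicted_test_file": True,
    # ...all remaining boolean inputs: False
}
# Hypothetical call shape: prompt = build_prompt(**c1_flags)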

Table III

To run only the SBST component of BLAST, simply comment out steps 2 and 3 of run_blast.sh. You can vary the time budget via the CLI argument --budget_seconds of run_sbst_component.py (e.g., --budget_seconds 6).

This will run Pynguin on the PyngBench dataset that we introduced in Section V.B (Pynguin-Compatible Dataset), which contains the 113 instances listed in pyngbench_ids.txt. These instances were obtained by running run_sbst_component.py on all instances (e.g., by removing the filtering in line 387) and keeping only those where Pynguin ran successfully. run_sbst_component.py can be used for further experimentation with applying Pynguin to issue-reproducing test generation.
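
A minimal sketch of this filtering, assuming pyngbench_ids.txt contains one instance ID per line (illustrative only, not the code in run_sbst_component.py):

import pathlib

# Load the 113 PyngBench instance IDs (assumed format: one ID per line).
pyngbench_ids = {
    line.strip()
    for line in pathlib.Path("pyngbench_ids.txt").read_text().splitlines()
    if line.strip()
}

def is_pyngbench_instance(instance_id: str) -> bool:
    """True if the instance belongs to the Pynguin-compatible subset."""
    return instance_id in pyngbench_ids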

Trivial Instances of SWTBench/TDDBench

In Section V.B (Data Cleaning) of our paper, we manually analyzed instances of TDDBench/SWTBench to check whether they are trivial for the test generation task, as in Fig. 5. We performed this manual step because TDDBench and SWTBench are adaptations of SWEBench, a benchmark designed for patch generation rather than test generation.

Our analysis revealed 23 instances where the test is included in the issue description; we marked these as trivial and excluded them from our analysis. The instance IDs can be found in trivial_ids.txt, and the corresponding PR URLs are available by expanding the list below. Each PR is linked to an issue that contains the code to reproduce the issue.

Click to expand PR URLs

Manual Analysis of Failures

In Section VI.C (Analysis of Failure Cases), we perform a card sorting of cases where BLAST failed to generate a fail-to-pass test, in an attempt to better understand BLAST's limitations. The rationale behind the card sorting can be found in manual_sorting_of_failures.xlsx.

In-Vivo Evaluation with Mozilla

In Section VII, we evaluated BLAST by deploying it in 3 open-source Mozilla repositories. The source code of the bot that runs BLAST whenever a new PR is opened can be found under github_bot/, along with instructions for installing the bot so that it listens to your own repository.

The feedback we received from the developers on the 11 fail-to-pass tests generated by BLAST can be found in github_bot/Developer Feedback.xlsx; the first column also contains the link to the PR where each test was proposed.

Citation

@inproceedings{blast2025,
  title     = {Automated Generation of Issue-Reproducing Tests by Combining LLMs and Search-Based Testing},
  author    = {Konstantinos Kitsios and Marco Castelluccio and Alberto Bacchelli},
  booktitle = {Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE)},
  year      = {2025},
  note      = {to appear}
}
