Your automated factory for GitHub Issue Resolution Training Data and Evaluation Benchmarks.
- An automated pipeline for GitHub issue resolution data collection that reduces your manual effort
- Reliable and reproducible Docker-based evaluation environments
- Automatic environment construction via an LLM-powered multi-agent system (SWE-Builder)
- Support for multiple programming languages (we have extensively evaluated Python, Java, JavaScript, and TypeScript)
Our experiments are conducted using Docker version 27.0.3-1 and Ubuntu 22.04.4 LTS.
To get started, run the following commands to set up the environment:
conda create --name swe-factory python=3.12.5 -y
conda activate swe-factory
pip install -r requirements.txt
We use GitHub APIs and predefined patterns to collect raw issue data (e.g., python-mypy-instances.jsonl). Check the detailed tutorial in the data_collection/collect directory.
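For illustration only, the sketch below shows the kind of GitHub REST API query this step relies on; the repository, filters, output path, and selected fields are placeholders, and the real collection logic lives in data_collection/collect.

```python
# Illustrative sketch only: the actual collection logic lives in data_collection/collect.
# The repository, filters, output path, and selected fields are placeholders.
import json
import os

import requests

ISSUES_API = "https://api.github.com/repos/python/mypy/issues"
HEADERS = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}


def fetch_closed_issues(max_pages=2):
    """Yield closed issues page by page (the issues endpoint also returns PRs, so skip them)."""
    for page in range(1, max_pages + 1):
        resp = requests.get(
            ISSUES_API,
            headers=HEADERS,
            params={"state": "closed", "per_page": 100, "page": page},
            timeout=30,
        )
        resp.raise_for_status()
        for issue in resp.json():
            if "pull_request" not in issue:
                yield issue


with open("python-mypy-issues-raw.jsonl", "w") as f:
    for issue in fetch_closed_issues():
        f.write(json.dumps({"number": issue["number"], "title": issue["title"]}) + "\n")
```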
After collecting raw issue data, set up the evaluation environment by running:
export OPENAI_API_BASE_URL=<your_base_url>
export OPENAI_KEY=<your_key>
python app/main.py swe-bench \
--model gpt-4.1-mini \
--tasks-map "python-mypy-instances.jsonl" \
--num-processes 10 \
--model-temperature 0.2 \
--conv-round-limit 10 \
--output-dir "output/git-4.1-mini/mypy" \
--setup-dir "testbed" \
--results-path "output/git-4.1-mini/mypy/results"
We employ SWE-Builder, an LLM-based multi-agent system consisting of the following components (a simplified sketch of how they interact follows this list):
- 🔍 Repository Explorer: gathers environment setup and test commands automatically.
- 🐳 Environment Manager: generates Dockerfiles for reproducible test environments.
- 📝 Test Manager: writes evaluation scripts to run tests inside containers.
- 🔬 Test Analyst: validates generated environments and orchestrates iterative refinement.
- 💾 Evaluation Environment Memory Pool: reuses previously successful setups for efficiency and consistency.
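The sketch below illustrates how these components could interact in an iterative build-validate loop; the class and method names are hypothetical and do not reflect the actual SWE-Builder API (see the app directory for the real implementation).

```python
# Simplified, illustrative control flow of SWE-Builder's build-validate loop.
# All class and method names here are hypothetical placeholders.

def build_environment(instance, agents, memory_pool, max_rounds=10):
    """Iteratively construct and validate an evaluation environment for one issue instance."""
    # Reuse a previously successful setup for the same repo/version when available.
    cached = memory_pool.lookup(instance)
    if cached is not None:
        return cached

    context = agents.repository_explorer.gather(instance)   # setup & test commands
    for _ in range(max_rounds):
        dockerfile = agents.environment_manager.write_dockerfile(context)
        test_script = agents.test_manager.write_eval_script(context)
        report = agents.test_analyst.validate(instance, dockerfile, test_script)
        if report.ok:
            memory_pool.store(instance, dockerfile, test_script)
            return dockerfile, test_script
        context = context.refine(report.feedback)            # feed errors back for the next round
    return None  # environment construction failed within the round limit
```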
We evaluated SWE-Builder using three base models:
| Base Model | Valid Rate (%) | Success Rate (%) | Avg. Cost (USD/instance) | Avg. Time (min/instance) |
|---|---|---|---|---|
| GPT-4.1-mini | 40.1 (269/671) | 57.2 (384/671) | 0.045 | 22.4 |
| DeepSeek-v3-0324 | 34.6 (232/671) | 50.8 (341/671) | 0.043 | 22.5 |
| Gemini-2.5-flash-preview | 33.5 (225/671) | 49.8 (334/671) | 0.024 | 27.0 |
To reproduce these experiments:
export OPENAI_API_BASE_URL=<your_base_url>
export OPENAI_KEY=<your_key>
bash run/run.sh
After generating evaluation environments, perform Fail2Pass validation:
- Obtain test logs before and after applying the ground-truth patch. Check the evaluation directory for detailed instructions.
- Run automated Fail2Pass validation:
  python scripts/judge_fail2pass.py evaluation/run_instance/mypy_gpt-4.1-mini/gold fail2pass_status.json

The validated instances can then be filtered using the generated fail2pass_status.json.
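As a minimal, hedged example of this filtering step, the snippet below keeps only instances marked fail-to-pass; it assumes fail2pass_status.json maps instance IDs to a truthy status, which may differ from the actual file layout.

```python
# Illustrative filtering of validated instances.
# Assumption: fail2pass_status.json maps instance IDs to a truthy status for
# instances that pass Fail2Pass validation; adjust to the actual file layout.
import json

with open("fail2pass_status.json") as f:
    status = json.load(f)

kept = []
with open("python-mypy-instances.jsonl") as f:
    for line in f:
        inst = json.loads(line)
        if status.get(inst["instance_id"]):
            kept.append(inst)

with open("python-mypy-instances.fail2pass.jsonl", "w") as out:
    for inst in kept:
        out.write(json.dumps(inst) + "\n")

print(f"kept {len(kept)} validated instances")
```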
Note: Although our automated validation demonstrates high precision, manual checks are recommended to ensure dataset quality, particularly to identify and filter out error-to-pass cases.
After building your dataset for evaluation and training, check the evaluation directory for detailed instructions on how to run tests and obtain test execution feedback.
If SWE-Factory helps your research or projects, star ⭐ our repo or cite us:
@article{guo2025swefactory,
  title={SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks},
  author={Lianghong Guo and Yanlin Wang and Caihua Li and Pengyu Yang and Jiachi Chen and Wei Tao and Yingtian Zou and Duyu Tang and Zibin Zheng},
  journal={arXiv preprint arXiv:2506.10954},
  year={2025},
  url={https://arxiv.org/abs/2506.10954},
}
- We build upon prior research that is foundational to our work, including SWE-bench, AutoCodeRover, Magis, and OmniGIRL.
- Huge thanks to the open-source developer community; your invaluable contributions underpin software engineering research! ❤️