Code and data for the following works:
- SWE-bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
- HuggingFace: https://huggingface.co/datasets/ScaleAI/SWE-bench_Pro
- Public Leaderboard: https://scale.com/leaderboard/swe_bench_pro_public
- Commercial (Private) Leaderboard: https://scale.com/leaderboard/swe_bench_pro_commercial
(10/28) We have released the SWE-agent scaffold to reproduce results, with a step-by-step guide below. We have confirmed that this reproduces the Sonnet 4.5 results.
(10/3) Updated results without the cap limit are available here: https://scaleapi.github.io/SWE-bench_Pro-os/
SWE-bench Pro is a challenging benchmark evaluating LLMs/agents on long-horizon software engineering tasks. Given a codebase and an issue, a language model is tasked with generating a patch that resolves the described problem.
The dataset is inspired by SWE-bench: https://github.com/SWE-bench/SWE-bench
To access SWE-bench Pro, copy and run the following code:
```python
from datasets import load_dataset

swebench = load_dataset('ScaleAI/SWE-bench_Pro', split='test')
```
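As a quick sanity check, you can inspect the loaded split (a minimal sketch; `instance_id` is an assumed field name, matching the prediction JSON format shown later):

```python
# Print the number of task instances and peek at one record.
print(len(swebench))
print(swebench[0]['instance_id'])  # assumed field name, matching the eval JSON format
```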
SWE-bench Pro uses Docker for reproducible evaluations; in addition, the evaluation script requires Modal to scale out the evaluation. Follow the instructions in the Docker setup guide to install Docker on your machine. If you're setting up on Linux, we recommend following the post-installation steps as well.
Run the following commands to store your Modal credentials:
```bash
pip install modal
modal setup  # follow the prompts to generate your token and secret
```
After running these steps, you should see a token ID and secret in `~/.modal.toml`, e.g.:
```toml
token_id = <token id>
token_secret = <token secret>
active = true
```
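If you already have a token (e.g., on a CI machine), you can also store it non-interactively; this assumes the standard Modal CLI:
```bash
modal token set --token-id <token id> --token-secret <token secret>
```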
We store prebuilt Docker images for each instance. They can be found in this Docker Hub repository:
https://hub.docker.com/r/jefzda/sweap-images
The images are named as follows:
```
jefzda/sweap-images:{repo_base}.{repo_name}-{repo_base}__{repo_name}-{hash}
```
For example:
```
jefzda/sweap-images:gravitational.teleport-gravitational__teleport-82185f232ae8974258397e121b3bc2ed0c3729ed-v626ec2a48416b10a88641359a169d99e935ff03
```
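Fetching an image is a plain `docker pull` with the tag; for the example above:
```bash
docker pull jefzda/sweap-images:gravitational.teleport-gravitational__teleport-82185f232ae8974258397e121b3bc2ed0c3729ed-v626ec2a48416b10a88641359a169d99e935ff03
```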
Note that bash runs by default in our images; when running them, you should not manually invoke bash. See #6.
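For instance, an interactive session would look like this (a sketch with a placeholder tag):
```bash
# bash is already the default, so no explicit command is needed:
docker run -it jefzda/sweap-images:<tag>

# Do NOT append bash manually:
# docker run -it jefzda/sweap-images:<tag> bash
```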
Generate patch predictions using your harness of choice.
For generating patches using SWE-agent, see the SWE-agent directory, which contains detailed instructions on how to:
- Set up SWE-agent for patch generation
- Run SWE-agent on SWE-bench Pro instances
- Configure model parameters and turn limits
The output will be `.pred` files containing model-generated patches for each instance.
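Based on the `--directory` description below, the expected layout is roughly the following (instance folder and file names are illustrative assumptions):
```
swe_bench_pro_results/sample1/
├── instance_.../
│   └── <run_name>.pred
└── instance_.../
    └── <run_name>.pred
```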
After generating patches, use the gather_patches.py helper script to collect all patches into a single JSON file for evaluation:
```bash
python helper_code/gather_patches.py \
    --directory <path_to_pred_files> \
    --prefix <model_name> \
    --output <output_file>.json
```
Parameters:
- `--directory`: Directory containing instance folders with `.pred` files (e.g., from SWE-agent output or downloaded trajectories)
- `--prefix`: Prefix identifier for your model/run (e.g., "gpt4", "claude-sonnet", "sample1")
- `--output`: Output JSON file path
Example:
```bash
python helper_code/gather_patches.py \
    --directory swe_bench_pro_results/sample1 \
    --prefix sample1 \
    --output sample1_patches.json
```
This will create a JSON file in the format expected by the evaluation script:
```json
[
  {
    "instance_id": "instance_...",
    "patch": "diff --git ...",
    "prefix": "sample1"
  }
]
```
Evaluate patch predictions on SWE-bench Pro with the following command:
```bash
python swe_bench_pro_eval.py \
    --raw_sample_path=swe_bench_pro_full.csv \
    --patch_path=<your_patches>.json \
    --output_dir=<output_directory> \
    --scripts_dir=run_scripts \
    --num_workers=100 \
    --dockerhub_username=jefzda
```
Replace `gold_patches` with your patch JSON, and point `--raw_sample_path` to the SWE-bench Pro CSV. Gold patches can be compiled from the HuggingFace dataset.
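As a minimal sketch of compiling gold patches (assuming the dataset exposes `instance_id` and `patch` columns matching the JSON format above):
```python
from datasets import load_dataset
import json

# Load the SWE-bench Pro test split.
swebench = load_dataset('ScaleAI/SWE-bench_Pro', split='test')

# Build entries in the format expected by the evaluation script.
# Field names here are assumptions, chosen to match the JSON format above.
gold = [
    {"instance_id": row["instance_id"], "patch": row["patch"], "prefix": "gold"}
    for row in swebench
]

with open("gold_patches.json", "w") as f:
    json.dump(gold, f, indent=2)
```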
To reproduce leaderboard results end-to-end, follow these steps:
- Complete setup in the SWE-agent folder. We recommend using the Docker image to run the scaffold, via `just`.
- Run the scaffold. We have included an example for Claude Sonnet 4.5 (claude.yaml), but feel free to use any model; it also supports vllm for local models. Note that we recommend using the DockerHub images rather than building the Docker images from scratch. You can also execute it locally without Modal.
- Compile predictions with compile_predictions.py.
- Run the evaluation script swe_bench_pro_eval.py.