SE-Bench is a diagnostic environment designed to rigorously measure an agent's ability to internalize novel knowledge, which is a foundational capability for true self-evolution.
First, create and activate a dedicated Conda environment, then install the required dependencies for the project.
- Conda (Anaconda/Miniconda) installed
- Docker (required for evaluation sandbox)
```bash
# Create a Conda environment named "se-bench" with Python 3.12
conda create -n se-bench python=3.12 -y

# Activate the Conda environment
conda activate se-bench

# Navigate to the SE-Bench project root directory
cd SE-Bench

# Install all required dependencies
pip install -r requirements.txt
```

You can load the dataset using the Hugging Face `datasets` library:
```python
from datasets import load_dataset

# Training split
dataset = load_dataset("jintailin/SE-Bench", "train")
print(dataset)  # data is in dataset['train']

# Single-function test split
dataset = load_dataset("jintailin/SE-Bench", "single_test")
print(dataset)  # data is in dataset['train']

# Multi-function test split
dataset = load_dataset("jintailin/SE-Bench", "multiple_test")
print(dataset)  # data is in dataset['train']
```

Alternatively, you can run the provided `load_datasets.py` script to download and save the data to the local directory structure:

```bash
python load_datasets.py
```

This will generate the following file structure:
| Path | Description | Usage |
|---|---|---|
| `datasets/train/api_doc.jsonl` | API documentation for the `zwc` package | Training material |
| `datasets/train/train.jsonl` | Training questions | Training material |
| `datasets/test/single_test.jsonl` | Single-function problems | Evaluation |
| `datasets/test/multiple_test.jsonl` | Multi-function composition problems | Evaluation |
**Protocol:** Train your model or agent using only the information provided in `datasets/train/`, then evaluate on problems in `datasets/test/` without access to documentation. This tests whether the model has truly internalized the API knowledge.
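Before training, it can help to sanity-check the local files. The sketch below only prints the top-level keys of the first record in each file; the per-file schemas are not fully specified here, so treat what you see locally as authoritative rather than this example:

```python
import json

# Print the top-level keys of the first record in each split file.
# Field names vary per file; inspect your local copies to confirm.
for path in [
    "datasets/train/api_doc.jsonl",
    "datasets/train/train.jsonl",
    "datasets/test/single_test.jsonl",
    "datasets/test/multiple_test.jsonl",
]:
    with open(path, encoding="utf-8") as f:
        record = json.loads(f.readline())
    print(path, "->", sorted(record.keys()))
```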
Before running the rollout scripts, you need to deploy the model (e.g., Qwen3-8B) locally. We support deployment via vLLM or SGLang at localhost:8800.
Option 1: Deploy with vLLM
```bash
pip install vllm

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-8B \
    --port 8800 \
    --host localhost
```

Option 2: Deploy with SGLang
pip install "sglang[all]"
python -m sglang.launch_server \
--model-path Qwen/Qwen3-8B \
--port 8800 \
--host localhostRun the query_only.py script to perform inference using only the query content:
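Both vLLM and SGLang expose an OpenAI-compatible API, so you can verify the deployment before launching rollouts. A minimal sketch, assuming the default `http://localhost:8800/v1` endpoint and that the served model name matches the `--model` value:

```python
from openai import OpenAI

# Local OpenAI-compatible servers typically accept any placeholder API key.
client = OpenAI(base_url="http://localhost:8800/v1", api_key="EMPTY")

# List served models to confirm the server is up and the model is loaded.
print([m.id for m in client.models.list()])

# Send a trivial chat request as an end-to-end check.
resp = client.chat.completions.create(
    model="Qwen/Qwen3-8B",  # assumption: matches the --model value above
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=8,
)
print(resp.choices[0].message.content)
```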
Run the `query_only.py` script to perform inference using only the query content:

```bash
cd src
python query_only.py \
    --num_workers 1 \
    --input_path ../datasets/test/single_test.jsonl \
    --output_path ../rollout_results/query_only.jsonl \
    --model_name Qwen3-8B \
    --host localhost \
    --ports 8800 \
    --sample_count 1 \
    --temperature 0.6 \
    --max_length 8192
```
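To spot-check the output, you can read the first rollout result and pull out the generated code. A minimal sketch, assuming the results follow the rollout schema described in the evaluation section (a `query` field and a `response` field containing a fenced `python` block), with paths relative to the project root:

```python
import json
import re

# Peek at the first rollout result.
with open("rollout_results/query_only.jsonl", encoding="utf-8") as f:
    result = json.loads(f.readline())

print("Query:", result["query"][:200])

# Extract the code wrapped in a ```python fence, mirroring the response
# format described in the evaluation section below.
match = re.search(r"```python\n(.*?)```", result["response"], re.DOTALL)
if match:
    print("Extracted code:\n", match.group(1))
```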
Run the `query_doc.py` script to perform inference with API documentation (requires specifying the document path):

```bash
cd src
python query_doc.py \
    --num_workers 1 \
    --input_path ../datasets/test/single_test.jsonl \
    --output_path ../rollout_results/query_doc.jsonl \
    --doc_path ../datasets/train/api_doc.jsonl \
    --model_name Qwen3-8B \
    --host localhost \
    --ports 8800 \
    --sample_count 1 \
    --temperature 0.6 \
    --max_length 8192
```

| Parameter | Description |
|---|---|
| `--num_workers` | Number of parallel worker processes (adjust based on hardware) |
| `--input_path` | Path to input test dataset (JSONL format) |
| `--output_path` | Path to save rollout (inference) results |
| `--doc_path` | Path to API documentation (required only for `query_doc.py`) |
| `--model_name` | Name of the model to use (e.g., Qwen3-8B) |
| `--host` | Host address of locally deployed models |
| `--ports` | Port(s) of locally deployed models |
| `--base_url` | Base URL for OpenAI-compatible APIs |
| `--api_key` | API key for OpenAI-compatible models |
| `--sample_count` | Number of samples to generate per input |
| `--temperature` | Sampling temperature (higher values = more random outputs) |
| `--max_length` | Maximum length of generated text by the model |
The evaluation phase requires building a Docker sandbox for safe code execution, followed by filtering correct inference trajectories.
The sandbox provides a secure environment for code execution and is deployed at http://localhost:8111 by default:
```bash
# Build and start the Docker sandbox (Docker must be running)
bash sandbox_build_and_run.sh
```

Note: To change the sandbox port, modify the port configuration in `src/evaluation/worker.py`.
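To confirm the sandbox is up before evaluating, a simple reachability check suffices. This sketch only verifies that the port accepts TCP connections; the sandbox's HTTP routes are not documented here, so it deliberately avoids hitting any specific endpoint:

```python
import socket

# Check that the sandbox port accepts connections (default localhost:8111).
with socket.socket() as sock:
    sock.settimeout(3)
    try:
        sock.connect(("localhost", 8111))
        print("Sandbox is reachable on localhost:8111")
    except OSError as exc:
        print(f"Sandbox not reachable: {exc}")
```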
After the sandbox is successfully started, execute the evaluation script to filter correct results:
```bash
cd src
python filter_correct_trajectory.py \
    --input_path ../rollout_results/query_only.jsonl \
    --num_workers 64 \
    --output_path ../evaluation_results/correct_trajectories.jsonl  # Optional: path to save correct trajectories
```
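For a quick pass rate, you can compare the number of filtered trajectories against the number of rollouts. A minimal sketch, assuming one JSON object per line in both files and paths relative to the project root:

```python
# Rough pass rate: filtered (correct) trajectories over total rollouts.
def count_lines(path):
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

total = count_lines("rollout_results/query_only.jsonl")
correct = count_lines("evaluation_results/correct_trajectories.jsonl")
if total:
    print(f"pass rate: {correct}/{total} = {correct / total:.2%}")
```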
If you are not using our generation scripts (`query_only.py` or `query_doc.py`) and wish to evaluate your own model outputs, you must format your rollout results as a JSONL file. Each line should be a dictionary containing the following keys:

| Key | Description |
|---|---|
| `query` | The original question from the dataset. |
| `response` | The model's generation, containing the reasoning process and the execution code wrapped in fenced `python` blocks. |
| `test_cases` | The original test cases from the dataset. Format: `[{"input": ..., "output": ...}, ...]`. |
| `right_exe_result` | The original ground-truth executable result string from the dataset. |
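As a sketch of the expected format, the snippet below writes one conforming record; all values are placeholders for illustration, not real dataset content:

```python
import json

# Write one rollout record per line with the four required keys.
# The values below are placeholders showing the expected shapes.
record = {
    "query": "Implement foo(x) using the zwc package ...",
    "response": "Reasoning ...\n```python\nprint(zwc.foo(1))\n```",
    "test_cases": [{"input": "1", "output": "2"}],
    "right_exe_result": "2",
}

with open("rollout_results/my_model.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```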
Once your data is formatted correctly, you can directly run the evaluation script above.
First, run `query_doc.py` to generate rollouts.
Remember to deploy the model via vLLM or SGLang before rollout.
```bash
cd src
python query_doc.py \
    --num_workers 1 \
    --input_path ../datasets/train/train.jsonl \
    --output_path ../SFT_data/rollout_SFT_data.jsonl \
    --doc_path ../datasets/train/api_doc.jsonl \
    --model_name Qwen3-8B \
    --host localhost \
    --ports 8800 \
    --sample_count 5 \
    --temperature 0.6 \
    --max_length 8192
```

Then, run `filter_correct_trajectory.py` to filter out correct trajectories.
Remember to build the sandbox before filtering.
```bash
python filter_correct_trajectory.py \
    --input_path ../SFT_data/rollout_SFT_data.jsonl \
    --output_path ../SFT_data/valid_SFT_data.jsonl \
    --num_workers 64
```

- Adjust `--num_workers` based on your hardware resources (avoid overloading the system)
- The sandbox must remain running during the entire evaluation process
- All output paths will be created automatically if they do not exist
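If your SFT framework expects chat-style data, the filtered trajectories can be converted in a few lines. A minimal sketch, assuming each filtered line retains the rollout's `query` and `response` fields; the `messages` output layout is an assumption of this example, not part of SE-Bench:

```python
import json

# Convert filtered trajectories into chat-style SFT pairs.
# Assumes each line keeps the rollout's "query" and "response" fields.
with open("SFT_data/valid_SFT_data.jsonl", encoding="utf-8") as fin, \
     open("SFT_data/sft_chat_format.jsonl", "w", encoding="utf-8") as fout:
    for line in fin:
        row = json.loads(line)
        example = {
            "messages": [
                {"role": "user", "content": row["query"]},
                {"role": "assistant", "content": row["response"]},
            ]
        }
        fout.write(json.dumps(example, ensure_ascii=False) + "\n")
```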
This project is licensed under the MIT License. See `LICENSE.md` for details.
If you find our work or dataset helpful for your research, please consider citing our paper:
```bibtex
@article{yuan2026se,
  title={SE-Bench: Benchmarking Self-Evolution with Knowledge Internalization},
  author={Yuan, Jiarui and Jin, Tailin and Chen, Weize and Liu, Zeyuan and Liu, Zhiyuan and Sun, Maosong},
  journal={arXiv preprint arXiv:2602.04811},
  year={2026}
}
```