MorphoBench: A Benchmark with Difficulty Adaptive to Model Reasoning
Xukai Wang*, Xuanbo Liu*, Mingrui Chen*, Haitian Zhong*, Xuanlin Yang*, Bohan Zeng, Jinbo Hu, Hao Liang, Junbo Niu, Xuchen Li, Ruitao Wu, Ruichuan An, Yang Shi, Liu Liu, Xu-Yao Zhang, Qiang Liu, Zhouchen Lin, Wentao Zhang, Bin Dong
MorphoBench is an adaptive reasoning benchmark for large-scale models. It curates over 1,300 multidisciplinary questions and dynamically adjusts task difficulty based on model reasoning traces, providing a scalable and reliable framework for evaluating the reasoning performance of advanced models like o3 and GPT-5.
The MorphoBench dataset is available on Hugging Face: OpenDCAI/MorphoBench
from datasets import load_dataset
dataset = load_dataset("OpenDCAI/MorphoBench")

After downloading, create a data/ folder inside your local project directory and place the datasets there:
MorphoBench/
├── adaption/
├── asset/
├── data/
│   ├── Morpho_P_Perturbed/
│   ├── Morpho_P_v0/
│   ├── Morpho_R_Complex/
│   ├── Morpho_R_Lite/
│   └── Morpho_R_v0/
├── scripts/
├── output/
└── ...
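If you prefer to fetch the files programmatically, the sketch below uses huggingface_hub's snapshot_download. Downloading straight into data/ is an assumption about the expected layout, so verify that the resulting subfolders match the tree above.

from pathlib import Path
from huggingface_hub import snapshot_download

# Download the dataset repository directly into the local data/ folder.
# NOTE: placing the snapshot under data/ is an assumption about the expected
# layout; check that the resulting subfolders match the tree shown above.
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)
snapshot_download(
    repo_id="OpenDCAI/MorphoBench",
    repo_type="dataset",
    local_dir=data_dir,
)

# List what was downloaded as a quick sanity check.
for entry in sorted(data_dir.iterdir()):
    print(entry)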
cd MorphoBench
pip install -r requirements.txt

Generate model predictions for all datasets:
bash scripts/run_batch.sh

Predictions will be saved under:
output/infer_result/
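To inspect the raw predictions, a minimal sketch is shown below. It assumes the scripts write one JSON Lines file per dataset; the file layout and record schema are assumptions, not the documented output format of run_batch.sh.

import json
from pathlib import Path

# Hypothetical reader: file names and record schema are assumptions,
# not the documented output format of run_batch.sh.
for pred_file in sorted(Path("output/infer_result").glob("**/*.jsonl")):
    with pred_file.open() as f:
        records = [json.loads(line) for line in f if line.strip()]
    print(f"{pred_file}: {len(records)} predictions")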
Evaluate the reasoning performance:
bash scripts/evaluate_batch.sh

Evaluation metrics will be stored in:
output/eval_result/
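To collect the scores in one place, a minimal sketch is shown below. The metric file format is an assumption; adapt the glob pattern and keys to whatever evaluate_batch.sh actually writes.

import json
from pathlib import Path

# Hypothetical summary: the metric file format is an assumption; adjust
# this to the actual output of evaluate_batch.sh.
for metrics_file in sorted(Path("output/eval_result").glob("**/*.json")):
    metrics = json.loads(metrics_file.read_text())
    print(metrics_file.stem, metrics)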
The following figure summarizes the evaluation results on MorphoBench.
This repository adapts the evaluation script from Humanity's Last Exam. We sincerely thank the authors for their valuable contributions to the research community.

