MorphoBench: A Benchmark with Difficulty Adaptive to Model Reasoning
Xukai Wang*, Xuanbo Liu*, Mingrui Chen*, Haitian Zhong*, Xuanlin Yang*, Bohan Zeng, Jinbo Hu, Hao Liang, Junbo Niu, Xuchen Li, Ruitao Wu, Ruichuan An, Yang Shi, Liu Liu, Xu-Yao Zhang, Qiang Liu, Zhouchen Lin, Wentao Zhang, Bin Dong
MorphoBench is an adaptive reasoning benchmark for large-scale models. It curates over 1,300 multidisciplinary questions and dynamically adjusts task difficulty based on model reasoning traces, providing a scalable and reliable framework for evaluating the reasoning performance of advanced models like o3 and GPT-5.
The MorphoBench dataset is available on Hugging Face: OpenDCAI/MorphoBench
from datasets import load_dataset
dataset = load_dataset("OpenDCAI/MorphoBench")

After downloading, create a data/ folder inside your local project directory and place the datasets there:
MorphoBench/
├── adaption/
├── asset/
├── data/
│   ├── Morpho_P_Perturbed/
│   ├── Morpho_P_v0/
│   ├── Morpho_R_Complex/
│   ├── Morpho_R_Lite/
│   └── Morpho_R_v0/
├── scripts/
├── output/
└── ...
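If you prefer to fetch the files programmatically, the sketch below uses huggingface_hub's snapshot_download. Downloading straight into data/ is an assumption about the expected layout, so verify that the resulting subfolders match the tree above.

from pathlib import Path
from huggingface_hub import snapshot_download

# Download the dataset repository directly into the local data/ folder.
# NOTE: placing the snapshot under data/ is an assumption about the expected
# layout; check that the resulting subfolders match the tree shown above.
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)
snapshot_download(
    repo_id="OpenDCAI/MorphoBench",
    repo_type="dataset",
    local_dir=data_dir,
)

# List what was downloaded as a quick sanity check.
for entry in sorted(data_dir.iterdir()):
    print(entry)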
cd MorphoBench
pip install -r requirements.txt

Generate model predictions for all datasets:
bash scripts/run_batch.sh

Predictions will be saved under:
output/infer_result/
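To inspect the raw predictions, a minimal sketch is shown below. It assumes the scripts write one JSON Lines file per dataset; the file layout and record schema are assumptions, not the documented output format of run_batch.sh.

import json
from pathlib import Path

# Hypothetical reader: file names and record schema are assumptions,
# not the documented output format of run_batch.sh.
for pred_file in sorted(Path("output/infer_result").glob("**/*.jsonl")):
    with pred_file.open() as f:
        records = [json.loads(line) for line in f if line.strip()]
    print(f"{pred_file}: {len(records)} predictions")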
Evaluate the reasoning performance:
bash scripts/evaluate_batch.sh

Evaluation metrics will be stored in:
output/eval_result/
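To collect the scores in one place, a minimal sketch is shown below. The metric file format is an assumption; adapt the glob pattern and keys to whatever evaluate_batch.sh actually writes.

import json
from pathlib import Path

# Hypothetical summary: the metric file format is an assumption; adjust
# this to the actual output of evaluate_batch.sh.
for metrics_file in sorted(Path("output/eval_result").glob("**/*.json")):
    metrics = json.loads(metrics_file.read_text())
    print(metrics_file.stem, metrics)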
The following figure summarizes the evaluation results on MorphoBench.
This repository adapts the evaluation script from Humanity's Last Exam. We sincerely thank the authors for their valuable contributions to the research community.

