SUPERChem: A Multimodal Reasoning Benchmark in Chemistry

🌐 Website | 📄 Paper | 🤗 Dataset

This repository contains the official evaluation framework for SUPERChem, an expert-curated, reasoning-intensive multimodal benchmark for the rigorous evaluation of deep chemical reasoning in Large Language Models (LLMs) and Multimodal LLMs (MLLMs).


Abstract

Current benchmarks for evaluating the chemical reasoning capabilities of Large Language Models (LLMs) are limited by oversimplified tasks, ceiling effects, lack of process-level evaluation, and misalignment with expert-level chemistry skills. To address these issues, we introduce SUPERChem, a benchmark of 500 expert-curated reasoning-intensive chemistry problems, covering diverse subfields and provided in both multimodal and text-only formats. Original content and an iterative curation pipeline eliminate flawed items and mitigate data contamination. Each problem is paired with an expert-authored solution path, enabling Reasoning Path Fidelity (RPF) scoring to evaluate reasoning quality beyond final-answer accuracy. Evaluations against a human baseline of 40.3% accuracy show that even the best-performing model, GPT-5 (High), reaches only 38.5%. By combining high difficulty, controlled multimodality, and process-level metrics, SUPERChem provides a rigorous platform for diagnosing and advancing AI chemical reasoning toward expert-level scientific inquiry.


Key Features

  • Expert-Level Challenge: 500 reasoning-intensive problems curated by domain experts to test deep chemical reasoning and mitigate the ceiling effects seen in other benchmarks.
  • Process-Level Evaluation: Introduces Reasoning Path Fidelity (RPF), a metric to assess the alignment of a model's reasoning with expert-authored solution paths, distinguishing genuine understanding from "lucky guesses" (an illustrative judge-scoring sketch follows this list).
  • Controlled Multimodality: Each problem is available in both multimodal (with images) and text-only formats, enabling a rigorous, controlled analysis of a model's ability to integrate visual information.
  • Fine-Grained Ability Taxonomy: A systematic categorization of chemical knowledge and reasoning skills supports detailed diagnosis of model strengths and weaknesses across various sub-domains.
  • Contamination Resistant: Problems are newly authored or adapted from non-public sources and undergo a rigorous human-in-the-loop curation process to ensure quality and reduce the risk of data leakage from web-scraped training sets.
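
The RPF scoring in eval/eval_cot.py relies on an LLM judge comparing a model's reasoning against the expert-authored solution path. The sketch below is only a minimal illustration of that judge-scoring pattern, assuming an OpenAI-compatible client, a hypothetical prompt, and a 0-1 fidelity scale; it is not the benchmark's actual prompt, rubric, or implementation.

# Illustrative LLM-as-judge sketch (NOT the official RPF implementation).
# Assumptions: an OpenAI-compatible API, a hypothetical judge prompt, a 0-1 scale.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are grading a chemistry solution.\n"
    "Reference solution path:\n{reference}\n\n"
    "Model reasoning:\n{candidate}\n\n"
    "Reply with a single number between 0 and 1 indicating how faithfully "
    "the model's reasoning follows the reference solution path."
)

def score_reasoning_path(reference: str, candidate: str, judge_model: str = "gpt-4o") -> float:
    """Ask a judge model for a rough reasoning-path-fidelity score."""
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(reference=reference, candidate=candidate)}],
        temperature=0,
    )
    return float(response.choices[0].message.content.strip())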

Repository Structure

This repository provides the tools to run evaluations on the SUPERChem dataset.

.
├── eval/               # Scripts for running evaluations and generating model outputs.
├── data/               # Datasets, metadata, human baselines, and raw evaluation results.
├── analysis/           # Scripts for processing results and generating analyses/plots.
└── results/            # Output directory for generated plots and figures.
  • eval/: Contains the core scripts for running evaluations.

    • eval/config.yaml: Before running any evaluations, copy eval/config.yaml.sample to eval/config.yaml and fill in your API settings (a loading sketch follows this list). Do not commit or otherwise publish your API key.
    • eval/eval.py: Runs various model checkpoints to generate answers and tag their abilities.
    • eval/eval_cot.py: Uses a judge model to perform a fine-grained evaluation of a model's reasoning (RPF scoring).
  • data/: Stores all necessary data for the evaluations.

    • data/20251015_baseline.csv: Human performance baseline.
    • data/ability_tags_description.json: Definitions for all ability tags.
    • data/dataset_split_map.json: Pre-defined dataset splits based on difficulty.
    • This folder also serves as the output location for raw data from the eval/ scripts.
  • analysis/: Includes Python scripts for post-processing and analyzing the evaluation data. This is where you can generate metrics and visualizations like radar charts, pass@k curves, and breakpoint analyses.

  • results/: This folder is the designated output directory for the visual artifacts (plots, charts, etc.) generated by the scripts in the analysis/ directory.
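
As a concrete illustration of the configuration step, the sketch below loads eval/config.yaml with PyYAML (already in the dependency list). The key names used here (api_key, base_url, models) are illustrative assumptions; the actual schema is whatever eval/config.yaml.sample defines.

# Minimal sketch of reading eval/config.yaml with PyYAML.
# The keys below (api_key, base_url, models) are illustrative assumptions;
# follow eval/config.yaml.sample for the actual schema.
import yaml

with open("eval/config.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

api_key = config.get("api_key")    # never commit or publish this value
base_url = config.get("base_url")  # e.g. an OpenAI-compatible endpoint
models = config.get("models", [])  # the models to evaluate

print(f"Loaded {len(models)} model entries from eval/config.yaml")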


Python Dependencies

pip install pandas pyarrow loguru openai pyyaml tqdm plotly playwright scipy seaborn Pillow streamlit ipykernel

Evaluation Workflow

A typical evaluation workflow follows these steps:

  1. Configure Evaluation:

    • Modify the shell scripts in the eval/ directory to specify the models you want to test, the input files, and other parameters.
  2. Run Evaluation:

    • Execute the scripts from the eval/ directory to generate model answers and perform CoT evaluations.
    • The raw and evaluated .jsonl files will be saved in the data/ directory.
  3. Analyze Results:

    • Use the Python scripts in the analysis/ directory to process the data stored in data/.
    • For example, run calc_pass_withbaseline.py to get accuracy tables or draw_radar_plotly.py to visualize model capabilities (a quick-look sketch follows these steps).
  4. View Outputs:

    • The plots and figures generated by the analysis scripts will be saved in the results/ directory.
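
For quick orientation, the sketch below loads an evaluated .jsonl file from data/ with pandas and computes a per-model accuracy. The file name and the column names (model, correct) are assumptions made for illustration; the official metrics come from the scripts in analysis/, such as calc_pass_withbaseline.py.

# Quick-look sketch for an evaluated .jsonl file; not a replacement for the
# scripts in analysis/. The file name and columns ("model", "correct") are
# illustrative assumptions.
import pandas as pd

df = pd.read_json("data/example_eval_results.jsonl", lines=True)  # hypothetical file name

# Per-model accuracy, assuming a boolean "correct" column.
accuracy = df.groupby("model")["correct"].mean().sort_values(ascending=False)
print(accuracy.to_string())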

Citation

If you use SUPERChem or this evaluation framework in your research, please cite our paper:

@misc{zhao2025superchemmultimodalreasoningbenchmark,
      title={SUPERChem: A Multimodal Reasoning Benchmark in Chemistry}, 
      author={Zehua Zhao and Zhixian Huang and Junren Li and Siyu Lin and Junting Zhou and Fengqi Cao and Kun Zhou and Rui Ge and Tingting Long and Yuexiang Zhu and Yan Liu and Jie Zheng and Junnian Wei and Rong Zhu and Peng Zou and Wenyu Li and Zekai Cheng and Tian Ding and Yaxuan Wang and Yizhao Yan and Tingru Wei and Haowei Ming and Weijie Mao and Chen Sun and Yiming Liu and Zichen Wang and Zuo Zhang and Tong Yang and Hao Ma and Zhen Gao and Jian Pei},
      year={2025},
      eprint={2512.01274},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.01274}, 
}
