A Comprehensive Survey on Evaluating Reasoning Capabilities in Multimodal Large Language Models.
Authors: Yaya Shi and Zongyang Ma
-
WE-MATH: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
- 📄 Paper: https://arxiv.org/pdf/2407.01284
- 🤗 Dataset: https://huggingface.co/datasets/We-Math/We-Math
-
(ICLR 2024 Oral) MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models
- 📄 Paper: https://arxiv.org/pdf/2310.02255
- 🌐 Project: https://mathvista.github.io
- 💻 Code: https://github.com/lupantech/MathVista
- 🤗 Dataset: https://huggingface.co/datasets/AI4Math/MathVista (see the loading sketch below)
- 🏆 LeaderBoard: https://mathvista.github.io/#leaderboard
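Most benchmarks in this list distribute their data through the Hugging Face Hub, so they can be pulled directly with the `datasets` library. Below is a minimal loading sketch for MathVista; the `testmini` split and the `question`/`image` field names are assumptions based on the dataset card and may differ for other benchmarks.

```python
# Minimal sketch: load an HF-hosted benchmark (MathVista here) and inspect one item.
# Assumption: a "testmini" split exposing "question" and "image" fields, per the dataset card.
from datasets import load_dataset

mathvista = load_dataset("AI4Math/MathVista", split="testmini")

sample = mathvista[0]
print(sample["question"])  # problem statement (text)
print(sample["image"])     # associated figure (PIL image)
```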
-
(ECCV 2024) MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
- 📄 Paper: https://arxiv.org/pdf/2403.14624
- 🌐 Project: https://mathverse-cuhk.github.io
- 💻 Code: https://github.com/ZrrSkywalker/MathVerse
- 🤗 Dataset: https://huggingface.co/datasets/AI4Math/MathVerse
- 🏆 LeaderBoard: https://mathverse-cuhk.github.io/#leaderboard
-
(NeurIPS DB Track 2024) MATH-Vision: Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset
- 📄 Paper: https://arxiv.org/pdf/2402.14804
- 🌐 Project: https://mathllm.github.io/mathvision
- 💻 Code: https://github.com/mathllm/MATH-V
- 🤗 Dataset: https://huggingface.co/datasets/MathLLMs/MathVision
- 🏆 LeaderBoard: https://mathllm.github.io/mathvision/#leaderboard
-
MathScape: Evaluating MLLMs in multimodal Math Scenarios through a Hierarchical Benchmark [Chinese]
-
CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of LMMs [Chinese]
- 📄 Paper: https://arxiv.org/pdf/2409.02834
- 💻 Code: https://github.com/ECNU-ICALK/EduChat-Math/
-
OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems
- 📄 Paper: https://arxiv.org/pdf/2402.14008
- 🤗 Dataset: https://huggingface.co/datasets/Hothan/OlympiadBench
-
MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts
- 📄 Paper: https://arxiv.org/pdf/2502.20808
- 🌐 Project: https://eternal8080.github.io/MV-MATH.github.io/
- 💻 Code: https://github.com/eternal8080/MV-MATH
- 🤗 Dataset: https://huggingface.co/datasets/PeijieWang/MV-MATH
-
ChartBench: A Benchmark for Complex Visual Reasoning in Charts
- 📄 Paper: https://arxiv.org/pdf/2312.15915
- 🌐 Project: https://chartbench.github.io/
- 🤗 Dataset: https://huggingface.co/datasets/SincereX/ChartBench
-
MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems
- 📄 Paper: https://arxiv.org/pdf/2410.14179v2
- 💻 Code: https://github.com/Zivenzhu/Multi-chart-QA
-
(NeurIPS DB Track 2024) CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
- 📄 Paper: https://arxiv.org/pdf/2406.18521
- 🌐 Project: https://charxiv.github.io/
- 🤗 Dataset: https://huggingface.co/datasets/princeton-nlp/CharXiv
-
M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models
- 📄 Paper: https://arxiv.org/pdf/2405.15638
- 🌐 Project: https://m4u-benchmark.github.io/m4u.github.io/
- 🤗 Dataset: https://huggingface.co/datasets/M4U-Benchmark/M4U
-
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
- 📄 Paper: https://arxiv.org/pdf/2311.16502
- 🤗 Dataset: https://huggingface.co/datasets/MMMU/MMMU
-
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
- 📄 Paper: https://arxiv.org/pdf/2409.02813
- 🤗 Dataset: https://huggingface.co/datasets/MMMU/MMMU_Pro (see the scoring sketch below)
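Multi-discipline benchmarks such as MMMU and MMMU-Pro are largely multiple-choice, so evaluation typically reduces to comparing a predicted option letter against the gold answer. The sketch below shows only that scoring loop; `predict_option` is a hypothetical stand-in for the model under test, and the `Accounting` config, `validation` split, and `question`/`options`/`answer` field names are assumptions taken from the MMMU dataset card.

```python
# Sketch of a multiple-choice accuracy loop over an HF-hosted benchmark (MMMU here).
# predict_option() is a hypothetical placeholder for the MLLM being evaluated;
# config/split/field names are assumptions -- verify them against the dataset card.
from datasets import load_dataset

def predict_option(question: str, options) -> str:
    """Placeholder: return an option letter such as 'A' from your model."""
    return "A"

subset = load_dataset("MMMU/MMMU", "Accounting", split="validation")

correct = 0
for item in subset:
    pred = predict_option(item["question"], item["options"])
    correct += int(pred == item["answer"])  # gold answers are stored as option letters

print(f"accuracy: {correct / len(subset):.3f}")
```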
-
ScienceQA: Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
- 📄 Paper: https://arxiv.org/pdf/2209.09513
- 🤗 Dataset: https://huggingface.co/datasets/derek-thomas/ScienceQA
-
TheoremQA: A Theorem-driven Question Answering Dataset
- 📄 Paper: https://arxiv.org/pdf/2305.12524
- 🤗 Dataset: https://huggingface.co/datasets/TIGER-Lab/TheoremQA
-
Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark
- 📄 Paper: https://arxiv.org/pdf/2501.05444v1
- 🤗 Dataset: https://huggingface.co/datasets/luckychao/EMMA
-
GAOKAO-MM: A Chinese Human-Level Benchmark for Multimodal Models Evaluation
- 📄 Paper: https://arxiv.org/pdf/2402.15745
- 💻 Code: https://github.com/OpenMOSS/GAOKAO-MM
-
OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems [Chinese & English]
- 📄 Paper: https://arxiv.org/pdf/2402.14008
- 🤗 Dataset: https://huggingface.co/datasets/Hothan/OlympiadBench
-
CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark [Chinese]
- 📄 Paper: https://arxiv.org/pdf/2401.11944
- 🤗 Dataset: https://huggingface.co/datasets/m-a-p/CMMMU
-
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
- 📄 Paper: https://arxiv.org/pdf/2406.09961
- 💻 Code: https://github.com/ChartMimic/ChartMimic
- 🤗 Dataset: https://huggingface.co/datasets/ChartMimic/ChartMimic
-
Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots
- 📄 Paper: https://arxiv.org/pdf/2405.07990
- 🤗 Dataset: https://huggingface.co/datasets/TencentARC/Plot2Code
-
HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks
- 📄 Paper: https://arxiv.org/pdf/2410.12381
- 💻 Code: https://github.com/HumanEval-V/HumanEval-V-Benchmark
- 🤗 Dataset: https://huggingface.co/datasets/HumanEval-V/HumanEval-V-Benchmark
- 🏆 LeaderBoard: https://humaneval-v.github.io/#leaderboard
-
MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models
- 📄 Paper: https://arxiv.org/pdf/2502.00698
- 🤗 Dataset: https://huggingface.co/datasets/huanqia/MM-IQ
-
LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts
- 📄 Paper: https://arxiv.org/pdf/2407.04973
- 💻 Code: https://github.com/Yijia-Xiao/LogicVista
-
The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles
- 📄 Paper: https://arxiv.org/pdf/2502.01081
- 💻 Code: https://github.com/declare-lab/LLM-PuzzleTest/
-
Computational Meme Understanding: A Survey
-
II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models
- 📄 Paper: https://arxiv.org/pdf/2406.05862
- 🤗 Dataset: https://huggingface.co/datasets/m-a-p/II-Bench
-
Can MLLMs Understand the Deep Implication Behind Chinese Images? [Chinese]
- 📄 Paper: https://arxiv.org/pdf/2410.13854
- 🌐 Project: https://cii-bench.github.io/
- 🤗 Dataset: https://huggingface.co/datasets/m-a-p/CII-Bench
- 🏆 LeaderBoard: https://cii-bench.github.io/#leaderboard
-
PunchBench: Benchmarking MLLMs in Multimodal Punchline Comprehension
- 📄 Paper: https://arxiv.org/pdf/2412.11906
-
GPT-4V(ision) as A Social Media Analysis Engine
- 📄 Paper: https://arxiv.org/pdf/2311.07547
- 💻 Code: https://github.com/VIStA-H/GPT-4V_Social_Media
-
Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework
- 📄 Paper: https://arxiv.org/pdf/2502.13759
-
NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models
- 📄 Paper: https://arxiv.org/pdf/2403.01777
- 💻 Code: https://github.com/lizhouf/NPHardEval4V
-
Autonomous Driving
- Exploring the Potential of Multi-Modal AI for Driving Hazard Prediction
- 📄 Paper: https://arxiv.org/pdf/2310.04671v4
- 💻 Code: https://github.com/DHPR-dataset/DHPR-dataset
- 🤗 Dataset: https://huggingface.co/datasets/DHPR/Driving-Hazard-Prediction-and-Reasoning
-
Robot Manipulation
- A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards
- 📄 Paper: https://arxiv.org/pdf/2502.08643
- 💻 Code: https://github.com/shivanshpatel35/IKER
- 🌐 Project: https://iker-robot.github.io/
- 🌐 Project: https://simpler-env.github.io/
-
GUI Agent
-
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
- 📄 Paper: https://arxiv.org/pdf/2501.04575
- 💻 Code: https://github.com/Reallm-Labs/InfiGUIAgent
- 🤗 Dataset: https://huggingface.co/datasets/Reallm-Labs/InfiGUIAgent-Data
-
Mind2Web: Towards a Generalist Agent for the Web
- 📄 Paper: https://arxiv.org/abs/2306.06070
- 🌐 Project: https://osu-nlp-group.github.io/Mind2Web/
- 💻 Code: https://github.com/OSU-NLP-Group/Mind2Web
- 🤗 Dataset: https://huggingface.co/datasets/osunlp/Mind2Web
-
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
- 📄 Paper: https://arxiv.org/pdf/2401.10935
-
Spatial Planning
-
Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
- 📄 Paper: https://arxiv.org/pdf/2501.07542
-
iVISPAR – An Interactive Visual-Spatial Reasoning Benchmark for VLMs
- 📄 Paper: https://arxiv.org/pdf/2502.03214v1
- 💻 Code: https://github.com/SharkyBamboozle/iVISPAR
-
Spatial Relationship
-
PulseCheck457: A Diagnostic Benchmark for Comprehensive Spatial Reasoning of Large Multimodal Models
- 📄 Paper: https://www.arxiv.org/pdf/2502.08636
-
Defining and Evaluating Visual Language Models' Basic Spatial Abilities: A Perspective from Psychometrics
- 📄 Paper: https://arxiv.org/pdf/2502.11859
-
M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought
- 📄 Paper: https://arxiv.org/pdf/2405.16473
- 🤗 Dataset: https://huggingface.co/datasets/LightChen2333/M3CoT
-
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
- 📄 Paper: https://arxiv.org/pdf/2303.11381
- 💻 Code: https://github.com/microsoft/MM-REACT
-
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency
- 📄 Paper: https://arxiv.org/pdf/2502.09621
- 🌐 Project: https://mmecot.github.io/
- 🤗 Dataset: https://huggingface.co/datasets/CaraJ/MME-CoT
-
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
- 📄 Paper: https://arxiv.org/pdf/2501.06186
- 🤗 Dataset: https://huggingface.co/datasets/omkarthawakar/VRC-Bench
-
(NeurIPS DB Track 2024) MLLM-CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs
- 📄 Paper: https://arxiv.org/pdf/2407.16837
- 🌐 Project: https://compbench.github.io/
-
ZeroBench: An Impossible* Visual Benchmark for Contemporary Large Multimodal Models
- 📄 Paper: https://arxiv.org/pdf/2502.09696
- 🌐 Project: https://zerobench.github.io/
-
R1-Onevision
- 🌐 Project: https://yangyi-vai.notion.site/r1-onevision
- 🤗 Dataset: https://huggingface.co/datasets/Fancy-MLLM/R1-Onevision-Bench
-
VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
- 📄 Paper: https://arxiv.org/pdf/2411.17451
- 🌐 Project: https://vl-rewardbench.github.io/
- 💻 Code: https://github.com/vl-rewardbench/VL_RewardBench
- 🤗 Dataset: https://huggingface.co/datasets/MMInstruction/VL-RewardBench
- 🏆 LeaderBoard: https://huggingface.co/spaces/MMInstruction/VL-RewardBench
- ✍️✍️✍️ Based on the aforementioned research, we are currently writing a survey paper and developing a benchmark.
- 🤗🤗🤗 We warmly invite everyone to join this collaborative project. Please feel free to reach out to us or submit pull requests; your contributions are highly valued!