1Tencent, 2PKU, 2NUS, 2SEU, 2NJU
⚡We will actively maintain this repository and incorporate new research as it emerges. If you have any questions, please contact swordli@tencent.com. We welcome collaboration on academic research and paper writing.
Multimodal Large Language Models (MLLMs) are gaining popularity in both academia and industry due to their remarkable performance in applications such as visual question answering, visual perception, understanding, and reasoning. Over the past few years, significant efforts have been made to examine MLLMs from multiple perspectives. This paper presents a comprehensive review of 200+ benchmarks and evaluations for MLLMs, focusing on (1) perception and understanding, (2) cognition and reasoning, (3) specific domains, (4) key capabilities, and (5) other modalities. Finally, we discuss the limitations of current evaluation methods for MLLMs and explore promising future directions. Our key argument is that evaluation should be regarded as a crucial discipline to better support the development of MLLMs.
Comprehensive Evaluation
- "Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want". Lin W, Wei X, An R, et al. arXiv 2024. [Paper] [Github].
- "ChEF: A Comprehensive Evaluation Framework for Standardized Assessment of Multimodal Large Language Models". Shi Z, Wang Z, Fan H, et al. arXiv 2023. [Paper] [Github].
Fine-grained Perception
Image Understanding
General Reasoning
Knowledge-based Reasoning
Intelligence & Cognition
Text-rich VQA
Decision-making Agents
Diverse Cultures & Languages
Other Applications
Long-context
- MileBench "MileBench: Benchmarking MLLMs in Long Context". Song D, Chen S, Chen G H, et al. arXiv 2024. [Paper] [Github].
- MMNeedle "Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models". Wang H, Shi H, Tan S, et al. arXiv 2024. [Paper] [Github].
- MLVU "MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding". Zhou J, Shu Y, Zhao B, et al. arXiv 2024. [Paper] [Github].
Instruction Following
- CoIN "CoIN: A Benchmark of Continual Instruction tuNing for Multimodel Large Language Model". Chen C, Zhu J, Luo X, et al. arXiv 2024. [Paper] [Github].
- MIA-Bench "MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs". Qian Y, Ye H, Fauconnier J P, et al. arXiv 2024. [Paper] [Github].
- DEMON "Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions". Li J, Pan K, Ge Z, et al. ICLR 2024. [Paper] [Github].
- VisIT-Bench "VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use". Bitton Y, Bansal H, Hessel J, et al. NeurIPS 2023. [Paper] [Github].
Hallucination
- POPE "Evaluating Object Hallucination in Large Vision-Language Models". Li Y, Du Y, Zhou K, et al. EMNLP 2023. [Paper] [Github].
- GAVIE "Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning". Liu F, Lin K, Li L, et al. ICLR 2024. [Paper] [Github].
- HaELM "Evaluation and Analysis of Hallucination in Large Vision-Language Models". Wang J, Zhou Y, Xu G, et al. arXiv 2023. [Paper] [Github].
- M-HalDetect "Detecting and Preventing Hallucinations in Large Vision Language Models". Gunjal A, Yin J, Bas E. AAAI 2024. [Paper] [Github].
- Bingo "Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges". Cui C, Zhou Y, Yang X, et al. arXiv 2023. [Paper] [Github].
- HallusionBench "HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models". Guan T, Liu F, Wu X, et al. CVPR 2024. [Paper] [Github].
- VHTest "Visual Hallucinations of Multi-modal Large Language Models". Huang W, Liu H, Guo M, et al. arXiv 2024. [Paper] [Github].
- CorrelationQA "The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs". Han T, Lian Q, Pan R, et al. arXiv 2024. [Paper] [Github].
- CHAIR "Object Hallucination in Image Captioning". Rohrbach A, Hendricks L A, Burns K, et al. EMNLP 2018. [Paper] [Github].
- MHaluBench "Unified Hallucination Detection for Multimodal Large Language Models". Chen X, Wang C, Xue Y, et al. arXiv 2024. [Paper] [Github].
- VideoHallucer "VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models". Wang Y, Wang Y, Zhao D, et al. arXiv 2024. [Paper] [Github].
- MMHAL-BENCH "Aligning Large Multimodal Models with Factually Augmented RLHF". Sun Z, Shen S, Cao S, et al. arXiv 2023. [Paper] [Github].
- AMBER "AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation". Wang J, Wang Y, Xu G, et al. arXiv 2023. [Paper] [Github].
- MMECeption "GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data". Cao L, Buchner V, Senane Z, et al. arXiv 2024. [Paper] [Github].
Robustness
- MAD-Bench "How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts". Qian Y, Zhang H, Yang Y, et al. arXiv 2024. [Paper] [Github].
- MMR "Seeing Clearly, Answering Incorrectly: A Multimodal Robustness Benchmark for Evaluating MLLMs on Leading Questions". Liu Y, Liang Z, Wang Y, et al. arXiv 2024. [Paper] [Github].
- MM-SpuBench "MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs". Ye W, Zheng G, Ma Y, et al. arXiv 2024. [Paper] [Github].
- MM-SAP "MM-SAP: A Comprehensive Benchmark for Assessing Self-Awareness of Multimodal Large Language Models in Perception". Wang Y, Liao Y, Liu H, et al. arXiv 2024. [Paper] [Github].
- BenchLMM "BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models". Cai R, Song Z, Guan D, et al. arXiv 2023. [Paper] [Github].
- VQAv2-IDK "Visually Dehallucinative Instruction Generation: Know What You Don’t Know". Cha S, Lee J, Lee Y, et al. ICASSP 2024. [Paper] [Github].
Safety
- MMUBench "Single Image Unlearning: Efficient Machine Unlearning in Multimodal Large Language Models". Li J, Wei Q, Zhang C, et al. arXiv 2024. [Paper] [Github].
- JailBreakV-28K "JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks". Luo W, Ma S, Liu X, et al. arXiv 2024. [Paper] [Github].
- MultiTrust "Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study". Zhang Y, Huang Y, Sun Y, et al. arXiv 2024. [Paper] [Github].
- MM-SafetyBench "MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models". Liu X, Zhu Y, Gu J, et al. ECCV 2024. [Paper] [Github].
- SHIELD "SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models". Shi Y, Gao Y, Lai Y, et al. arXiv 2024. [Paper] [Github].
- RTVLM "Red Teaming Visual Language Models". Li M, Li L, Yin Y, et al. arXiv 2024. [Paper] [Github].
Temporal Perception
- MVBench "MVBench: A Comprehensive Multi-modal Video Understanding Benchmark". Li K, Wang Y, He Y, et al. CVPR 2024. [Paper] [Github].
- TimeIT "TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding". Ren S, Yao L, Li S, et al. CVPR 2024. [Paper] [Github].
- ViLMA "ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models". Kesen I, Pedrotti A, Dogan M, et al. ICLR 2024. [Paper] [Github].
- VITATECS "VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models". Li S, Li L, Ren S, et al. arXiv 2023. [Paper] [Github].
- TempCompass "TempCompass: Do Video LLMs Really Understand Videos?". Liu Y, Li S, Liu Y, et al. arXiv 2024. [Paper] [Github].
- OSCaR "OSCaR: Object State Captioning and State Change Representation". Nguyen N, Bi J, Vosoughi A, et al. arXiv 2024. [Paper] [Github].
- ADLMCQ "LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living". Chakraborty R, Sinha A, Reilly D, et al. arXiv 2024. [Paper] [Github].
- Perception Test "Perception Test: A Diagnostic Benchmark for Multimodal Video Models". Patraucean V, Smaira L, Gupta A, et al. NeurIPS 2023. [Paper] [Github].
Long Video Understanding
- MovieChat-1k "MovieChat: From Dense Token to Sparse Memory for Long Video Understanding". [Paper] [Github].
- EgoSchema "EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding". [Paper] [Github].
- Event-Bench "Towards Event-oriented Long Video Understanding". [Paper] [Github].
- MLVU "MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding". Zhou J, Shu Y, Zhao B, et al. arXiv 2024. [Paper] [Github].
Comprehensive Evaluation
- Video-Bench "Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models". Ning M, Zhu B, Xie Y, et al. arXiv 2023. [Paper] [Github].
- MMBench-Video "MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding". Fang X, Mao K, Duan H, et al. arXiv 2024. [Paper] [Github].
- Video-MME "Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis". Fu C, Dai Y, Luo Y, et al. arXiv 2024. [Paper] [Github].
- AutoEval-Video "AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering". Chen X, Lin Y, Zhang Y, et al. arXiv 2023. [Paper] [Github].
- MMWorld "MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos". He X, Feng W, Zheng K, et al. arXiv 2024. [Paper] [Github].
- WorldNet "WorldGPT: Empowering LLM as Multimodal World Model". Ge Z, Huang H, Zhou M, et al. arXiv 2024. [Paper] [Github].
- Dynamic-SUPERB "Dynamic-SUPERB: Towards a Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech". Huang C, Lu K H, Wang S H, et al. ICASSP 2024. [Paper] [Github].
- MuChoMusic "MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models". Weck B, Manco I, Benetos E, et al. arXiv 2024. [Paper] [Github].
- AIR-Bench "AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension". Yang Q, Xu J, Liu W, et al. arXiv 2024. [Paper] [Github].
- ScanQA "ScanQA: 3D Question Answering for Spatial Scene Understanding". Azuma D, Miyanishi T, Kurita S, et al. CVPR 2022. [Paper] [Github].
- ScanReason "ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities". Zhu C, Wang T, Zhang W, et al. arXiv 2024. [Paper] [Github].
- LAMM "LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark". Yin Z, Wang J, Cao J, et al. NeurIPS 2023. [Paper] [Github].
- SpatialRGPT "SpatialRGPT: Grounded Spatial Reasoning in Vision Language Model". Cheng A C, Yin H, Fu Y, et al. arXiv 2024. [Paper] [Github].
- M3DBench "M3DBench: Let’s Instruct Large Models with Multi-modal 3D Prompts". Li M, Chen X, Zhang C, et al. arXiv 2023. [Paper] [Github].
- MCUB "Model Composition for Multimodal Large Language Models". Chen C, Du Y, Fang Z, et al. arXiv 2024. [Paper] [Github].
- AVQA "AVQA: A Dataset for Audio-Visual Question Answering on Videos". Yang P, Wang X, Duan X, et al. MM 2022. [Paper] [Github].
- MusicAVQA "Learning to Answer Questions in Dynamic Audio-Visual Scenarios". Li G, Wei Y, Tian Y, et al. CVPR 2022. [Paper] [Github].
- MMT-Bench "MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI". Ying K, Meng F, Wang J, et al. arXiv 2024. [Paper] [Github].