🌐 Homepage | 🔬 Paper | 👩‍💻 Code | 📊 Dataset | 📈 Evaluation | 🏆 Leaderboard
MMComposition aims to provide a comprehensive assessment of compositionality in Vision-Language Models (VLMs) -- the ability to understand and produce novel combinations of known visual and textual components. The benchmark is designed to help researchers and practitioners better understand the capabilities, limitations, and key areas for improvement of current VLMs. MMComposition comprises 13 complex vision-language composition tasks:
- Attribute Perception
- Object Perception
- Counting Perception
- Relation Perception
- Difference Spotting
- Text Rendering
- Visual Similarity
- Attribute Reasoning
- Object Reasoning
- Counting Reasoning
- Relation Reasoning
- Object Interaction
- Compositional Probing
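
For reference, the sketch below shows one way a multiple-choice benchmark like this can be scored per task. The annotation file name (`mmcomposition.json`) and field names (`image`, `question`, `choices`, `answer`, `task`) are assumptions for illustration only and may differ from the released data format; see the Evaluation link above for the official scripts.

```python
import json

def evaluate(predict_fn, annotation_file="mmcomposition.json"):
    """Score a VLM on multiple-choice questions, reporting per-task accuracy.

    `predict_fn(image_path, question, choices)` should return the predicted
    choice letter, e.g. "A". Field names below are illustrative assumptions,
    not the official MMComposition schema.
    """
    with open(annotation_file) as f:
        samples = json.load(f)

    per_task = {}  # task name -> (num correct, num total)
    for s in samples:
        pred = predict_fn(s["image"], s["question"], s["choices"])
        correct, total = per_task.get(s["task"], (0, 0))
        per_task[s["task"]] = (correct + (pred == s["answer"]), total + 1)

    for task, (correct, total) in sorted(per_task.items()):
        print(f"{task:25s} accuracy: {correct / total:.2%} ({correct}/{total})")

if __name__ == "__main__":
    # Dummy model that always answers "A", just to illustrate the interface.
    evaluate(lambda image, question, choices: "A")
```

In practice, `predict_fn` would wrap an actual VLM call (prompting the model with the image, question, and candidate answers, then parsing out the chosen option).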
```bibtex
@article{hua2024mmcomposition,
  title={MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models},
  author={Hua, Hang and Tang, Yunlong and Zeng, Ziyun and Cao, Liangliang and Yang, Zhengyuan and He, Hangfeng and Xu, Chenliang and Luo, Jiebo},
  journal={arXiv preprint arXiv:2410.09733},
  year={2024}
}
```