Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Swiss Federal Institute of Technology Lausanne (EPFL), Australian National University
Official GitHub repository for Promptception: How Sensitive Are Large Multimodal Models to Prompts?.
- Aug-2025: Promptception is accepted at EMNLP 2025 (Findings)!
- Nov-2025: Mohamed Insaf Ismithdeen will present Promptception as a poster at EMNLP 2025 (Findings Session 3, Nov 7).

Despite the success of Large Multimodal Models (LMMs) in recent years, prompt design for LMMs in Multiple-Choice Question Answering (MCQA) remains poorly understood. We show that even minor variations in prompt phrasing and structure can lead to accuracy deviations of up to 15% for certain prompts and models. This variability poses a challenge for transparent and fair LMM evaluation, as models often report their best-case performance using carefully selected prompts. To address this, we introduce Promptception, a systematic framework for evaluating prompt sensitivity in LMMs. It consists of 61 prompt types, spanning 15 categories and 6 supercategories, each targeting specific aspects of prompt formulation, and is used to evaluate 10 LMMs ranging from lightweight open-source models to GPT-4o and Gemini 1.5 Pro, across 3 MCQA benchmarks: MMStar, MMMU-Pro, and MVBench. Our findings reveal that proprietary models exhibit greater sensitivity to prompt phrasing, reflecting tighter alignment with instruction semantics, while open-source models are steadier but struggle with nuanced and complex phrasing. Based on this analysis, we propose Prompting Principles tailored to proprietary and open-source LMMs, enabling more robust and fair model evaluation.
- Comprehensive Prompt Sensitivity Analysis: We present the most extensive study to date on the impact of prompt variations across diverse multimodal benchmarks and LMM architectures. To facilitate this study, we introduce Promptception, a systematic evaluation framework comprising 61 prompt types organized into 15 categories and 6 supercategories, each designed to probe specific aspects of prompt formulation in LMMs.
- Evaluation Across Models, Modalities, and Benchmarks: We assess prompt sensitivity across a diverse set of model sizes and architectures, including both open-source and proprietary LMMs. Our analysis spans multiple modalities and benchmarks: MMStar (single image), MMMU-Pro (multi-image), and MVBench (video). We further evaluate sensitivity across various question dimensions within these benchmarks to ensure a comprehensive understanding.
- Best Practices for Prompting: We identify key trends in prompting and propose Prompting Principles for effective and consistent evaluation of LMMs.
Download the datasets (MMMU-Pro, MMStar, and MVBench) using this link (zipped).
After downloading and unzipping, arrange them as follows:
```
Datasets/
├── MMMU-Pro/
│   ├── Images-standard/
│   ├── Images-vision/
│   ├── MMMU-Pro_standard_4options.json
│   ├── MMMU-Pro_standard_10options.json
│   └── MMMU-Pro_Vision_no-options.json
├── MMStar/
│   └── MMStar.json
└── MVBench/
    ├── mvbench_videos/
    ├── mvbench.json
    └── mvbench_100.json
```
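Before running anything, you can sanity-check that the files are in place. The snippet below is only an illustrative sketch, not part of the official scripts; it simply verifies that the folders and files from the tree above exist under Datasets/.

```python
from pathlib import Path

# Illustrative layout check for the Datasets/ tree shown above
# (not an official Promptception script).
EXPECTED = [
    "MMMU-Pro/Images-standard",
    "MMMU-Pro/Images-vision",
    "MMMU-Pro/MMMU-Pro_standard_4options.json",
    "MMMU-Pro/MMMU-Pro_standard_10options.json",
    "MMMU-Pro/MMMU-Pro_Vision_no-options.json",
    "MMStar/MMStar.json",
    "MVBench/mvbench_videos",
    "MVBench/mvbench.json",
    "MVBench/mvbench_100.json",
]

root = Path("Datasets")
missing = [p for p in EXPECTED if not (root / p).exists()]
if missing:
    print("Missing entries:")
    for p in missing:
        print(f"  - {p}")
else:
    print("Dataset layout looks complete.")
```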
All prompt templates are provided in the Prompts/ directory as .yaml files.
Select the appropriate file depending on the modality (image/video) and the model type (open-source vs. closed-source, such as GPT-4o or Gemini 1.5 Pro).
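For example, a template file can be inspected with PyYAML. This is only a sketch: the file name `Prompts/prompts_image_open-source.yaml` is a placeholder, and the internal structure of the YAML files may differ, so check the actual files in Prompts/ for the exact names and keys.

```python
import yaml  # pip install pyyaml

# Illustrative sketch only: the file name below is a placeholder;
# inspect the .yaml files in Prompts/ for the actual names and structure.
with open("Prompts/prompts_image_open-source.yaml", "r", encoding="utf-8") as f:
    prompts = yaml.safe_load(f)

# List the available prompt identifiers so you can pick one for inference.
for prompt_id in prompts:
    print(prompt_id)
```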
To replicate our experiments and run inference with Hugging Face Transformers on NVIDIA GPUs, follow the steps below.
Our setup was tested on Python 3.10 with CUDA-enabled PyTorch.
- Clone the repository:
```bash
git clone https://github.com/insafim/Promptception.git
```
- Change directory:
```bash
cd Promptception
```
- Environment setup: We used Python 3.10 with CUDA-enabled PyTorch for GPU inference.
a) Create and activate a new environment:
```bash
conda create --name promptception python=3.10
conda activate promptception
```
b) Install all required dependencies (for both open-source Hugging Face models and closed-source APIs):
```bash
pip install pillow==10.1.0 \
    torch==2.1.2 \
    torchvision==0.16.2 \
    transformers==4.40.0 \
    sentencepiece==0.1.99 \
    decord \
    openai \
    opencv-python \
    google-generativeai
```
To run inference on a specific dataset/model:
```bash
# Example: Inference on MMMU-Pro with GPT-4o
bash Infer/mmmu-pro/infer_mmmu-pro_gpt4o.sh
```
Raw outputs will be saved in:
```
Results/<Dataset>/<Model>/*.json
```
To evaluate the inference results:
```bash
# Example: Evaluation on MMMU-Pro results
bash Evaluate/mmmu-pro/eval_mmmu-pro_all.sh
```
After running the evaluation scripts, you'll get two types of outputs:
- Updated JSONs with extracted answers, saved under `Results/<Dataset>/<Model>/Extract_Choice/*.json`, e.g. `Results/MMMU-Pro/MMMU-Pro_GPT4o/Extract_Choice/mmmu-pro_gpt4o_s4_1.1_updated.json`
- Accuracy reports (Overall + Per-Category), saved as .txt files under `Eval_Output/<Dataset>/<Model>/`, e.g. `Eval_Output/MMMU-Pro/s4/MMMU-Pro_Gemini1.5/eval_mmmu-pro_gpt4o_s4_1.1.txt`
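If you want to recompute accuracy directly from one of the updated JSONs, a minimal sketch is shown below. It assumes the file is a list of records with ground-truth and extracted-answer fields; the field names `answer` and `extracted_choice` are assumptions, so adapt them to the actual schema of the evaluation outputs.

```python
import json

# Illustrative accuracy check over an updated JSON from Extract_Choice/.
# The field names "answer" and "extracted_choice" are assumptions; adjust
# them to match the actual schema of the evaluation outputs.
path = "Results/MMMU-Pro/MMMU-Pro_GPT4o/Extract_Choice/mmmu-pro_gpt4o_s4_1.1_updated.json"

with open(path, "r", encoding="utf-8") as f:
    records = json.load(f)

correct = sum(1 for r in records if r.get("extracted_choice") == r.get("answer"))
print(f"Accuracy: {correct / len(records):.2%} ({correct}/{len(records)})")
```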
If you are using Promptception in your research or applications, please cite using this BibTeX:
@misc{ismithdeen2025promptceptionsensitivelargemultimodal,
title={Promptception: How Sensitive Are Large Multimodal Models to Prompts?},
author={Mohamed Insaf Ismithdeen and Muhammad Uzair Khattak and Salman Khan},
year={2025},
eprint={2509.03986},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.03986},
}
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Looking forward to your feedback, contributions, and stars! Please raise any issues or questions here.


