Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Swiss Federal Institute of Technology Lausanne (EPFL), Australian National University
Official GitHub repository for Promptception: How Sensitive Are Large Multimodal Models to Prompts?.
- Aug-2025: Promptception is accepted at EMNLP 2025 (Findings)!
- Nov-2025: Mohamed Insaf Ismithdeen will present Promptception as a poster at EMNLP 2025 (Findings Session 3, Nov 7).

Despite the success of Large Multimodal Models (LMMs) in recent years, prompt design for LMMs in Multiple-Choice Question Answering (MCQA) remains poorly understood. We show that even minor variations in prompt phrasing and structure can lead to accuracy deviations of up to 15% for certain prompts and models. This variability poses a challenge for transparent and fair LMM evaluation, as models often report their best-case performance using carefully selected prompts. To address this, we introduce Promptception, a systematic framework for evaluating prompt sensitivity in LMMs. It consists of 61 prompt types, spanning 15 categories and 6 supercategories, each targeting specific aspects of prompt formulation, and is used to evaluate 10 LMMs ranging from lightweight open-source models to GPT-4o and Gemini 1.5 Pro, across 3 MCQA benchmarks: MMStar, MMMU-Pro, and MVBench. Our findings reveal that proprietary models exhibit greater sensitivity to prompt phrasing, reflecting tighter alignment with instruction semantics, while open-source models are steadier but struggle with nuanced and complex phrasing. Based on this analysis, we propose Prompting Principles tailored to proprietary and open-source LMMs, enabling more robust and fair model evaluation.
- Comprehensive Prompt Sensitivity Analysis: We present the most extensive study to date on the impact of prompt variations across diverse multimodal benchmarks and LMM architectures. To facilitate this study, we introduce Promptception, a systematic evaluation framework comprising 61 prompt types organized into 15 categories and 6 supercategories, each designed to probe specific aspects of prompt formulation in LMMs.
- Evaluation Across Models, Modalities, and Benchmarks: We assess prompt sensitivity across a diverse set of model sizes and architectures, including both open-source and proprietary LMMs. Our analysis spans multiple modalities and benchmarks: MMStar (single image), MMMU-Pro (multi-image), and MVBench (video). We further evaluate sensitivity across various question dimensions within these benchmarks to ensure a comprehensive understanding.
- Best Practices for Prompting: We identify key trends in prompting and propose Prompting Principles for effective and consistent evaluation of LMMs.
Download the datasets (MMMU-Pro, MMStar, and MVBench) using this link (zipped).
After downloading and unzipping, arrange them as follows:
```
Datasets/
├── MMMU-Pro/
│   ├── Images-standard/
│   ├── Images-vision/
│   ├── MMMU-Pro_standard_4options.json
│   ├── MMMU-Pro_standard_10options.json
│   └── MMMU-Pro_Vision_no-options.json
├── MMStar/
│   └── MMStar.json
└── MVBench/
    ├── mvbench_videos/
    ├── mvbench.json
    └── mvbench_100.json
```
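Before running anything, you can sanity-check that the files are in place. The snippet below is only an illustrative sketch, not part of the official scripts; it simply verifies that the folders and files from the tree above exist under Datasets/.

```python
from pathlib import Path

# Illustrative layout check for the Datasets/ tree shown above
# (not an official Promptception script).
EXPECTED = [
    "MMMU-Pro/Images-standard",
    "MMMU-Pro/Images-vision",
    "MMMU-Pro/MMMU-Pro_standard_4options.json",
    "MMMU-Pro/MMMU-Pro_standard_10options.json",
    "MMMU-Pro/MMMU-Pro_Vision_no-options.json",
    "MMStar/MMStar.json",
    "MVBench/mvbench_videos",
    "MVBench/mvbench.json",
    "MVBench/mvbench_100.json",
]

root = Path("Datasets")
missing = [p for p in EXPECTED if not (root / p).exists()]
if missing:
    print("Missing entries:")
    for p in missing:
        print(f"  - {p}")
else:
    print("Dataset layout looks complete.")
```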
All prompt templates are provided in the Prompts/ directory as .yaml files.
Select the appropriate file depending on the modality (image/video) and the model type (open-source vs. closed-source, such as GPT-4o or Gemini 1.5 Pro).
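For example, a template file can be inspected with PyYAML. This is only a sketch: the file name `Prompts/prompts_image_open-source.yaml` is a placeholder, and the internal structure of the YAML files may differ, so check the actual files in Prompts/ for the exact names and keys.

```python
import yaml  # pip install pyyaml

# Illustrative sketch only: the file name below is a placeholder;
# inspect the .yaml files in Prompts/ for the actual names and structure.
with open("Prompts/prompts_image_open-source.yaml", "r", encoding="utf-8") as f:
    prompts = yaml.safe_load(f)

# List the available prompt identifiers so you can pick one for inference.
for prompt_id in prompts:
    print(prompt_id)
```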
To replicate our experiments and run inference with Hugging Face Transformers on NVIDIA GPUs, follow the steps below.
Our setup was tested on Python 3.10 with CUDA-enabled PyTorch.
- Clone the repository:
```bash
git clone https://github.com/insafim/Promptception.git
```
- Change directory:
```bash
cd Promptception
```
- Environment setup: We used Python 3.10 with CUDA-enabled PyTorch for GPU inference.
a) Create and activate a new environment:
```bash
conda create --name promptception python=3.10
conda activate promptception
```
b) Install all required dependencies (for both open-source Hugging Face models and closed-source APIs):
```bash
pip install pillow==10.1.0 \
    torch==2.1.2 \
    torchvision==0.16.2 \
    transformers==4.40.0 \
    sentencepiece==0.1.99 \
    decord \
    openai \
    opencv-python \
    google-generativeai
```
To run inference on a specific dataset/model:
```bash
# Example: Inference on MMMU-Pro with GPT-4o
bash Infer/mmmu-pro/infer_mmmu-pro_gpt4o.sh
```
Raw outputs will be saved in:
```
Results/<Dataset>/<Model>/*.json
```
To evaluate the inference results:
```bash
# Example: Evaluation on MMMU-Pro results
bash Evaluate/mmmu-pro/eval_mmmu-pro_all.sh
```
After running the evaluation scripts, you'll get two types of outputs:
- Updated JSONs with extracted answers, saved under `Results/<Dataset>/<Model>/Extract_Choice/*.json`, e.g. `Results/MMMU-Pro/MMMU-Pro_GPT4o/Extract_Choice/mmmu-pro_gpt4o_s4_1.1_updated.json`
- Accuracy reports (Overall + Per-Category), saved as .txt files under `Eval_Output/<Dataset>/<Model>/`, e.g. `Eval_Output/MMMU-Pro/s4/MMMU-Pro_Gemini1.5/eval_mmmu-pro_gpt4o_s4_1.1.txt`
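If you want to recompute accuracy directly from one of the updated JSONs, a minimal sketch is shown below. It assumes the file is a list of records with ground-truth and extracted-answer fields; the field names `answer` and `extracted_choice` are assumptions, so adapt them to the actual schema of the evaluation outputs.

```python
import json

# Illustrative accuracy check over an updated JSON from Extract_Choice/.
# The field names "answer" and "extracted_choice" are assumptions; adjust
# them to match the actual schema of the evaluation outputs.
path = "Results/MMMU-Pro/MMMU-Pro_GPT4o/Extract_Choice/mmmu-pro_gpt4o_s4_1.1_updated.json"

with open(path, "r", encoding="utf-8") as f:
    records = json.load(f)

correct = sum(1 for r in records if r.get("extracted_choice") == r.get("answer"))
print(f"Accuracy: {correct / len(records):.2%} ({correct}/{len(records)})")
```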
If you are using Promptception in your research or applications, please cite using this BibTeX:
@misc{ismithdeen2025promptceptionsensitivelargemultimodal,
title={Promptception: How Sensitive Are Large Multimodal Models to Prompts?},
author={Mohamed Insaf Ismithdeen and Muhammad Uzair Khattak and Salman Khan},
year={2025},
eprint={2509.03986},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.03986},
}
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Looking forward to your feedback, contributions, and stars! Please raise any issues or questions here.


