
Promptception: How Sensitive Are Large Multimodal Models to Prompts? [EMNLP 2025 🔥]


Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Swiss Federal Institute of Technology Lausanne (EPFL), Australian National University

Paper | Website | Dataset

Official GitHub repository for Promptception: How Sensitive Are Large Multimodal Models to Prompts?.

📢 Latest Updates

  • Aug-2025: Promptception is accepted at EMNLP 2025 (Findings)! 🎊🎊
  • Nov-2025: Mohamed Insaf Ismithdeen will be presenting Promptception as a poster at EMNLP 2025 (Findings Session 3, Nov 7). 📍✨

Overview


Despite the success of Large Multimodal Models (LMMs) in recent years, prompt design for LMMs in Multiple-Choice Question Answering (MCQA) remains poorly understood. We show that even minor variations in prompt phrasing and structure can lead to accuracy deviations of up to 15% for certain prompts and models. This variability poses a challenge for transparent and fair LMM evaluation, as reported results often reflect best-case performance obtained with carefully selected prompts. To address this, we introduce Promptception, a systematic framework for evaluating prompt sensitivity in LMMs. It consists of 61 prompt types, spanning 15 categories and 6 supercategories, each targeting specific aspects of prompt formulation, and is used to evaluate 10 LMMs ranging from lightweight open-source models to GPT-4o and Gemini 1.5 Pro, across 3 MCQA benchmarks: MMStar, MMMU-Pro, and MVBench. Our findings reveal that proprietary models exhibit greater sensitivity to prompt phrasing, reflecting tighter alignment with instruction semantics, while open-source models are steadier but struggle with nuanced and complex phrasing. Based on this analysis, we propose Prompting Principles tailored to proprietary and open-source LMMs, enabling more robust and fair model evaluation.


๐Ÿ† Highlights

  1. Comprehensive Prompt Sensitivity Analysis: We present the most extensive study to date on the impact of prompt variations across diverse multimodal benchmarks and LMM architectures. To facilitate this study, we introduce Promptception, a systematic evaluation framework comprising 61 prompt types, organized into 15 categories and 6 supercategories, each designed to probe specific aspects of prompt formulation in LMMs.
  2. Evaluation Across Models, Modalities, and Benchmarks: We assess prompt sensitivity across a diverse set of model sizes and architectures, including both open-source and proprietary LMMs. Our analysis spans multiple modalities and benchmarks: MMStar (single image), MMMU-Pro (multi-image), and MVBench (video). We further evaluate sensitivity across various question dimensions within these benchmarks to ensure a comprehensive understanding.
  3. Best Practices for Prompting: We identify key trends in prompting and propose Prompting Principles for effective and consistent evaluation of LMMs.

Getting started with Promptception

Downloading and Setting Up the Datasets

Download the datasets (MMMU-Pro, MMStar, and MVBench) using this link (zipped).

After downloading and unzipping, arrange them as follows:

Datasets/
|-- MMMU-Pro/
|   |-- Images-standard/
|   |-- Images-vision/
|   |-- MMMU-Pro_standard_4options.json
|   |-- MMMU-Pro_standard_10options.json
|   |-- MMMU-Pro_Vision_no-options.json
|
|-- MMStar/
|   |-- MMStar.json
|
|-- MVBench/
|   |-- mvbench_videos/
|   |-- mvbench.json
|   |-- mvbench_100.json
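To sanity-check the layout before running inference, here is a minimal sketch (assuming exactly the folder and file names shown above) that reports anything missing:

# check_datasets.py -- layout check; assumes the paths shown in the tree above
import os

EXPECTED = [
    "MMMU-Pro/Images-standard",
    "MMMU-Pro/Images-vision",
    "MMMU-Pro/MMMU-Pro_standard_4options.json",
    "MMMU-Pro/MMMU-Pro_standard_10options.json",
    "MMMU-Pro/MMMU-Pro_Vision_no-options.json",
    "MMStar/MMStar.json",
    "MVBench/mvbench_videos",
    "MVBench/mvbench.json",
    "MVBench/mvbench_100.json",
]

missing = [p for p in EXPECTED if not os.path.exists(os.path.join("Datasets", p))]
print("All expected files found." if not missing else f"Missing: {missing}")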

Prompts


All prompt templates are provided in the Prompts/ directory as .yaml files.
Select the appropriate file depending on the modality (image/video) and the model type (open-source vs. closed-source, such as GPT-4o and Gemini 1.5 Pro).
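As a rough illustration of how a template file can be read (the file name and key layout below are hypothetical; see the actual .yaml files in Prompts/ for the real structure):

# Illustrative only: the file name and YAML layout below are hypothetical.
import yaml  # pip install pyyaml

with open("Prompts/image_open-source.yaml") as f:
    prompts = yaml.safe_load(f)

# Assuming a flat mapping of prompt IDs to template strings:
for prompt_id, template in list(prompts.items())[:3]:
    print(f"{prompt_id}: {template}")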


🛠️ Setup and Usage

To replicate our experiments and run inference with Hugging Face Transformers on NVIDIA GPUs, follow the steps below.
Our setup was tested on Python 3.10 with CUDA-enabled PyTorch.

  1. Clone the repository:
git clone https://github.com/insafim/Promptception.git
  2. Change directory:
cd Promptception
  3. Environment setup:

    a) Create and activate a new environment:

    conda create --name promptception python=3.10
    conda activate promptception

    b) Install all required dependencies (for both open-source Hugging Face models and closed-source APIs):

    pip install pillow==10.1.0 \
                torch==2.1.2 \
                torchvision==0.16.2 \
                transformers==4.40.0 \
                sentencepiece==0.1.99 \
                decord \
                openai \
                opencv-python \
                google-generativeai
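
After installation, a quick sanity check (not part of the repository) confirms that the pinned versions are in place and that PyTorch can see a GPU:

# Environment check; not part of the Promptception scripts.
import torch, torchvision, transformers

print("torch:", torch.__version__)                # expected 2.1.2
print("torchvision:", torchvision.__version__)    # expected 0.16.2
print("transformers:", transformers.__version__)  # expected 4.40.0
print("CUDA available:", torch.cuda.is_available())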

🔮 Inference

To run inference on a specific dataset/model:

# Example: Inference on MMMU-Pro with GPT-4o
bash Infer/mmmu-pro/infer_mmmu-pro_gpt4o.sh

Raw outputs will be saved in:

Results/<Dataset>/<Model>/*.json
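
To get a quick look at what was produced (the model folder below reuses the MMMU-Pro/GPT-4o example; the JSON schema itself is whatever the inference script writes):

# Illustrative only: the output schema is defined by the inference scripts.
import glob, json

for path in glob.glob("Results/MMMU-Pro/MMMU-Pro_GPT4o/*.json"):
    with open(path) as f:
        records = json.load(f)
    print(f"{path}: {len(records)} entries")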

📊 Evaluation

To evaluate the inference results:

# Example: Evaluation on MMMU-Pro results
bash Evaluate/mmmu-pro/eval_mmmu-pro_all.sh

After running the evaluation scripts, you'll get two types of outputs:

  1. Updated JSONs with extracted answers saved under Results/<Dataset>/<Model>/Extract_Choice/*.json

    Results/MMMU-Pro/MMMU-Pro_GPT4o/Extract_Choice/mmmu-pro_gpt4o_s4_1.1_updated.json
  2. Accuracy reports (Overall + Per-Category) saved as .txt files under Eval_Output/<Dataset>/<Model>/

    Eval_Output/MMMU-Pro/s4/MMMU-Pro_Gemini1.5/eval_mmmu-pro_gpt4o_s4_1.1.txt
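
To recompute an overall accuracy figure yourself from one of the updated JSONs (output type 1 above), a minimal sketch follows; the field names "answer" and "extracted_choice" are assumptions, not necessarily what the evaluation scripts emit:

# Illustrative only: assumes a list of per-question records with
# "answer" and "extracted_choice" fields (assumed names).
import json

path = "Results/MMMU-Pro/MMMU-Pro_GPT4o/Extract_Choice/mmmu-pro_gpt4o_s4_1.1_updated.json"
with open(path) as f:
    records = json.load(f)

correct = sum(
    r.get("extracted_choice") is not None and r.get("extracted_choice") == r.get("answer")
    for r in records
)
print(f"Accuracy: {correct / len(records):.2%} over {len(records)} questions")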

Citation 📜

If you are using Promptception in your research or applications, please cite using this BibTeX:

@misc{ismithdeen2025promptceptionsensitivelargemultimodal,
      title={Promptception: How Sensitive Are Large Multimodal Models to Prompts?}, 
      author={Mohamed Insaf Ismithdeen and Muhammad Uzair Khattak and Salman Khan},
      year={2025},
      eprint={2509.03986},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.03986}, 
}

License 📜

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Looking forward to your feedback, contributions, and stars! 🌟 Please raise any issues or questions here.
