🔥 Official repository for the project "Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint"
- Overview
- Motivation
- Dataset
- Cognitive Skill Categories
- Evaluation Metrics & Scripts
- Annotation Tools
- How to Run
- Citing this work
- Acknowledgements
This codebase supports the research and experiments from "Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint".
Authors: Heekyung Lee, Jiaxin Ge, Tsung-Han Wu, Minwoo Kang, Trevor Darrell, David M. Chan (POSTECH & UC Berkeley)
We introduce a probe dataset of 432 hand-annotated English rebus puzzles, each requiring the integration of imagery, spatial arrangement, and symbolic reasoning. These puzzles challenge VLMs far beyond rote image captioning or straightforward question answering.
Recent VLMs have excelled at direct visual-text alignment, but fundamental gaps remain in their ability to solve tasks requiring abstract reasoning, compositional thinking, and cultural or phonetic inference. Rebus puzzles, visual riddles that encode phrases or concepts through images, wordplay, and spatial logic, are a demanding testbed for these higher-order cognitive skills.
Our goal: Systematically probe VLMs' capabilities and limitations in visual-linguistic reasoning using a carefully curated and categorized benchmark, with human and model baselines for reference.
- 432 hand-crafted rebus puzzles with curated images and answers
- Each puzzle is annotated with one or more of 11 cognitive skill categories (see below)
- Images sourced and quality-checked for consistency
- Each puzzle includes:
- Puzzle image
- Ground truth answer
- Skill category annotations
Sample puzzle: If the word "WATER" is written in a curved downward shape, the answer is "Waterfall".
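For illustration, the sample puzzle above could be stored as a record like the one below; the field names and image path are assumptions for this sketch, not the repository's actual data schema.

```python
# Hypothetical layout of one annotated puzzle record (field names and the
# image path are illustrative assumptions, not the repository's schema).
sample_puzzle = {
    "image": "images/waterfall.png",  # puzzle image
    "answer": "waterfall",            # ground-truth answer
    "skills": ["TR", "SPR"],          # cognitive skill categories (see below)
}
```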
Puzzles are annotated by required cognitive skill(s):
- Absence or Negation (AN): Recognizing missing or negated elements
- Font Style/Size (FS): Interpreting clues from font differences
- Image Recognition (IR): Identifying objects, symbols, or people
- Letter and Word Manipulation (LWM): Overlapping/hiding/repeating letters to form new meanings
- Phonetics and Wordplay (PW): Solving with homophones or puns
- Quantitative/Mathematical Reasoning (QMR): Object counting, math symbols, quantitative logic
- Spatial and Positional Reasoning (SPR): Understanding layout or relative positioning
- Symbolic Substitution (SS): Replacing letters or words with numbers, emojis, or other symbols
- Text Orientation (TO): Interpreting rotated/flipped text
- Text Recognition (TR): Detecting stylized text or fonts
- Visual Metaphors & Cultural References (VMCR): Idioms, memes, or metaphorical representations
See Appendix D of the paper for detailed definitions and dataset statistics.
We provide multiple evaluation strategies and scripts:
- Exact string match
  - Scripts: `eval/eval_bootstrap.py`, `eval/eval_human_files.py`
  - Description: Checks for an exact (case-insensitive, space-insensitive) string match between the model output and the ground-truth answer (see the sketches after this list).
  - Usage: Set `LLM_AS_JUDGE = False` in the script or config.
- LLM-as-judge
  - Scripts: `eval/eval_bootstrap.py`, `eval/eval_human_files.py`
  - Description: Uses GPT-4o (or another LLM) to judge whether the model prediction is semantically equivalent to the ground truth, allowing for minor spelling/formatting errors (see the sketches after this list).
  - Usage: Set `LLM_AS_JUDGE = True` in the script or config. The LLM is prompted to respond with only "yes" or "no" for each prediction.
- CLIP retrieval
  - Scripts: `scripts/compute_clip_recall.py`, `scripts/summarize_clip_results.py`
  - Metrics: Recall@K, Precision@1, MRR, NDCG, etc. (see the sketches after this list)
  - Description: Evaluates retrieval performance using CLIP or similar models. Summarize results across multiple models with the summary script.
- Skill-based and prompting-strategy evaluation
  - Description: Evaluates performance by skill category, with in-context learning (ICL), skill-guided prompts, caption-only input, or iterative refinement.
  - Configurable via: YAML files in `conf/`.
- Bootstrapped accuracy
  - Scripts: `eval/eval_human_files.py`, `eval/eval_bootstrap.py`
  - Description: Reports mean accuracy and 95% confidence intervals via bootstrapping (see the sketches after this list).
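For concreteness, the exact-match criterion reduces to a normalize-and-compare check; the helper below is an illustrative sketch, not the actual code in `eval/eval_bootstrap.py`.

```python
import re

def exact_match(prediction: str, ground_truth: str) -> bool:
    """Case-insensitive, whitespace-insensitive string comparison (sketch)."""
    def normalize(s: str) -> str:
        return re.sub(r"\s+", "", s).lower()
    return normalize(prediction) == normalize(ground_truth)

print(exact_match("Water Fall", "waterfall"))  # True
```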
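The LLM-as-judge setting can be sketched with the `openai` Python client as follows; the prompt wording and default model name here are assumptions, not the exact prompt used by the eval scripts.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_judge(prediction: str, ground_truth: str, model: str = "gpt-4o") -> bool:
    """Ask an LLM whether a prediction matches the ground truth (sketch)."""
    prompt = (
        "Are the following two answers to a rebus puzzle semantically equivalent, "
        "ignoring minor spelling or formatting differences?\n"
        f"Prediction: {prediction}\nGround truth: {ground_truth}\n"
        "Respond with only 'yes' or 'no'."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```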
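The retrieval metrics are rank statistics over an image-to-answer similarity matrix. The sketch below is a simplified stand-in for `scripts/compute_clip_recall.py` (assuming the correct candidate for query i sits at index i) and shows Recall@K, Precision@1, and MRR.

```python
import numpy as np

def retrieval_metrics(similarity: np.ndarray, k: int = 5) -> dict:
    """Rank-based retrieval metrics for a [queries x candidates] similarity matrix (sketch)."""
    ranks = []
    for i, scores in enumerate(similarity):
        order = np.argsort(-scores)  # candidates sorted by descending similarity
        ranks.append(int(np.where(order == i)[0][0]) + 1)  # 1-based rank of the correct candidate
    ranks = np.asarray(ranks)
    return {
        f"recall@{k}": float(np.mean(ranks <= k)),
        "precision@1": float(np.mean(ranks == 1)),
        "mrr": float(np.mean(1.0 / ranks)),
    }
```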
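Finally, the bootstrapped accuracy amounts to resampling per-puzzle correctness with replacement and reading off the 2.5th/97.5th percentiles; a minimal sketch (not the actual implementation in `eval/eval_bootstrap.py`):

```python
import numpy as np

def bootstrap_accuracy(correct, n_boot: int = 10_000, seed: int = 0):
    """Mean accuracy with a 95% bootstrap confidence interval (sketch).

    `correct` is a 0/1 array marking whether each prediction was judged correct.
    """
    correct = np.asarray(correct, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(correct)
    means = np.array([correct[rng.integers(0, n, size=n)].mean() for _ in range(n_boot)])
    lower, upper = np.percentile(means, [2.5, 97.5])
    return correct.mean(), (lower, upper)
```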
- Ground truth annotation: Use `scripts/annotate_ground_truth.py` (Gradio UI) to label answers for each puzzle image.
- Skill annotation: Use `scripts/annotate_skills.py` (Gradio UI) to assign cognitive skill categories to each puzzle.
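Both tools are Gradio apps. The snippet below is a minimal sketch of how such a labeling interface could be wired up; the layout, output file, and field names are illustrative assumptions rather than the repository's actual annotation code.

```python
import json
import gradio as gr

def save_annotation(image_path, answer):
    """Append one ground-truth label to a JSONL file (illustrative only)."""
    with open("annotations.jsonl", "a") as f:
        f.write(json.dumps({"image": image_path, "answer": answer}) + "\n")
    return f"Saved answer for {image_path}"

with gr.Blocks() as demo:
    image_path = gr.Textbox(label="Puzzle image path")
    image_view = gr.Image(label="Puzzle image")
    answer = gr.Textbox(label="Ground-truth answer")
    status = gr.Markdown()
    # Show the image whenever the path changes, and save on button click.
    image_path.change(lambda p: p, inputs=image_path, outputs=image_view)
    gr.Button("Save").click(save_annotation, inputs=[image_path, answer], outputs=status)

demo.launch()
```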
- Configure your experiment:
  - Edit the YAML configs in the `conf/` directory to select models, prompts, and evaluation settings.
- Install dependencies: `pip3 install -r requirements.txt`
- Run the evaluation script: `python3 main.py`
  - For CLIP retrieval metrics, use: `python3 scripts/compute_clip_recall.py --checkpoint ... --model ... --input_folder ... --ground_truth ... --output ...`
  - To summarize CLIP results: `python3 scripts/summarize_clip_results.py metrics/`
- View results:
  - Model predictions, logs, and evaluation metrics are written to the specified output directory.
  - CLIP metrics are saved as JSON files in the `metrics/` directory.
If you find this work useful, please cite our paper:
```
@inproceedings{lee2025puzzled,
  title     = {Puzzled by Puzzles: When Vision-Language Models Can’t Take a Hint},
  author    = {Heekyung Lee and Jiaxin Ge and Tsung-Han Wu and Minwoo Kang and Trevor Darrell and David M. Chan},
  year      = {2025},
  booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)},
  url       = {https://arxiv.org/abs/2505.23759}
}
```

Development supported in part by the National Science Foundation, the Ford Foundation, the BAIR Industrial Alliance, DARPA, and the U.S. Army/AFRL. Special thanks to Lisa Dunlap, XuDong Wang, Konpat Preechakul, Baifeng Shi, and Stephanie Murphy for review and ideation support.