🧩 Visual Puzzles: Evaluating Vision-Language Model Reasoning with Rebus Puzzles

🔥 Official repository for the project "Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint"

Dataset   arXiv   EMNLP 2025


Table of Contents

  • Overview
  • Motivation
  • Dataset
  • Cognitive Skill Categories
  • Evaluation Metrics & Scripts
  • Annotation Tools
  • How to Run
  • Citing this work
  • Acknowledgements


Overview

This codebase supports the research and experiments from "Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint".

Authors: Heekyung Lee, Jiaxin Ge, Tsung-Han Wu, Minwoo Kang, Trevor Darrell, David M. Chan (POSTECH & UC Berkeley)

We introduce a probe dataset of 432 hand-annotated English rebus puzzles, each requiring the integration of imagery, spatial arrangement, and symbolic reasoning, challenging VLMs far beyond rote image captioning or straightforward question answering.


Motivation

Recent VLMs have excelled at direct visual-text alignment, but fundamental gaps remain in their ability to solve tasks requiring abstract reasoning, compositional thinking, and cultural or phonetic inference. Rebus puzzles (visual riddles that encode phrases or concepts through images, wordplay, and spatial logic) are a demanding testbed for these higher-order cognitive skills.

Our goal is to systematically probe the capabilities and limitations of VLMs in visual-linguistic reasoning using a carefully curated and categorized benchmark, with human and model baselines for reference.


Dataset

  • 432 hand-crafted rebus puzzles with curated images and answers
  • Each puzzle is annotated with one or more of 11 cognitive skill categories (see below)
  • Images sourced and quality-checked for consistency
  • Each puzzle includes:
    • Puzzle image
    • Ground truth answer
    • Skill category annotations

Sample puzzle: the word "WATER" written in a downward-curving shape encodes the answer "Waterfall".
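For illustration only, a single benchmark entry can be thought of as a record like the one below. The field names and skill assignments are hypothetical, not the released dataset's actual schema:

```python
# Hypothetical puzzle record, for illustration only; the released dataset's
# file format and field names may differ.
puzzle = {
    "image": "images/example_waterfall.png",  # path to the puzzle image
    "answer": "waterfall",                    # ground-truth answer phrase
    "skills": ["TO", "SPR"],                  # cognitive skill category codes (illustrative)
}
```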


Cognitive Skill Categories

Puzzles are annotated by required cognitive skill(s):

  • Absence or Negation (AN): Recognizing missing or negated elements
  • Font Style/Size (FS): Interpreting clues from font differences
  • Image Recognition (IR): Identifying objects, symbols, or people
  • Letter and Word Manipulation (LWM): Overlapping/hiding/repeating letters to form new meanings
  • Phonetics and Wordplay (PW): Solving with homophones or puns
  • Quantitative/Mathematical Reasoning (QMR): Object counting, math symbols, quantitative logic
  • Spatial and Positional Reasoning (SPR): Understanding layout or relative positioning
  • Symbolic Substitution (SS): Replacing words or letters with numbers, emojis, or other symbols
  • Text Orientation (TO): Interpreting rotated/flipped text
  • Text Recognition (TR): Detecting stylized text or fonts
  • Visual Metaphors & Cultural References (VMCR): Idioms, memes, or metaphorical representations

See the paper Appendix D for detailed definitions and dataset statistics.


Evaluation Metrics & Scripts

We provide multiple evaluation strategies and scripts:

1. Naive Matching

  • Script: eval/eval_bootstrap.py, eval/eval_human_files.py
  • Description: Checks for an exact (case-insensitive, whitespace-insensitive) string match between the model output and the ground-truth answer; a minimal sketch follows below.
  • Usage: Set LLM_AS_JUDGE = False in the script or config.
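For concreteness, a minimal sketch of such a check (not the exact code in eval/eval_bootstrap.py) could look like this:

```python
import re

def naive_match(prediction: str, answer: str) -> bool:
    """Exact string match after lowercasing and removing all whitespace."""
    normalize = lambda s: re.sub(r"\s+", "", s.lower())
    return normalize(prediction) == normalize(answer)

# Example: differences in case and spacing are ignored.
assert naive_match("Water Fall", "waterfall")
assert not naive_match("water fountain", "waterfall")
```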

2. LLM-Judged Evaluation

  • Script: eval/eval_bootstrap.py, eval/eval_human_files.py
  • Description: Uses GPT-4o (or another LLM) to judge whether the model prediction is semantically equivalent to the ground truth, allowing for minor spelling/formatting errors; a sketch of the judging call follows below.
  • Usage: Set LLM_AS_JUDGE = True in the script or config. The LLM is prompted to respond with only "yes" or "no" for each prediction.
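A rough sketch of the judging call, assuming the openai Python client and a prompt along these lines (the exact prompt and settings used in the scripts may differ):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_judge(prediction: str, answer: str, model: str = "gpt-4o") -> bool:
    """Ask an LLM judge whether a prediction matches the ground-truth answer."""
    prompt = (
        "Does the predicted answer match the ground-truth answer, allowing for "
        "minor spelling or formatting differences? Reply with only 'yes' or 'no'.\n"
        f"Prediction: {prediction}\n"
        f"Ground truth: {answer}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=2,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```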

3. CLIP/Image-Text Retrieval Metrics

  • Script: scripts/compute_clip_recall.py, scripts/summarize_clip_results.py
  • Description: Reports CLIP-based image-text retrieval metrics (recall) for the puzzles; see "How to Run" below for the exact invocation. A generic recall@k sketch follows below.
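The repository's scripts handle this end to end; purely as an illustration of the underlying recall@k idea, a generic version using the Hugging Face transformers CLIP implementation (the model name is an assumption, not necessarily the checkpoint evaluated here) might look like:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint chosen for illustration only.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_recall_at_k(image_paths, texts, k=5):
    """Fraction of images whose paired text (image_paths[i] <-> texts[i])
    appears among the top-k texts ranked by CLIP image-text similarity."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_image   # (num_images, num_texts)
    topk = sims.topk(k, dim=-1).indices           # top-k text indices per image
    hits = [int(i in topk[i]) for i in range(len(image_paths))]
    return sum(hits) / len(hits)
```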

4. Skill-Specific and Prompting Evaluations

  • Configurable via: YAML files in conf/
  • Description: Evaluates performance by skill category, with in-context learning (ICL), skill-guided prompts, caption-only inputs, or iterative refinement; a per-skill aggregation sketch follows below.
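As an illustration of the per-skill breakdown (not the repository's exact code), accuracy can be aggregated per cognitive skill category like this, assuming each result carries its correctness and skill annotations:

```python
from collections import defaultdict

def accuracy_by_skill(results):
    """Aggregate accuracy per cognitive skill category.

    `results` is assumed to be an iterable of dicts shaped like
    {"correct": bool, "skills": ["SPR", "TO"]} (hypothetical format).
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        for skill in r["skills"]:
            totals[skill] += 1
            hits[skill] += int(r["correct"])
    return {skill: hits[skill] / totals[skill] for skill in totals}
```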

5. Bootstrapped Confidence Intervals

  • Script: eval/eval_bootstrap.py
  • Description: Accuracy is reported with bootstrap confidence intervals obtained by resampling with replacement over puzzle-level results; a minimal sketch follows below.
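A minimal sketch of a percentile bootstrap over per-puzzle correctness flags (a generic version of the technique, not the repository's exact implementation):

```python
import numpy as np

def bootstrap_ci(correct_flags, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy over 0/1 correctness flags."""
    rng = np.random.default_rng(seed)
    flags = np.asarray(correct_flags, dtype=float)
    # Resample puzzles with replacement and recompute accuracy each time.
    resampled = rng.choice(flags, size=(n_resamples, flags.size), replace=True)
    means = resampled.mean(axis=1)
    lower, upper = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return flags.mean(), (lower, upper)

# Example with hypothetical per-puzzle correctness flags.
acc, (lo, hi) = bootstrap_ci([1, 0, 1, 1, 0, 1, 1, 0])
print(f"accuracy={acc:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```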


Annotation Tools


How to Run

  1. Configure your experiment:
     • Edit the YAML configs in the conf/ directory to select models, prompts, and evaluation settings.
  2. Install dependencies:
     pip3 install -r requirements.txt
  3. Run the evaluation script:
     python3 main.py
     • For CLIP retrieval metrics, use:
       python3 scripts/compute_clip_recall.py --checkpoint ... --model ... --input_folder ... --ground_truth ... --output ...
     • To summarize CLIP results:
       python3 scripts/summarize_clip_results.py metrics/
  4. View results:
     • Model predictions, logs, and evaluation metrics will be written to the specified output directory.
     • CLIP metrics will be saved as JSON files in the metrics/ directory.

Citing this work

If you find this work useful, please cite our paper:

@inproceedings{lee2025puzzled,
  title     = {Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint},
  author    = {Heekyung Lee and Jiaxin Ge and Tsung-Han Wu and Minwoo Kang and Trevor Darrell and David M. Chan},
  year      = {2025},
  booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)},
  url       = {https://arxiv.org/abs/2505.23759}
}

🍀 Acknowledgements

Development supported in part by the National Science Foundation, the Ford Foundation, the BAIR Industrial Alliance, DARPA, and the U.S. Army/AFRL. Special thanks to Lisa Dunlap, XuDong Wang, Konpat Preechakul, Baifeng Shi, and Stephanie Murphy for review and ideation support.
