Conversation

Contributor

@Luodian Luodian commented Jan 22, 2026

Summary

  • Add MMVP (Multimodal Visual Patterns) benchmark task
  • Apply verified ground-truth corrections for indices 99 and 279, as documented in #1018 (Issues about MMVP Dataset)

Description

MMVP is a benchmark that tests VLMs on "CLIP-blind pairs": pairs of images that CLIP perceives as similar despite clear visual differences. The dataset contains 300 samples (150 pairs) covering 9 basic visual patterns.

Features

  1. Dataset: Loads MMVP/MMVP from the Hugging Face Hub
  2. Metrics:
    • mmvp_accuracy: Individual question accuracy
    • mmvp_pair_accuracy: Both questions in a CLIP-blind pair must be correct (stricter metric)
  3. Ground Truth Corrections: Applies verified corrections (see the sketch after this list) for:
    • Index 99: Elephant tusks are long, not short (corrected from B to A)
    • Index 279: Person is standing, not sitting (corrected from B to A)
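
For context, here is a minimal sketch of how the dataset load and the two label corrections could be wired together. It assumes the Hugging Face `datasets` library; the column names (`Index`, `Correct Answer`) and the split name are illustrative assumptions and may differ from what the task actually uses.

```python
from datasets import load_dataset

# Verified ground-truth fixes from issue #1018: dataset index -> corrected answer.
GT_CORRECTIONS = {99: "A", 279: "A"}

def load_mmvp_with_corrections(split="test"):
    # Column names ("Index", "Correct Answer") and the split name are
    # illustrative assumptions; the task should use whatever the
    # MMVP/MMVP dataset actually exposes.
    ds = load_dataset("MMVP/MMVP", split=split)

    def patch(example):
        idx = int(example["Index"])
        if idx in GT_CORRECTIONS:
            example["Correct Answer"] = GT_CORRECTIONS[idx]
        return example

    return ds.map(patch)
```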

Usage

python -m lmms_eval --model <model> --tasks mmvp --batch_size 1

References

Add MMVP (Multimodal Visual Patterns) benchmark task that tests VLMs
on CLIP-blind pairs: images perceived as similar by CLIP but with
clear visual differences.

Key features:
- Loads dataset from MMVP/MMVP on HuggingFace
- Reports both individual accuracy and pair accuracy metrics
- Applies verified ground truth corrections for indices 99 and 279
  as documented in issue #1018

The pair accuracy metric requires models to correctly answer BOTH
questions in each CLIP-blind pair, providing a stricter evaluation
of genuine visual understanding.

Github-Issue: #1018
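
As a concrete illustration of the pair metric, here is a minimal sketch of how both scores could be derived from per-question correctness. It assumes results are ordered by dataset index so that questions 2i and 2i+1 form one CLIP-blind pair; the helper is hypothetical, not the task's actual implementation.

```python
def mmvp_scores(correct_flags):
    # correct_flags: list of 0/1 per-question results ordered by dataset index,
    # so consecutive entries (2i, 2i+1) belong to the same CLIP-blind pair
    # (pairing convention assumed for illustration).
    n = len(correct_flags)
    assert n % 2 == 0, "MMVP has 300 questions forming 150 pairs"

    individual = sum(correct_flags) / n
    # A pair counts only if BOTH of its questions are answered correctly.
    pair_hits = [correct_flags[i] and correct_flags[i + 1] for i in range(0, n, 2)]
    return {
        "mmvp_accuracy": individual,
        "mmvp_pair_accuracy": sum(pair_hits) / len(pair_hits),
    }
```

For example, mmvp_scores([1, 1, 1, 0]) yields 0.75 individual accuracy but only 0.5 pair accuracy, which is why the pair metric is the stricter of the two.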