A minimal Python scaffold for benchmarking Vision-Language Models (VLMs) on visual reasoning tasks.
Visual Reasoning Bench provides a clean, modular architecture for evaluating VLMs. It includes:
- Datasets: Extensible dataset interface yielding {id, image_path, question, answer}
- Models: Base model class with a predict(image_path, question) → str interface
- Evaluation: Pipeline for running models on datasets and computing accuracy metrics
- Utilities: I/O and image processing helpers
visual-reasoning-bench/
├── bench/
│   ├── datasets/
│   │   ├── base.py           # Base dataset class
│   │   └── pathfinder.py     # Pathfinder visual reasoning dataset
│   ├── models/
│   │   ├── base.py           # Base model interface
│   │   ├── llava.py          # LLaVA model wrapper
│   │   └── openrouter.py     # OpenRouter vision wrapper
│   ├── evaluate/
│   │   ├── evaluator.py      # Evaluation pipeline
│   │   └── metrics.py        # Accuracy and other metrics
│   └── utils/
│       ├── io.py             # Config loading, result saving
│       └── images.py         # Image loading and preprocessing
├── scripts/
│   └── run_eval.py           # Main evaluation script
├── configs/
│   └── example.yaml          # Example configuration
└── website/
    └── index.html            # Project landing page
git clone https://github.com/serre-lab/visual-reasoning-bench.git
cd visual-reasoning-bench
python scripts/run_eval.py --config configs/example.yaml --verbose

- --config: Path to the YAML configuration file (default: configs/example.yaml)
- --output: Path to save results (file or directory; default: results)
- --verbose: Show a progress bar during evaluation
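For example, to run the example config while writing results to a specific file and showing progress (the output filename below is only an illustration):

```bash
python scripts/run_eval.py \
  --config configs/example.yaml \
  --output results/pathfinder_llava.json \
  --verbose
```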
The VPTDataset streams directly from Hugging Face (3D-PC/3D-PC). Install the datasets package (included in requirements.txt), then run the OpenRouter config:
configs/vpt_openrouter.yaml: routes prompts through OpenRouter (preferred for hosted VLMs).
Tweak the config to choose hf_config (depth, vpt-basic, or vpt-strategy), pick a split (train, validation, test, human), or set limit for quick smoke tests. The loader automatically uses the dataset-provided prompt/statement when available; for depth it deterministically alternates between "green closer than red?" and the inverted phrasing, flipping the ground-truth answer accordingly. Images stay in memory as raw bytes, so any model wrapper that accepts image_bytes can benchmark VPT without extra preprocessing.
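As a rough sketch, the dataset block of such a config might look like the following; the exact key names are assumptions, so check configs/vpt_openrouter.yaml for the real schema:

```yaml
# Hypothetical VPT dataset block; key names may differ from the shipped config.
dataset:
  name: vpt
  hf_config: depth      # or vpt-basic / vpt-strategy
  split: validation     # train, validation, test, or human
  limit: 50             # evaluate a small subset for a quick smoke test
```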
Set your OpenRouter credentials and run the config that targets the OpenRouter API (it can list multiple model slugs to benchmark them in series):
export OPENROUTER_API_KEY=sk-your-openrouter-key
python scripts/run_eval.py --config configs/vpt_openrouter.yaml --verbose

configs/vpt_openrouter.yaml lets you swap model_slug (e.g., openai/gpt-4o-mini, google/gemini-1.5-pro), adjust decoding params, and pass headers such as http_referer or x_title if your OpenRouter account requires them. When multiple models are defined, the runner evaluates each back-to-back and writes a separate JSON file under results/ (or your --output path) per model.
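A hedged sketch of what the models section of that config could look like, based on the description above; the field names are assumptions rather than the authoritative schema:

```yaml
# Hypothetical models section; consult configs/vpt_openrouter.yaml for the real keys.
models:
  - name: openrouter
    model_slug: openai/gpt-4o-mini
    params:
      temperature: 0.0
      max_tokens: 256
    headers:
      http_referer: https://your-project.example   # only if your account requires it
      x_title: Visual Reasoning Bench
  - name: openrouter
    model_slug: google/gemini-1.5-pro   # evaluated back-to-back with the model above
```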
Edit configs/example.yaml to customize your evaluation:
dataset:
  name: pathfinder
  data_dir: ./data/pathfinder

model:
  name: llava
  model_path: null
  params:
    temperature: 0.0
    max_tokens: 512

All datasets inherit from BaseDataset and must implement _load_data():
from bench.datasets import BaseDataset

class MyDataset(BaseDataset):
    def _load_data(self):
        self.samples = [
            {
                'id': 'sample_0',
                'image_path': '/path/to/image.png',
                'image_bytes': None,  # Use raw bytes when no local path exists
                'question': 'What do you see?',
                'answer': 'A cat'
            },
            # ... more samples
        ]

All models inherit from BaseModel and must implement predict():
from bench.models import BaseModel

class MyModel(BaseModel):
    def predict(self, image_path: str | None, question: str, image_bytes: bytes | None = None) -> str:
        # Your inference code here; fall back to reading the file when only a path is given
        if image_bytes is None:
            with open(image_path, 'rb') as f:
                image_bytes = f.read()
        prediction = self.model.generate(image_bytes, question)
        return prediction

The evaluator runs a model on a dataset and computes metrics:
from bench.datasets import PathfinderDataset
from bench.models import LLaVAModel
from bench.evaluate import Evaluator

dataset = PathfinderDataset(data_dir='./data/pathfinder')
model = LLaVAModel(model_path='path/to/checkpoint')

evaluator = Evaluator(model=model, dataset=dataset)
results = evaluator.evaluate(verbose=True)
print(f"Accuracy: {results['metrics']['accuracy']:.2%}")

To add a new dataset:

- Create a new file in bench/datasets/
- Inherit from BaseDataset
- Implement the _load_data() method
- Register it in bench/datasets/__init__.py (see the sketch below)
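A minimal sketch of that last registration step, assuming bench/datasets/__init__.py simply re-exports the dataset classes (the actual registration mechanism in this repo may differ, and my_dataset.py is a hypothetical filename):

```python
# bench/datasets/__init__.py (hypothetical contents after registering MyDataset)
from .base import BaseDataset
from .pathfinder import PathfinderDataset
from .my_dataset import MyDataset  # assumes the new file is bench/datasets/my_dataset.py

__all__ = ['BaseDataset', 'PathfinderDataset', 'MyDataset']
```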
To add a new model:

- Create a new file in bench/models/
- Inherit from BaseModel
- Implement the predict() method
- Register it in bench/models/__init__.py (a config snippet follows below)
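Once registered, the new model can presumably be selected by name in a config, mirroring configs/example.yaml; the name mymodel below is hypothetical:

```yaml
# Hypothetical config snippet selecting the newly registered model
model:
  name: mymodel
  model_path: /path/to/your/checkpoint   # or null if the wrapper needs no local weights
  params:
    temperature: 0.0
    max_tokens: 512
```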
Add metric functions to bench/evaluate/metrics.py:
def compute_f1_score(predictions, ground_truth):
    # Your metric implementation
    return f1_score

This is a scaffold implementation designed to be extended. Key areas for enhancement:
- Dataset Loading: Add proper data loading from various formats
- Model Integration: Integrate actual VLM implementations
- Image Processing: Add PIL/OpenCV for real image operations
- Metrics: Add more evaluation metrics (F1, BLEU, etc.)
- Visualization: Add result visualization tools
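As a concrete starting point for the Metrics item above, here is a minimal sketch that could back the compute_f1_score stub shown earlier; it assumes string predictions and ground truths with a single positive label, which is an assumption rather than the repo's actual convention:

```python
def compute_f1_score(predictions, ground_truth, positive_label='yes'):
    """Binary F1 over paired prediction/ground-truth strings (sketch only)."""
    pairs = list(zip(predictions, ground_truth))
    tp = sum(1 for p, g in pairs if p == positive_label and g == positive_label)
    fp = sum(1 for p, g in pairs if p == positive_label and g != positive_label)
    fn = sum(1 for p, g in pairs if p != positive_label and g == positive_label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```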
MIT License (or specify your license)
If you use this benchmark in your research, please cite:
@software{visual_reasoning_bench,
title={Visual Reasoning Bench},
author={Serre Lab},
year={2024},
url={https://github.com/serre-lab/visual-reasoning-bench}
}