Update [Mar 22, 2023]: We updated our arXiv preprint with the camera-ready version and also updated the dataset in this repo to be consistent with the accepted paper. If you used an earlier version of VSR, you can refer to the earlier version of the preprint (v1) and the earlier snapshot of this repo.
Update [Feb 10, 2023]: Check out CLIP_visual-spatial-reasoning by @Sohojoe where you can find CLIP's performance on VSR.
Update [Feb 3, 2023]: Visual Spatial Reasoning is accepted to TACL 🥂! Stay tuned for the camera-ready version!
The Visual Spatial Reasoning (VSR) corpus is a collection of caption-image pairs with true/false labels. Each caption describes the spatial relation between two individual objects in the image, and a vision-language model (VLM) needs to judge whether the caption correctly describes the image (True) or not (False). Below are a few examples.
The cat is behind the laptop. (True) | The cow is ahead of the person. (False) | The cake is at the edge of the dining table. (True) | The horse is left of the person. (False) |
---|---|---|---|
Understanding spatial relations is fundamental to achieving intelligence. Existing vision-language reasoning datasets are great, but they combine multiple types of challenges and can thus conflate different sources of error. The VSR corpus focuses specifically on spatial relations so that we can have accurate diagnosis and maximum interpretability.
Below are baselines' by-relation performances on VSR (random split). More data != better performance. The relations are sorted by frequencies from left to right. The VLMs' by-relation performances have little correlation with relation frequency, meaning that more training data do not necessarily lead to better performance.
Understanding object orientation is hard. After classifying spatial relations into meta-categories, we can clearly see that all models are at chance level for "orientation"-related relations (such as "facing", "facing away from", "parallel to", etc.).
For more findings and takeaways, including zero-shot split performance, check out our paper!
The VSR corpus, after validation, contains 10,972 data points with high agreement. On top of these, we create two splits: (1) a random split and (2) a zero-shot split. For the random split, we randomly split all data points into train, development, and test sets. The zero-shot split makes sure that the train, development, and test sets have no overlap of concepts (i.e., if dog is in the test set, it is not used for training or development). Below are some basic statistics of the two splits.
split | train | dev | test | total |
---|---|---|---|---|
random | 7,680 | 1,097 | 2,195 | 10,972 |
zero-shot | 4,713 | 231 | 616 | 5,560 |
Check out `data/` for more details.
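The zero-shot property described above (no concept shared between train, development, and test) can be checked with a few lines of Python. The following is only a minimal sketch, not the procedure used to build the splits: the record keys `subj` and `obj` are assumed names for the two objects in a caption, and the toy records are invented for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple


def concepts_in(records: Iterable[Dict[str, str]]) -> Set[str]:
    """Collect every object concept mentioned in a set of records.

    ASSUMPTION: each record names its two objects under the hypothetical
    keys "subj" and "obj"; adjust these to the actual field names.
    """
    found: Set[str] = set()
    for record in records:
        found.add(record["subj"])
        found.add(record["obj"])
    return found


def shared_concepts(
    splits: Dict[str, List[Dict[str, str]]]
) -> Dict[Tuple[str, str], Set[str]]:
    """Return the concepts shared by each pair of splits (empty if disjoint)."""
    names = sorted(splits)
    per_split = {name: concepts_in(splits[name]) for name in names}
    overlaps: Dict[Tuple[str, str], Set[str]] = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            overlaps[(first, second)] = per_split[first] & per_split[second]
    return overlaps


# Toy records, invented for illustration only.
toy_splits = {
    "train": [{"subj": "cat", "obj": "laptop"}],
    "dev": [{"subj": "cow", "obj": "person"}],
    "test": [{"subj": "dog", "obj": "bench"}],
}
overlaps = shared_concepts(toy_splits)
is_zero_shot = all(len(shared) == 0 for shared in overlaps.values())
print(is_zero_shot)
```

A non-empty set for any pair of splits would indicate a concept that leaks across them.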
You can also load VSR from huggingface [🤗vsr_random] & [🤗vsr_zeroshot]:
from datasets import load_dataset
data_files = {"train": "train.jsonl", "dev": "dev.jsonl", "test": "test.jsonl"}
dataset = load_dataset("cambridgeltl/vsr_random", data_files=data_files)
Note that the image files still need to be downloaded separately, as described in `data/`.
We test four baselines, all supported in huggingface. They are VisualBERT (Li et al. 2019), LXMERT (Tan and Bansal, 2019), ViLT (Kim et al. 2021), and CLIP (Radford et al. 2021).
model | random split | zero-shot |
---|---|---|
human | 95.4 | 95.4 |
CLIP (frozen) | 56.0 | 54.5 |
CLIP (finetuned)* | 65.1 | - |
VisualBERT | 55.2 | 51.0 |
ViLT | 69.3 | 63.0 |
LXMERT | 70.1 | 61.2 |
*CLIP (finetuned) result is from here.
See the README in the `data/` folder. Images should be saved under `data/images/`.
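Since the images are downloaded separately from the annotations, it can help to confirm that every image an annotation file refers to is present under `data/images/`. The snippet below is a rough sketch under one assumption: each JSONL record stores its image file name under a key called `image`. That key name is a guess, so check it against the actual files before relying on this.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, Iterator, List


def read_jsonl(path: str) -> Iterator[Dict]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def missing_images(records: Iterable[Dict], image_dir: str = "data/images") -> List[str]:
    """List the image file names in `records` that are not found in `image_dir`.

    ASSUMPTION: the file name lives under the hypothetical key "image".
    """
    image_root = Path(image_dir)
    missing = []
    for record in records:
        file_name = record["image"]
        if not (image_root / file_name).is_file():
            missing.append(file_name)
    return missing


# Example call (the path is a placeholder, not a file guaranteed to exist):
# print(missing_images(read_jsonl("train.jsonl")))
```

An empty result means every referenced image was found.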
Depending on your system configuration and CUDA version, you might need two environments: one for feature extraction (i.e., the "Extract visual embeddings" section below) and one for all other experiments. You can install the feature extraction environment by running `feature_extraction/feature_extraction_environment.sh` (specifically, feature extraction requires `detectron2==0.5`, `CUDA==11.1`, and `torch==1.8`). The default configuration for everything else can be found in `requirements.txt`.
For VisualBERT and LXMERT, we first need to extract visual embeddings using pre-trained object detectors. This can be done by running:
bash feature_extraction/lxmert/extract.sh
VisualBERT feature extraction is done similarly by replacing `lxmert` with `visualbert`. The features will be stored under `data/features/` and automatically loaded when running the training and evaluation scripts of LXMERT and VisualBERT. The feature extraction code is modified from huggingface examples here (for VisualBERT) and here (for LXMERT).
`scripts/` contains some example bash scripts for training and evaluation. For example, the following script trains LXMERT on the random split:
bash scripts/lxmert_train.sh 0
where `0` denotes the device index. Configurations such as the checkpoint save path can be modified in the script.
Similarly, evaluating the obtained LXMERT model can be done by running:
bash scripts/lxmert_eval.sh 0
Configurations such as the checkpoint load path can be modified in the script.
In `analysis_scripts/` you can check out how to print by-relation and by-meta-category accuracies.
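To illustrate what by-relation and by-meta-category accuracy mean, here is a small self-contained sketch. It is not the code in `analysis_scripts/`: the function name, the relation-to-meta-category mapping, and the toy predictions are all made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_group(examples: Iterable[Tuple[str, int, int]]) -> Dict[str, float]:
    """Compute accuracy per group from (group, gold_label, predicted_label) triples."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for group, gold, predicted in examples:
        total[group] += 1
        correct[group] += int(gold == predicted)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical mapping from relation to meta-category (illustration only).
toy_meta = {"facing": "orientation", "parallel to": "orientation", "behind": "other"}

# Invented (relation, gold, predicted) triples; 1 = True, 0 = False.
toy_predictions = [
    ("facing", 1, 0),
    ("facing", 0, 0),
    ("parallel to", 1, 1),
    ("behind", 1, 1),
    ("behind", 0, 0),
]

by_relation = accuracy_by_group(toy_predictions)
by_meta = accuracy_by_group(
    (toy_meta[relation], gold, predicted)
    for relation, gold, predicted in toy_predictions
)
print(by_relation)
print(by_meta)
```

Grouping by meta-category simply reuses the same counting after relabeling each example with its category, so rare relations are pooled with related ones.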
If you find VSR useful:
@article{Liu2022VisualSR,
title={Visual Spatial Reasoning},
author={Fangyu Liu and Guy Edward Toh Emerson and Nigel Collier},
journal={Transactions of the Association for Computational Linguistics},
year={2023},
}
This project is licensed under the Apache-2.0 License.