
Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs


Contents:

  1. Getting Started
  2. Benchmark
  3. Evaluation
  4. Training
  5. License
  6. Citation
  7. Acknowledgement

Getting Started

Environment Requirements

# recommended Python environment
conda create --name colva-env python=3.10 -y
conda activate colva-env

# transformers requirements
# When using InternVL2 as the base model, transformers>=4.37.2 works.
# When using Qwen2VL as the base model, we recommend installing the latest version of transformers:
pip install git+https://github.com/huggingface/transformers

# xtuner requirements
pip install -U 'xtuner[deepspeed]'
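As a quick sanity check (this snippet is ours, not part of the repo), you can confirm that the installed transformers version matches the base model you plan to use:

# Sanity-check sketch: confirm the transformers version fits the chosen base model.
from packaging import version  # packaging is typically already present in pip environments
import transformers

base_model = "internvl2"  # or "qwen2vl"
tv = version.parse(transformers.__version__)

if base_model == "internvl2":
    # InternVL2 works with transformers >= 4.37.2
    assert tv >= version.parse("4.37.2"), f"transformers {tv} is too old for InternVL2"
else:
    # Qwen2VL expects a recent build; reinstall from GitHub source if model loading fails
    print(f"transformers {tv}; install from source if Qwen2VL fails to load")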

Pre-trained Models

The CoLVA models (based on InternVL2-4B) can be found here.

Benchmark

MMVM Benchmark

Our MMVM benchmark is available here. It is specifically designed to measure the visual matching ability of multimodal LLMs via VQA. The benchmark is split into a folder containing 1510 cases and a TSV file. Each case folder contains the images and visual-prompt annotation files involved in the corresponding conversation; the TSV file contains the questions, options, and answers.

├── match_bench
│   ├── case_xxx
│   │   ├── FRAME00.jpg
│   │   ├── FRAME00_ORI.jpg
│   │   ├── FRAME00.json
│   │   ├── FRAME01_CAND.jpg
│   │   ├── FRAME01_ORI.jpg
│   │   └── FRAME01_CAND.json
│   └── ...
└── mllm_match_eval_full.tsv
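The layout above can be walked with a few lines of Python. This is an illustrative sketch of ours; the TSV column names are not documented here, so csv.DictReader is used to discover them rather than assume them:

# Sketch: iterate the MMVM case folders and read the QA rows from the TSV.
import csv
import json
from pathlib import Path

root = Path("match_bench")
for case_dir in sorted(root.glob("case_*")):
    frames = sorted(case_dir.glob("FRAME*.jpg"))                           # images per conversation
    annots = [json.loads(p.read_text()) for p in case_dir.glob("*.json")]  # visual-prompt annotations
    print(case_dir.name, len(frames), "images,", len(annots), "annotation files")

with open("mllm_match_eval_full.tsv", newline="") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))                         # questions, options, answers
print(len(rows), "QA rows; columns:", list(rows[0]) if rows else "n/a")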

Evaluation

We build the evaluation tool MMVMEvalKit on top of VLMEvalKit to evaluate MLLMs on our MMVM benchmark. You can find the development version of our evaluation tool here.

Before running evaluation:

  1. Clone our MMVMEvalKit.
  2. Download match_bench.zip and mllm_match_eval_full.tsv from here, put them under the MMVMEvalKit folder, and unzip match_bench.zip.
  3. The environment requirements follow those of VLMEvalKit.
  4. Note: your OpenAI API key must be set in the .env file:
# OpenAI API
OPENAI_API_KEY=
OPENAI_API_BASE=
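Before launching run.py, it can help to confirm the credentials are actually picked up. A minimal check, assuming python-dotenv (which VLMEvalKit-style tooling commonly uses to read .env):

# Sketch: verify the OpenAI credentials from .env are visible to the process.
import os
from dotenv import load_dotenv  # pip install python-dotenv (assumed dependency)

load_dotenv()  # reads .env from the current working directory
for key in ("OPENAI_API_KEY", "OPENAI_API_BASE"):
    assert os.environ.get(key), f"{key} is missing; GPT-based scoring will fail"
print("OpenAI credentials loaded")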

To evaluate an existing MLLM on the MMVM benchmark, e.g., InternVL2-2B, run

python run.py --data MMatch --model InternVL2-2B --verbose

To evaluate CoLVA-InternVL2-4B on the MMVM benchmark, download the pretrained weights from here and run

python run.py --data MMatch --model colva_internvl2_4b --verbose

To evaluate CoLVA-Qwen2VL-2B on the MMVM benchmark, download the pretrained weights from here and run

python run.py --data MMatch --model colva_qwen2vl_2b --verbose

To evaluate CoLVA-Qwen2VL-7B on the MMVM benchmark, download the pretrained weights from here and run

python run.py --data MMatch --model colva_qwen2vl_7b --verbose
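To reproduce all four runs above in one go, a simple driver works. This is a sketch of ours, not a repo script; it just shells out to run.py with the model names listed above:

# Sketch: run the four MMVM evaluations above sequentially.
import subprocess

for model in ("InternVL2-2B", "colva_internvl2_4b", "colva_qwen2vl_2b", "colva_qwen2vl_7b"):
    subprocess.run(
        ["python", "run.py", "--data", "MMatch", "--model", model, "--verbose"],
        check=True,  # stop early if any evaluation fails
    )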

Training

Training Procedure

CoLVA comprises three components: a pre-trained MLLM (e.g., InternVL2 or Qwen2VL), the fine-grained vision expert RADIO, and a RADIO adapter. Training proceeds in two stages: pre-training and supervised fine-tuning (SFT). During the pre-training stage, we freeze the MLLM and RADIO and train only the RADIO adapter. During the SFT stage, we freeze RADIO, the RADIO adapter, and all components of the MLLM except its LLM, which is tuned with LoRA.
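The freezing scheme can be summarized in a short sketch. The module names below (model.radio_adapter, model.mllm.llm) are hypothetical, not the repo's actual attribute paths, and the LoRA wiring is assumed to follow the common convention of naming adapter parameters with a lora_ prefix:

# Illustrative sketch of the two-stage freezing scheme (module names hypothetical).
def configure_trainable(model, stage):
    for p in model.parameters():  # start with everything frozen
        p.requires_grad = False
    if stage == "pretrain":
        # Stage 1: only the RADIO adapter learns; the MLLM and RADIO stay frozen.
        for p in model.radio_adapter.parameters():
            p.requires_grad = True
    elif stage == "sft":
        # Stage 2: RADIO, the adapter, and non-LLM parts stay frozen;
        # the LLM is tuned through its LoRA adapter weights only.
        for name, p in model.mllm.llm.named_parameters():
            if "lora_" in name:
                p.requires_grad = True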

Data Preparation

During the pre-training stage, we sample 500k images with segmentation labels from SA1B.
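A sketch of how such a subset could be drawn; the paths and file layout below are assumptions for illustration, not the repo's actual data tooling:

# Sketch: sample a fixed 500k-image pre-training subset from a local SA1B copy.
import random
from pathlib import Path

sa1b_root = Path("/data/sa1b")               # hypothetical local SA1B location
images = sorted(sa1b_root.glob("**/*.jpg"))  # each SA1B image pairs with a mask JSON
random.seed(0)                               # fixed seed for a reproducible subset
subset = random.sample(images, k=500_000)
Path("pretrain_list.txt").write_text("\n".join(str(p) for p in subset))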

During the fine-tuning stage, we use the LLaVA SFT data, ShareGPT4o, and our MMVM SFT data.

Pretrain

For the pre-training stage of CoLVA-InternVL2-4B, run with 1 node and 8 ranks:

bash tools/dist.sh train projects/colva/pretrain/internvl2_phi3_4b_radio_align_pretrain.py 8 

or run with DeepSpeed:

bash tools/dist.sh train projects/colva/pretrain/internvl2_phi3_4b_radio_align_pretrain.py 8 --deepspeed deepspeed_zero3

Fine-tune

For the fine-tuning stage of CoLVA-InternVL2-4B, run:

bash tools/dist.sh train projects/colva/finetune/internvl2_phi3_4b_radio_match_sft.py 8

Note: you need to set radio_adapter_weight to the path of the pre-trained RADIO adapter weights in projects/colva/finetune/internvl2_phi3_4b_radio_match_sft.py. For example:

radio_adapter_weight = "./work_dirs/internvl2_phi3_4b_radio_align_pretrain/iter_11792.pth"

License

This project is under the MIT license. See LICENSE for details.

Citation

Please consider citing our paper if you find this project helpful for your research:

@misc{zhou2025sameexploringvisualcorrespondence,
      title={Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs}, 
      author={Yikang Zhou and Tao Zhang and Shilin Xu and Shihao Chen and Qianyu Zhou and Yunhai Tong and Shunping Ji and Jiangning Zhang and Xiangtai Li and Lu Qi},
      year={2025},
      eprint={2501.04670},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.04670}, 
}

Acknowledgement
