# python recommendation
conda create --name colva-env python=3.10 -y
conda activate colva-env
# transformers requirements
# When using InternVL2 as the base model, transformers>=4.37.2 works normally.
# When using Qwen2VL as the base model, we recommend installing the latest version of transformers:
pip install git+https://github.com/huggingface/transformers
# xtuner requirements
pip install -U 'xtuner[deepspeed]'
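As a quick sanity check of the environment above, a short script like the following (an illustrative sketch, not part of the repository) confirms that the key packages import and prints their versions:
# check_env.py -- illustrative sanity check for the environment above
import importlib

for pkg in ("torch", "transformers", "xtuner", "deepspeed"):
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg}: {getattr(mod, '__version__', 'unknown version')}")
    except ImportError as err:
        print(f"{pkg}: not installed ({err})")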
The CoLVA models (based on InternVL2-4B) can be found here.
Our MMVM benchmark is available here. It is specifically designed to measure multimodal LLMs' visual matching ability via VQA. The benchmark consists of a folder containing 1510 cases and a TSV file. Each case folder contains the images and the visual-prompt annotation files involved in the corresponding conversation, and the TSV file contains the questions, options, and answers. The directory layout is shown below, followed by a short loading sketch.
├── match_bench
│   ├── case_xxx
│   │   ├── FRAME00.jpg
│   │   ├── FRAME00_ORI.jpg
│   │   ├── FRAME00.json
│   │   ├── FRAME01_CAND.jpg
│   │   ├── FRAME01_ORI.jpg
│   │   └── FRAME01_CAND.json
│   └── ...
└── mllm_match_eval_full.tsv
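A case folder and the TSV can be inspected with a short script like this (an illustrative sketch; the exact TSV column names are not guaranteed and should be checked against the file):
import csv
import json
from pathlib import Path

root = Path("match_bench")

# List the files of one case folder (images plus visual-prompt annotation JSON).
case = next(root.glob("case_*"))
for f in sorted(case.iterdir()):
    print(f.name)

# Load one visual-prompt annotation file.
ann = json.loads((case / "FRAME00.json").read_text())
print(type(ann))

# Read the TSV of questions, options, and answers.
with open("mllm_match_eval_full.tsv", newline="") as fh:
    reader = csv.DictReader(fh, delimiter="\t")
    first_row = next(reader)
    print(list(first_row.keys()))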
We built the evaluation tool MMVMEvalKit, based on VLMEvalKit, to evaluate MLLMs on our MMVM benchmark. You can find the development version of the evaluation tool here.
Before running the evaluation:
- Clone our MMVMEvalKit.
- Download match_bench.zip and mllm_match_eval_full.tsv from here, put them under the MMVMEvalKit folder, and unzip match_bench.zip.
- Environment requirements follow those of VLMEvalKit.
- Note: your OpenAI API key should be set in the .env file, as shown below (a small check script follows the example):
# OpenAI API
OPENAI_API_KEY=
OPENAI_API_BASE=
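MMVMEvalKit follows VLMEvalKit in reading these keys from the .env file. A minimal way to confirm the keys are visible (a sketch assuming python-dotenv is installed):
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads the .env file in the current directory
for key in ("OPENAI_API_KEY", "OPENAI_API_BASE"):
    print(key, "is set" if os.getenv(key) else "is MISSING")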
To evaluate an existing MLLM on the MMVM benchmark, e.g. InternVL2-2B, run:
python run.py --data MMatch --model InternVL2-2B --verbose
To evaluate CoLVA-InternVL2-4B on the MMVM benchmark, download the pretrained weights from here and run:
python run.py --data MMatch --model colva_internvl2_4b --verbose
To evaluate CoLVA-Qwen2VL-2B on the MMVM benchmark, download the pretrained weights from here and run:
python run.py --data MMatch --model colva_qwen2vl_2b --verbose
To evaluate CoLVA-Qwen2VL-7B on the MMVM benchmark, download the pretrained weights from here and run:
python run.py --data MMatch --model colva_qwen2vl_7b --verbose
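To run several of these evaluations back to back, a small wrapper (a hypothetical convenience script, not part of MMVMEvalKit) could look like:
import subprocess

# Model names as registered in MMVMEvalKit, matching the commands above.
models = [
    "InternVL2-2B",
    "colva_internvl2_4b",
    "colva_qwen2vl_2b",
    "colva_qwen2vl_7b",
]

for model in models:
    cmd = ["python", "run.py", "--data", "MMatch", "--model", model, "--verbose"]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)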
CoLVA comprises three components: a pre-trained MLLM (e.g. InternVL2 or Qwen2VL), a fine-grained vision expert, RADIO, and a RADIO adapter. Training proceeds in two stages: pre-training and supervised fine-tuning (SFT). During the pre-training stage, we freeze the MLLM and RADIO and train only the RADIO adapter. During the SFT stage, we freeze RADIO, the RADIO adapter, and all components of the MLLM (e.g. InternVL2-4B) except its LLM, which is tuned with LoRA.
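The freezing scheme of the two stages can be summarized with a PyTorch-style sketch (illustrative only; the attribute names mllm, radio, and radio_adapter are placeholders, not the actual module names in this repository):
import torch.nn as nn

class CoLVASketch(nn.Module):
    """Stand-in for CoLVA: each Linear is a placeholder for a real component."""
    def __init__(self):
        super().__init__()
        self.mllm = nn.Linear(8, 8)           # placeholder for InternVL2 / Qwen2VL
        self.radio = nn.Linear(8, 8)          # placeholder for the RADIO vision expert
        self.radio_adapter = nn.Linear(8, 8)  # placeholder for the RADIO adapter

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

model = CoLVASketch()

# Stage 1 (pre-training): freeze the MLLM and RADIO, train only the RADIO adapter.
set_trainable(model.mllm, False)
set_trainable(model.radio, False)
set_trainable(model.radio_adapter, True)

# Stage 2 (SFT): also freeze the RADIO adapter and the MLLM; LoRA adapters
# injected into the MLLM's LLM would be the only trainable weights at this point.
set_trainable(model.radio_adapter, False)
set_trainable(model.mllm, False)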
During the pre-training stage, we sample 500k images with segmentation labels from SA1B.
During the fine-tuning stage, we use the LLaVA SFT data, ShareGPT4o, and our MMVM SFT data.
For the pre-training stage of CoLVA-InternVL2-4B, run with 1 node and 8 ranks:
bash tools/dist.sh train projects/colva/pretrain/internvl2_phi3_4b_radio_align_pretrain.py 8
or run with deepspeed:
bash tools/dist.sh train projects/colva/pretrain/internvl2_phi3_4b_radio_align_pretrain.py 8 --deepspeed deepspeed_zero3
For the fine-tuning stage of CoLVA-InternVL2-4B, run:
bash tools/dist.sh train projects/colva/finetune/internvl2_phi3_4b_radio_match_sft.py 8
Note: you need to set radio_adapter_weight to the path of the pretrained RADIO adapter weights in the projects/colva/finetune/internvl2_phi3_4b_radio_match_sft.py file. For example:
radio_adapter_weight = "./work_dirs/internvl2_phi3_4b_radio_align_pretrain/iter_11792.pth"
This project is under the MIT license. See LICENSE for details.
Please consider citing our paper if you find this project helpful for your research:
@misc{zhou2025sameexploringvisualcorrespondence,
title={Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs},
author={Yikang Zhou and Tao Zhang and Shilin Xu and Shihao Chen and Qianyu Zhou and Yunhai Tong and Shunping Ji and Jiangning Zhang and Xiangtai Li and Lu Qi},
year={2025},
eprint={2501.04670},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2501.04670},
}